June 22, 2010

Prediction of protein–RNA binding sites by a random forest method with combined features

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:08 am

Protein–RNA interactions play a key role in a number of biological processes, such as protein synthesis, mRNA processing, mRNA assembly, ribosome function and eukaryotic spliceosomes. As a result, a reliable identification of RNA binding site of a protein is important for functional annotation and site-directed mutagenesis. Accumulated data of experimental protein–RNA interactions reveal that a RNA binding residue with different neighbor amino acids often exhibits different preferences for its RNA partners, which in turn can be assessed by the interacting interdependence of the amino acid fragment and RNA nucleotide.

Results: In this work, we propose a novel classification method to identify the RNA binding sites in proteins by combining a new interacting feature (interaction propensity) with other sequence- and structure-based features. Specifically, the interaction propensity represents a binding specificity of a protein residue to the interacting RNA nucleotide by considering its two-side neighborhood in a protein residue triplet. The sequence as well as the structure-based features of the residues are combined together to discriminate the interaction propensity of amino acids with RNA. We predict RNA interacting residues in proteins by implementing a well-built random forest classifier. The experiments show that our method is able to detect the annotated protein–RNA interaction sites in a high accuracy. Our method achieves an accuracy of 84.5%, F-measure of 0.85 and AUC of 0.92 prediction of the RNA binding residues for a dataset containing 205 non-homologous RNA binding proteins, and also outperforms several existing RNA binding residue predictors, such as RNABindR, BindN, RNAProB and PPRint, and some alternative machine learning methods, such as support vector machine, naive Bayes and neural network in the comparison study. Furthermore, we provide some biological insights into the roles of sequences and structures in protein–RNA interactions by both evaluating the importance of features for their contributions in predictive accuracy and analyzing the binding patterns of interacting residues.

Availability: All the source data and code are available at or