June 30, 2010

Discover regulatory DNA elements using chromatin signatures and artificial neural network

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 7:17 am

Recent large-scale chromatin states mapping efforts have revealed characteristic chromatin modification signatures for various types of functional DNA elements. Given the important influence of chromatin states on gene regulation and the rapid accumulation of genome-wide chromatin modification data, there is a pressing need for computational methods to analyze these data in order to identify functional DNA elements. However, existing computational tools do not exploit data transformation and feature extraction as a means to achieve a more accurate prediction.

Results: We introduce a new computational framework for identifying functional DNA elements using chromatin signatures. The framework consists of a data transformation step and a feature extraction step, followed by a classification step using a time-delay neural network. We implemented our framework in a software tool, CSI-ANN (chromatin signature identification by artificial neural network). When applied to predict transcriptional enhancers in the ENCODE region, CSI-ANN achieved a 65.5% sensitivity and 66.3% positive predictive value, a 5.9% and 11.6% improvement, respectively, over the previously best approach.

Availability and Implementation: CSI-ANN is implemented in Matlab. The source code is freely available at

June 28, 2010

Automated analysis of protein subcellular location in time series images

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:03 am

Image analysis, machine learning and statistical modeling have become well established for the automatic recognition and comparison of the subcellular locations of proteins in microscope images. By using a comprehensive set of features describing static images, major subcellular patterns can be distinguished with near perfect accuracy. We now extend this work to time series images, which contain both spatial and temporal information. The goal is to use temporal features to improve recognition of protein patterns that are not fully distinguishable by their static features alone.

Results: We have adopted and designed five sets of features for capturing temporal behavior in 2D time series images, based on object tracking, temporal texture, normal flow, Fourier transforms and autoregression. Classification accuracy on an image collection for 12 fluorescently tagged proteins was increased when temporal features were used in addition to static features. Temporal texture, normal flow and Fourier transform features were most effective at increasing classification accuracy. We therefore extended these three feature sets to 3D time series images, but observed no significant improvement over results for 2D images. The methods for 2D and 3D temporal pattern analysis do not require segmentation of images into single cell regions, and are suitable for automated high-throughput microscopy applications.

Availability: Images, source code and results will be available upon publication at

June 26, 2010

Learning combinatorial transcriptional dynamics from gene expression data

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:49 am

mRNA transcriptional dynamics is governed by a complex network of transcription factor (TF) proteins. Experimental and theoretical analysis of this process is hindered by the fact that measuring TF activity in vivo is very challenging. Current models that jointly infer TF activities and model parameters rely on one of two main simplifying assumptions: either the dynamics is simplified (e.g. assuming quasi-steady state) or the interactions between TFs are ignored, resulting in models accounting for a single TF.

Results: We present a novel approach to reverse engineer the dynamics of multiple TFs jointly regulating the expression of a set of genes. The model relies on a continuous-time, differential equation description of transcriptional dynamics where TFs are treated as latent on/off variables and are modelled using a switching stochastic process (telegraph process). The model not only incorporates both activation and repression, but also allows any non-trivial interaction between TFs, including AND and OR gates. By using a factorization assumption within a variational Bayesian treatment, we formulate a framework that can reconstruct both the activity profiles of the TFs and the type of regulation from time series gene expression data. We demonstrate the identifiability of the model on a simple but non-trivial synthetic example, and then use it to formulate non-trivial predictions about transcriptional control during yeast metabolism.
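
To make the forward model concrete, here is a minimal simulation sketch of two on/off TFs driving one gene through an AND or an OR gate. It only generates data; it does not perform the variational Bayesian inference described above. The linear form dx/dt = basal + sensitivity * gate - decay * x, the parameter names and the Euler discretisation are all assumptions chosen for illustration, not the paper's exact formulation.

```python
import random


def telegraph_path(n_steps, dt, rate_on, rate_off, rng):
    """Sample an on/off (0/1) path; the switching probability per step is rate * dt."""
    state, path = 0, []
    for _ in range(n_steps):
        rate = rate_on if state == 0 else rate_off
        if rng.random() < rate * dt:
            state = 1 - state
        path.append(state)
    return path


def and_gate(m1, m2):
    return m1 & m2


def or_gate(m1, m2):
    return m1 | m2


def simulate_expression(tf1, tf2, gate, dt, basal, sensitivity, decay):
    """Euler integration of dx/dt = basal + sensitivity * gate(tf1, tf2) - decay * x."""
    x = basal / decay  # start at the steady state with the promoter off
    trace = []
    for m1, m2 in zip(tf1, tf2):
        x += dt * (basal + sensitivity * gate(m1, m2) - decay * x)
        trace.append(x)
    return trace


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    tf1 = telegraph_path(1000, 0.01, rate_on=1.0, rate_off=1.0, rng=rng)
    tf2 = telegraph_path(1000, 0.01, rate_on=1.0, rate_off=1.0, rng=rng)
    mrna_and = simulate_expression(tf1, tf2, and_gate, 0.01, 0.1, 2.0, 0.5)
    mrna_or = simulate_expression(tf1, tf2, or_gate, 0.01, 0.1, 2.0, 0.5)
    print(max(mrna_and), max(mrna_or))
```

With the same TF paths, the OR-gated gene is never expressed below the AND-gated one, which is the kind of difference in expression profiles that lets the type of regulation be inferred from time series data.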


June 25, 2010

Classification of DNA sequences using Bloom filters

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 4:47 am

New generation sequencing technologies producing increasingly complex datasets demand new efficient and specialized sequence analysis algorithms. Often, it is only the ‘novel’ sequences in a complex dataset that are of interest and the superfluous sequences need to be removed.

Results: A novel algorithm, fast and accurate classification of sequences (FACS), is introduced that can accurately and rapidly classify sequences as belonging or not belonging to a reference sequence. FACS was first optimized and validated using a synthetic metagenome dataset. An experimental metagenome dataset was then used to show that FACS achieves accuracy comparable to that of BLAT and SSAHA2 but is at least 21 times faster in classifying sequences.
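
The general idea of Bloom filter based classification can be sketched as follows. This is not the FACS implementation (which the abstract ties to the Bloom::Faster Perl module). The k-mer size, the filter dimensions, the SHA-256 double hashing and the `min_fraction` threshold are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings, backed by a bytearray used as a bit set."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_reference_filter(reference, k, num_bits=1 << 16, num_hashes=4):
    """Insert all k-mers of the reference sequence into a new filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for kmer in kmers(reference, k):
        bf.add(kmer)
    return bf


def matches_reference(read, bf, k, min_fraction=0.8):
    """Call a read a match if enough of its k-mers are (probably) in the filter."""
    total = len(read) - k + 1
    if total <= 0:
        return False
    hits = sum(1 for kmer in kmers(read, k) if kmer in bf)
    return hits / total >= min_fraction
```

A Bloom filter never reports a false negative, so a read copied exactly from the reference always matches. A novel read can only be misclassified through false positives, whose rate is governed by the filter size and the number of hash functions.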

Availability: Source code for FACS, Bloom filters and MetaSim dataset used is available at The Bloom::Faster 1.6 Perl module can be downloaded from CPAN at

June 23, 2010

Pathgroups, a dynamic data structure for genome reconstruction problems

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:56 am

Ancestral gene order reconstruction problems, including the median problem, quartet construction, small phylogeny, guided genome halving and genome aliquoting, are NP-hard. Available heuristics dedicated to each of these problems are computationally costly even for small instances.

Results: We present a data structure enabling rapid heuristic solution to all these ancestral genome reconstruction problems. A generic greedy algorithm with look-ahead based on an automatically generated priority system suffices for all the problems using this data structure. The efficiency of the algorithm is due to fast updating of the structure during run time and to the simplicity of the priority scheme. We illustrate with the first rapid algorithm for quartet construction and apply this to a set of yeast genomes to corroborate a recent gene sequence-based phylogeny.


June 22, 2010

Prediction of protein–RNA binding sites by a random forest method with combined features

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:08 am

Protein–RNA interactions play a key role in a number of biological processes, such as protein synthesis, mRNA processing, mRNA assembly, ribosome function and eukaryotic spliceosomes. As a result, reliable identification of the RNA binding sites of a protein is important for functional annotation and site-directed mutagenesis. Accumulated data on experimental protein–RNA interactions reveal that an RNA binding residue with different neighbor amino acids often exhibits different preferences for its RNA partners, which in turn can be assessed by the interacting interdependence of the amino acid fragment and RNA nucleotide.

Results: In this work, we propose a novel classification method to identify the RNA binding sites in proteins by combining a new interacting feature (interaction propensity) with other sequence- and structure-based features. Specifically, the interaction propensity represents the binding specificity of a protein residue to the interacting RNA nucleotide by considering its two-side neighborhood in a protein residue triplet. The sequence- and structure-based features of the residues are combined to discriminate the interaction propensity of amino acids with RNA. We predict RNA interacting residues in proteins by implementing a well-built random forest classifier. The experiments show that our method is able to detect the annotated protein–RNA interaction sites with high accuracy. Our method achieves an accuracy of 84.5%, an F-measure of 0.85 and an AUC of 0.92 in predicting the RNA binding residues for a dataset containing 205 non-homologous RNA binding proteins. In the comparison study, it also outperforms several existing RNA binding residue predictors, such as RNABindR, BindN, RNAProB and PPRint, and some alternative machine learning methods, such as support vector machine, naive Bayes and neural network. Furthermore, we provide some biological insights into the roles of sequences and structures in protein–RNA interactions by both evaluating the importance of features for their contributions to predictive accuracy and analyzing the binding patterns of interacting residues.

Availability: All the source data and code are available at or

June 21, 2010

Repitools: an R package for the analysis of enrichment-based epigenomic data

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:30 am

Epigenetics, the study of heritable somatic phenotypic changes not related to DNA sequence, has emerged as a critical component of the landscape of gene regulation. The epigenetic layers, such as DNA methylation, histone modifications and nuclear architecture, are now being extensively studied in many cell types and disease settings. Few software tools exist to summarize and interpret these datasets. We have created a toolbox of procedures to interrogate and visualize epigenomic data (both array- and sequencing-based) and make available a software package for the cross-platform R language.

Availability: The package is freely available under LGPL from the R-Forge web site (

June 19, 2010

Modeling RNA loops using sequence homology and geometric constraints

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am

RNA loop regions are essential structural elements of RNA molecules influencing both their structural and functional properties. We developed RLooM, a web application for homology-based modeling of RNA loops utilizing template structures extracted from the PDB. RLooM allows the insertion and replacement of loop structures of a desired sequence into an existing RNA structure. Furthermore, a comprehensive database of loops in RNA structures can be accessed through the web interface.

Availability and Implementation: The application was implemented in Python, MySQL and Apache. A web interface to the database and loop modeling application is freely available at

June 18, 2010

GPU computing for systems biology

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 4:40 am

The development of detailed, coherent models of complex biological systems is recognized as a key requirement for integrating the increasing amount of experimental data. In addition, in-silico simulation of bio-chemical models provides an easy way to test different experimental conditions, helping in the discovery of the dynamics that regulate biological systems. However, the computational power required by these simulations often exceeds that available on common desktop computers, and thus expensive high performance computing solutions are required. An emerging alternative is general-purpose scientific computing on graphics processing units (GPGPU), which offers the power of a small computer cluster at a cost of $400. Computing with a GPU requires the development of specific algorithms, since the programming paradigm differs substantially from traditional CPU-based computing. In this paper, we review some recent efforts in exploiting the processing power of GPUs for the simulation of biological systems.

June 17, 2010

TEAM: efficient two-locus epistasis tests in human genome-wide association study

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am

As a promising tool for identifying genetic markers underlying phenotypic differences, genome-wide association study (GWAS) has been extensively investigated in recent years. In GWAS, detecting epistasis (or gene–gene interaction) is preferable to single-locus study since many diseases are known to be complex traits. A brute force search is infeasible for epistasis detection at the genome-wide scale because of the intensive computational burden. Existing epistasis detection algorithms are designed for datasets consisting of homozygous markers and small sample sizes. In human studies, however, the genotype may be heterozygous, and the number of individuals can be up to thousands. Thus, existing methods are not readily applicable to human datasets. In this article, we propose an efficient algorithm, TEAM, which significantly speeds up epistasis detection for human GWAS. Our algorithm is exhaustive, i.e. it does not ignore any epistatic interaction. Utilizing the minimum spanning tree structure, the algorithm incrementally updates the contingency tables for epistatic tests without scanning all individuals. Our algorithm has broader applicability and is more efficient than existing methods for large-sample studies. It supports any statistical test that is based on contingency tables, and enables control of both the family-wise error rate and the false discovery rate. Extensive experiments show that our algorithm only needs to examine a small portion of the individuals to update the contingency tables, and it achieves at least an order of magnitude speed-up over the brute force approach.
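
The incremental update idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not the TEAM algorithm itself. Genotypes are coded 0/1/2 and the phenotype 0/1. The table for the SNP pair (A, C) is derived from the table for (A, B) by touching only the individuals whose genotypes differ between B and C. The minimum spanning tree that decides which SNP to derive from which is not built here, and all function names are hypothetical.

```python
from collections import Counter


def build_table(snp_a, snp_b, phenotype):
    """Full scan: count individuals per (genotype_a, genotype_b, phenotype) cell."""
    return Counter(zip(snp_a, snp_b, phenotype))


def differing_individuals(snp_b, snp_c):
    """Indices of individuals whose genotype differs between two SNPs."""
    return [i for i, (b, c) in enumerate(zip(snp_b, snp_c)) if b != c]


def update_table(table, snp_a, snp_b, snp_c, phenotype, diff):
    """Turn the (A, B) table into the (A, C) table by moving only individuals in diff."""
    updated = Counter(table)
    for i in diff:
        updated[(snp_a[i], snp_b[i], phenotype[i])] -= 1
        updated[(snp_a[i], snp_c[i], phenotype[i])] += 1
    # Unary plus drops empty cells, so the result equals a freshly built table.
    return +updated


if __name__ == "__main__":
    snp_a = [0, 1, 2, 1, 0, 2]
    snp_b = [0, 0, 1, 2, 1, 2]
    snp_c = [0, 1, 1, 2, 1, 0]
    phenotype = [0, 1, 0, 1, 1, 0]
    diff = differing_individuals(snp_b, snp_c)  # [1, 5]
    table_ab = build_table(snp_a, snp_b, phenotype)
    table_ac = update_table(table_ab, snp_a, snp_b, snp_c, phenotype, diff)
    print(table_ac == build_table(snp_a, snp_c, phenotype))  # True
```

The cost of one update is proportional to the number of differing individuals rather than to the sample size, which is why ordering the updates along a tree with small total difference can save work.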
