May 31, 2010

Arcadia: a visualization tool for metabolic pathways

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am
Arcadia translates text-based descriptions of biological networks (SBML files) into standardized diagrams (SBGN PD maps). Users can view the same model from different perspectives and easily alter the layout to emulate traditional textbook representations.

Availability and Implementation: Arcadia is written in C++. The source code is available (along with Mac OS and Windows binaries) under the GPL from


May 30, 2010

ParaSAM: a parallelized version of the significance analysis of microarrays algorithm

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am
Significance analysis of microarrays (SAM) is a widely used permutation-based approach to identifying differentially expressed genes in microarray datasets. While SAM is freely available as an Excel plug-in and as an R-package, analyses are often limited for large datasets due to very high memory requirements.

Summary: We have developed a parallelized version of the SAM algorithm called ParaSAM to overcome the memory limitations. This high performance multithreaded application provides the scientific community with an easy and manageable client-server Windows application with graphical user interface and does not require programming experience to run. The parallel nature of the application comes from the use of web services to perform the permutations. Our results indicate that ParaSAM is not only faster than the serial version, but also can analyze extremely large datasets that cannot be performed using existing implementations.
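The permutation step lends itself to parallel execution because each batch of label shuffles is independent of the others. The sketch below illustrates that idea for a single gene; it is not ParaSAM's code. The function names, the `s0` value and the batch sizes are assumptions made for illustration, a local thread pool stands in for the web services mentioned above, and the result is reduced to a per-gene permutation p-value rather than SAM's false discovery rate estimate across genes.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def d_statistic(values, labels, s0=0.1):
    """SAM-style relative difference: mean gap divided by (pooled SE + s0)."""
    a = [v for v, g in zip(values, labels) if g == 0]
    b = [v for v, g in zip(values, labels) if g == 1]
    n1, n2 = len(a), len(b)
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    pooled_var = (
        sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    ) / (n1 + n2 - 2)
    se = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (mean_b - mean_a) / (se + s0)


def permutation_batch(values, labels, n_perm, seed):
    """One independent unit of work: n_perm label shuffles with its own seed."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(d_statistic(values, shuffled))
    return null


def permutation_p_value(values, labels, n_batches=4, perms_per_batch=250):
    """Dispatch the batches to workers, then pool the null statistics."""
    observed = d_statistic(values, labels)
    with ThreadPoolExecutor(max_workers=n_batches) as pool:
        batches = pool.map(
            lambda seed: permutation_batch(values, labels, perms_per_batch, seed),
            range(n_batches),
        )
        null = [d for batch in batches for d in batch]
    extreme = sum(abs(d) >= abs(observed) for d in null)
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (len(null) + 1)
```

Because every batch needs only the data and a seed, the same batches could be sent to separate processes or remote services, which also keeps each worker's memory footprint small. Threads are used here only to show the structure; in CPython they would not speed up this CPU-bound loop.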

Availability: A web version open to the public is available at For local installations, both the Windows and web implementations of ParaSAM are available for free at

May 29, 2010

DistanceScan: a tool for promoter modeling

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am
The state of the art in promoter modeling for higher eukaryotes is predicting not single transcription factor binding sites (TFBSs), but their combinations. The new tool utilizes a previously developed method of distance distributions of TFBS pairs. We model the random distribution of distances and compare it with the distribution observed in the query sequences. Comparison of the profiles allows filtering out the ‘noise’ and retaining the potentially functional combinations. This approach has proved its usefulness as a filtering technique for the selection of TFBS pairs for promoter modeling and is now implemented as a tool in R. As an input, it can use the outputs of three different TFBS- and motif-predictive tools (Gibbs Sampler for motifs, Match™ and MEME/FIMO for PWM-based search). The output is a list of predicted pairs at overrepresented distances with assigned scores, P-values and plots showing the distribution of pairs in the input sequences.
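The comparison between observed and random distance distributions can be sketched as follows. This is a Python illustration of the general idea only, not the R tool: the function names are hypothetical, and placing sites uniformly at random along each sequence is an assumed null model that may differ from the one the authors use.

```python
import random
from collections import Counter


def pair_distances(sites_a, sites_b, max_dist):
    """Count |a - b| distances (in bp) between A and B sites, sequence by sequence."""
    counts = Counter()
    for a_positions, b_positions in zip(sites_a, sites_b):
        for a in a_positions:
            for b in b_positions:
                d = abs(a - b)
                if 0 < d <= max_dist:
                    counts[d] += 1
    return counts


def overrepresented_distances(sites_a, sites_b, seq_len, max_dist=50,
                              n_sim=200, alpha=0.05, seed=0):
    """Return {distance: empirical p-value} for distances seen more often than chance."""
    rng = random.Random(seed)
    observed = pair_distances(sites_a, sites_b, max_dist)
    exceed = Counter()
    for _ in range(n_sim):
        # Assumed null model: same number of sites per sequence, uniform positions.
        rand_a = [[rng.randrange(seq_len) for _ in s] for s in sites_a]
        rand_b = [[rng.randrange(seq_len) for _ in s] for s in sites_b]
        simulated = pair_distances(rand_a, rand_b, max_dist)
        for d, n_obs in observed.items():
            if simulated[d] >= n_obs:
                exceed[d] += 1
    p_values = {d: (exceed[d] + 1) / (n_sim + 1) for d in observed}
    return {d: p for d, p in p_values.items() if p <= alpha}
```

Here `sites_a` and `sites_b` are lists with one entry per sequence, each entry holding the predicted positions of one factor's sites in that sequence. Distances that survive the filter are candidates for functional pairs; everything else is treated as noise.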

Availability: The tool is available at

May 28, 2010

Identifying duplicate content using statistically improbable phrases

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am
Document similarity metrics such as PubMed’s ‘Find related articles’ feature, which have been primarily used to identify studies with similar topics, can now also be used to detect duplicated or potentially plagiarized papers within literature reference databases. However, the CPU-intensive nature of document comparison has limited MEDLINE text similarity studies to the comparison of abstracts, which constitute only a small fraction of a publication’s total text. Extending searches to include text archived by online search engines would drastically increase comparison ability. For large-scale studies, submitting short phrases encased in direct quotes to search engines for exact matches would be optimal for both individual queries and programmatic interfaces. We have derived a method of analyzing statistically improbable phrases (SIPs) for assistance in identifying duplicate content.
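One way to picture the phrase-selection step is to rank a document's word n-grams by how rare they are in a background collection and then quote the rarest ones for exact-match searching. The sketch below does this with a smoothed inverse document frequency; the scoring formula, the trigram length and all function names are assumptions for illustration and should not be read as the authors' SIP definition.

```python
import math
import re
from collections import Counter


def ngrams(text, n=3):
    """Lowercase word n-grams of a text, in order of appearance."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(document, background_docs, n=3, top_k=3):
    """Rank the document's n-grams by rarity in the background collection."""
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(ngrams(doc, n)))
    total = len(background_docs)
    scores = {}
    for phrase, count in Counter(ngrams(document, n)).items():
        # Smoothed inverse document frequency: unseen phrases score highest.
        idf = math.log((total + 1) / (doc_freq[phrase] + 1))
        scores[phrase] = count * idf
    # Sort by descending score, breaking ties alphabetically for reproducibility.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_k]


def as_exact_queries(phrases):
    """Wrap each phrase in double quotes, as an exact-match search query."""
    return [f'"{p}"' for p in phrases]
```

Boilerplate such as "the results of" appears in most background documents and scores near zero, while phrases specific to one paper rise to the top and make precise search-engine queries.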

Results: When applied to MEDLINE citations, this method substantially improves upon previous algorithms in the detection of duplicate citations, yielding a precision and recall of 78.9% (versus 50.3% for eTBLAST) and 99.6% (versus 99.8% for eTBLAST), respectively.

Availability: Similar citations identified by this work are freely accessible in the Déjà vu database, under the SIP discovery method category at

May 26, 2010

HIV classification using the coalescent theory

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:00 pm
Existing coalescent models and phylogenetic tools based on them are not designed for studying the genealogy of sequences like those of HIV, since in HIV, recombinants with multiple cross-over points between the parental strains frequently arise. Hence, ambiguous cases in the classification of HIV sequences into subtypes and circulating recombinant forms (CRFs) have been treated with ad hoc methods for lack of tools based on a comprehensive coalescent model accounting for complex recombination patterns.

Results: We developed the program ARGUS that scores classifications of sequences into subtypes and recombinant forms. It reconstructs ancestral recombination graphs (ARGs) that reflect the genealogy of the input sequences given a classification hypothesis. An ARG with maximal probability is approximated using a Markov chain Monte Carlo approach. ARGUS was able to distinguish the correct classification with a low error rate from plausible alternative classifications in simulation studies with realistic parameters. We applied our algorithm to decide between two recently debated alternatives in the classification of CRF02 of HIV-1 and find that CRF02 is indeed a recombinant of Subtypes A and G.

Availability: ARGUS is implemented in C++ and the source code is available at

May 21, 2010

ACCUSA—accurate SNP calling on draft genomes

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:47 am
Next generation sequencing technologies facilitate genome-wide analysis of several biological processes. We are interested in whole-genome genotyping. To our knowledge, none of the existing single nucleotide polymorphism (SNP) callers consider the quality of the reference genome, which is not necessary for high-quality assemblies of well-studied model organisms. However, most genome projects will remain in draft status with little to no genome assembly improvement due to time and financial constraints. Here, we present a simple yet elegant solution (‘ACCUSA’) that considers both the read qualities and the reference genome’s quality using a Bayesian framework. We demonstrate that ACCUSA is as good as the current SNP calling software in detecting true SNPs. More importantly, ACCUSA does not call spurious SNPs, which originate from a poor reference sequence.

Availability: ACCUSA is available free of charge to academic users and may be obtained from ACCUSA is programmed in Java 6 and runs on any platform with Java support.

May 20, 2010

partDSA: deletion/substitution/addition algorithm for partitioning the covariate space in prediction

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:30 am
Until now, much of the focus in cancer has been on biomarker discovery and generating lists of univariately significant genes, as well as epidemiological and clinical measures. These approaches, although significant on their own, are not effective for elucidating the synergistic qualities of the numerous components in complex diseases. These components do not act one at a time, but rather in concert with numerous others. A compelling need exists to develop analytically sound and computationally advanced methods that elucidate a more biologically meaningful understanding of the mechanisms of cancer initiation and progression by taking these interactions into account.

Results: We propose a novel algorithm, partDSA, for prediction when several variables jointly affect the outcome. In such settings, piecewise constant estimation provides an intuitive approach by elucidating interactions and correlation patterns in addition to main effects. As well as generating ‘and’ statements similar to previously described methods, partDSA explores and chooses the best among all possible ‘or’ statements. The immediate benefit of partDSA is the ability to build a parsimonious model with ‘and’ and ‘or’ conjunctions that account for the observed biological phenomena. Importantly, partDSA is capable of handling categorical and continuous explanatory variables and outcomes. We evaluate the effectiveness of partDSA in comparison to several adaptive algorithms in simulations; additionally, we perform several data analyses with publicly available data and introduce the implementation of partDSA as an R package.


May 19, 2010

A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:30 am
The performance of classifiers is often assessed using receiver operating characteristic (ROC) curves [or accumulation curves (AC), also known as enrichment curves] and the corresponding areas under the curves (AUCs). However, in many fundamental problems ranging from information retrieval to drug discovery, only the very top of the ranked list of predictions is of any interest, and ROCs and AUCs are not very useful. New metrics, visualizations and optimization tools are needed to address this ‘early retrieval’ problem.

Results: To address the early retrieval problem, we develop the general concentrated ROC (CROC) framework. In this framework, any relevant portion of the ROC (or AC) curve is magnified smoothly by an appropriate continuous transformation of the coordinates with a corresponding magnification factor. Appropriate families of magnification functions confined to the unit square are derived and their properties are analyzed together with the resulting CROC curves. The area under the CROC curve (AUC[CROC]) can be used to assess early retrieval. The general framework is demonstrated on a drug discovery problem and used to discriminate more accurately the early retrieval performance of five different predictors. From this framework, we propose a novel metric and visualization—the CROC(exp), an exponential transform of the ROC curve—as an alternative to other methods. The CROC(exp) provides a principled, flexible and effective way for measuring and visualizing early retrieval performance with excellent statistical power. Corresponding methods for optimizing early retrieval are also described in the Appendix.
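The exponential transform can be illustrated with a short, self-contained sketch. It is not the authors' released code: it assumes the exponential magnification function f(x) = (1 - exp(-αx)) / (1 - exp(-α)) applied to the false positive rate axis, uses α = 7 purely as an illustrative default (it maps x = 0.1 to roughly 0.5), ignores tied scores, and uses hypothetical function names.

```python
import math


def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low score."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def magnify(x, alpha=7.0):
    """Exponential magnification of [0, 1] onto itself, stretching the region near 0."""
    return (1.0 - math.exp(-alpha * x)) / (1.0 - math.exp(-alpha))


def area_under(points):
    """Trapezoidal area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_croc(scores, labels, alpha=7.0):
    """Area under the ROC curve after magnifying the false positive rate axis."""
    transformed = [(magnify(x, alpha), y) for x, y in roc_points(scores, labels)]
    return area_under(transformed)
```

Two rankings with the same ordinary AUC can score very differently here: a ranking that places a true positive first earns far more area than one that finds its positives only after an early false positive, which is the early retrieval behavior the metric is meant to reward.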

Availability: Datasets are publicly available. Python code and command-line utilities implementing CROC curves and metrics are available at

May 18, 2010

An integer programming formulation to identify the sparse network architecture governing differentiation of embryonic stem cells

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:30 am
The primary purpose of modeling gene regulatory networks for developmental processes is to reveal the pathways governing cellular differentiation to specific phenotypes. Knowledge of the differentiation network will enable the generation of desired cell fates by careful alteration of the governing network through adequate manipulation of the cellular environment.

Results: We have developed a novel integer programming-based approach to reconstruct the underlying regulatory architecture of differentiating embryonic stem cells from discrete temporal gene expression data. The network reconstruction problem is formulated using inherent features of biological networks: (i) that of cascade architecture which enables treatment of the entire complex network as a set of interconnected modules and (ii) that of sparsity of interconnection between the transcription factors. The developed framework is applied to the system of embryonic stem cells differentiating towards pancreatic lineage. Experimentally determined expression profile dynamics of relevant transcription factors serve as the input to the network identification algorithm. The developed formulation accurately captures many of the known regulatory modes involved in pancreatic differentiation. The predictive capacity of the model is tested by simulating an in silico potential pathway of subsequent differentiation. The predicted pathway is experimentally verified by concurrent differentiation experiments. Experimental results agree well with model predictions, thereby illustrating the predictive accuracy of the proposed algorithm.

May 17, 2010

Meta-analysis for pathway enrichment analysis when combining multiple genomic studies

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:30 am
Many pathway analysis (or gene set enrichment analysis) methods have been developed to identify enriched pathways under different biological states within a genomic study. As more and more microarray datasets accumulate, meta-analysis methods have also been developed to integrate information among multiple studies. Currently, most meta-analysis methods for combining genomic studies focus on biomarker detection and meta-analysis for pathway analysis has not been systematically pursued.

Results: We investigated two approaches of meta-analysis for pathway enrichment (MAPE) by combining statistical significance across studies at the gene level (MAPE_G) or at the pathway level (MAPE_P). Simulation results showed increased statistical power of meta-analysis approaches compared to a single study analysis and showed complementary advantages of MAPE_G and MAPE_P under different scenarios. We also developed an integrated method (MAPE_I) that incorporates advantages of both approaches. Comprehensive simulations and applications to real data on drug response of breast cancer cell lines and lung cancer tissues were evaluated to compare the performance of three MAPE variations. MAPE_P has the advantage of not requiring gene matching across studies. When MAPE_G and MAPE_P show complementary advantages, the hybrid version of MAPE_I is generally recommended.
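The difference between combining at the gene level and at the pathway level can be sketched as below. This is an illustration under stated assumptions, not the published MAPE procedure: Fisher's method stands in for whatever combination statistic the authors use, a hypergeometric over-representation test with a 0.05 cutoff stands in for their enrichment analysis, and the function names `mape_p` and `mape_g` are hypothetical labels borrowed from the abstract.

```python
import math


def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p) follows chi-square with 2k degrees of freedom."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the Fisher statistic
    # Closed-form chi-square survival function, valid for even degrees of freedom.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def hypergeom_enrichment(pathway, gene_p, cutoff=0.05):
    """P(at least k significant genes in the pathway) under random gene sampling."""
    total = len(gene_p)
    hits = {g for g, p in gene_p.items() if p <= cutoff}
    members = [g for g in pathway if g in gene_p]
    n, k = len(members), sum(g in hits for g in members)
    tail = sum(math.comb(len(hits), i) * math.comb(total - len(hits), n - i)
               for i in range(k, min(len(hits), n) + 1))
    return tail / math.comb(total, n)


def mape_p(pathway_p_by_study):
    """Pathway level: run enrichment per study first, then combine pathway p-values."""
    return {pw: fisher_combine(ps) for pw, ps in pathway_p_by_study.items()}


def mape_g(gene_p_by_study, pathway_genes):
    """Gene level: combine each gene's p-values across studies, then test enrichment."""
    combined = {g: fisher_combine(ps) for g, ps in gene_p_by_study.items()}
    return {pw: hypergeom_enrichment(genes, combined)
            for pw, genes in pathway_genes.items()}
```

Note that `mape_g` needs every gene to be matched across studies so its p-values can be combined, whereas `mape_p` only needs pathway-level results from each study, which is the practical advantage of the pathway-level approach mentioned above.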
