August 19, 2010

METAL: fast and efficient meta-analysis of genomewide association scans

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:29 am

METAL provides a computationally efficient tool for meta-analysis of genome-wide association scans, a commonly used approach for improving power in gene-mapping studies of complex traits. METAL provides a rich scripting interface and implements efficient memory management to allow analyses of very large data sets and to support a variety of input file formats.
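To make the idea of meta-analysis concrete, here is a minimal sketch of a generic fixed-effects, inverse-variance weighted combination of per-study effect estimates for a single marker. It is an illustration of the general technique only, not METAL's code or scripting interface; the function name and argument names are hypothetical.

```python
import math


def inverse_variance_meta(effects, std_errors):
    """Pool per-study effect estimates for one marker (illustrative only).

    effects[i] and std_errors[i] are the effect estimate and its standard
    error from study i. Each study is weighted by 1 / SE^2.
    """
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    pooled_effect = sum(w * b for w, b in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z_score = pooled_effect / pooled_se
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return pooled_effect, pooled_se, z_score, p_value


if __name__ == "__main__":
    # Two hypothetical studies reporting the same marker.
    print(inverse_variance_meta([0.10, 0.30], [0.10, 0.10]))
```

Studies with smaller standard errors dominate the pooled estimate, which is why combining scans can raise power beyond that of any single study.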

Availability and implementation: METAL, including source code, documentation, examples, and executables, is available at

August 18, 2010

Dealing with sparse data in predicting outcomes of HIV combination therapies

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:26 am

Because no cure or vaccine exists for infection with human immunodeficiency virus (HIV), the standard approach to treating HIV patients is to repeatedly administer different combinations of several antiretroviral drugs. Because of the large number of possible drug combinations, manually finding a successful regimen becomes practically impossible. This presents a major challenge for HIV treatment. The application of machine learning methods for predicting virological responses to potential therapies is a possible approach to solving this problem. However, due to evolving trends in treating HIV patients, the available clinical datasets have a highly unbalanced representation of therapies, which might negatively affect the usefulness of derived statistical models.

Results: This article presents an approach that tackles the problem of predicting virological response to combination therapies by learning a separate logistic regression model for each therapy. The models are fitted using not only the data from the target therapy but also information from similar therapies. For this purpose, we introduce and evaluate two different measures of therapy similarity. The models can also incorporate phenotypic knowledge on the therapy outcomes through a Gaussian prior. With our approach, we balance the uneven therapy representation in the datasets and produce higher-quality models for therapies with very few training samples. According to the results of the computational experiments, our therapy similarity model performs significantly better than training separate models for each therapy using only that therapy’s examples. Furthermore, the model’s performance is as good as that of an approach that encodes therapy information in the input feature space, with the advantage of delivering better results for therapies with very few training samples.
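One way to picture the fitting step is as a weighted logistic regression with a Gaussian prior on the coefficients, where each training sample is weighted by how similar its therapy is to the target therapy. The sketch below is a simplified illustration under that assumption; it is not the authors' implementation, and the function names, the plain gradient-ascent optimiser, and the way similarity enters as a per-sample weight are all hypothetical choices.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_weighted_logistic(X, y, sample_weights, prior_mean, prior_var,
                          learning_rate=0.1, n_iter=2000):
    """MAP estimate of logistic regression coefficients (illustrative only).

    Maximises  sum_i w_i * loglik_i  -  ||beta - prior_mean||^2 / (2 * prior_var)
    by plain gradient ascent. sample_weights[i] stands in for the similarity
    between sample i's therapy and the target therapy (an assumption here).
    The fixed learning rate suits small toy inputs only.
    """
    n_features = len(X[0])
    beta = list(prior_mean)
    for _ in range(n_iter):
        # Gradient of the Gaussian log-prior pulls beta towards prior_mean.
        grad = [-(beta[j] - prior_mean[j]) / prior_var for j in range(n_features)]
        for xi, yi, wi in zip(X, y, sample_weights):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(n_features):
                grad[j] += wi * (yi - p) * xi[j]
        beta = [b + learning_rate * g for b, g in zip(beta, grad)]
    return beta


if __name__ == "__main__":
    # First column is an intercept term; the last two samples come from a
    # less similar therapy and therefore get a smaller weight.
    X = [[1.0, -1.0], [1.0, -2.0], [1.0, 1.0], [1.0, 2.0]]
    y = [0, 0, 1, 1]
    print(fit_weighted_logistic(X, y, [1.0, 1.0, 0.5, 0.5], [0.0, 0.0], 1.0))
```

When a therapy has few or no samples of its own, the weighted samples from similar therapies and the prior carry most of the information, which is the situation the abstract highlights.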

Availability: Code of the efficient logistic regression is available from

August 7, 2010

CplexA: a Mathematica package to study macromolecular-assembly control of gene expression

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 7:38 am

Macromolecular assembly coordinates essential cellular processes, such as gene regulation and signal transduction. A major challenge for conventional computational methods to study these processes is tackling the exponential increase of the number of configurational states with the number of components. CplexA is a Mathematica package that uses functional programming to efficiently compute probabilities and average properties over such exponentially large number of states from the energetics of the interactions. The package is particularly suited to study gene expression at complex promoters controlled by multiple, local and distal, DNA binding sites for transcription factors.
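The link between interaction energetics and state probabilities can be illustrated with a brute-force Boltzmann enumeration over all occupancy states of a few binding sites. This Python sketch is not CplexA (which is a Mathematica package) and it deliberately uses the naive enumeration whose exponential cost the package is designed to avoid; the function names and the energy parameterisation are hypothetical.

```python
import itertools
import math


def state_probabilities(site_energies, pair_energies, kT=1.0):
    """Boltzmann probabilities of every occupancy state (illustrative only).

    site_energies[i]      : free energy of occupying site i
    pair_energies[(i, j)] : extra energy when sites i and j are both occupied
    A state is a tuple of 0/1 flags, one per site; there are 2**n of them.
    """
    n = len(site_energies)
    weights = {}
    for state in itertools.product((0, 1), repeat=n):
        energy = sum(e for e, s in zip(site_energies, state) if s)
        energy += sum(e for (i, j), e in pair_energies.items()
                      if state[i] and state[j])
        weights[state] = math.exp(-energy / kT)
    partition_function = sum(weights.values())
    return {state: w / partition_function for state, w in weights.items()}


def average_occupancy(probabilities, site):
    """Probability that the given site is occupied."""
    return sum(p for state, p in probabilities.items() if state[site])


if __name__ == "__main__":
    # Two sites with a favourable (negative) cooperative interaction.
    probs = state_probabilities([0.0, 1.0], {(0, 1): -2.0})
    print(average_occupancy(probs, 0), average_occupancy(probs, 1))
```

With n sites the dictionary holds 2**n entries, so this approach is only practical for small n.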

Availability: CplexA is freely available together with documentation at

August 2, 2010

Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:31 am

Count is a software package for the analysis of numerical profiles on a phylogeny. It is primarily designed to deal with profiles derived from the phyletic distribution of homologous gene families, but is suited to study any other integer-valued evolutionary characters. Count performs ancestral reconstruction, and infers family- and lineage-specific characteristics along the evolutionary tree. It implements popular methods employed in gene content analysis such as Dollo and Wagner parsimony, propensity for gene loss, as well as probabilistic methods involving a phylogenetic birth-and-death model.
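As an illustration of parsimony on integer-valued profiles, the sketch below scores one gene family on a fixed tree with a Wagner-style cost, using a small dynamic program over candidate ancestral copy numbers. It is not Count's implementation; the nested-tuple tree encoding, the function name, and the assumption that a loss costs 1 per copy while a gain costs `gain_penalty` per copy are hypothetical choices made for the example.

```python
def wagner_cost(tree, leaf_counts, gain_penalty=1.0):
    """Minimum Wagner-style parsimony cost for one profile (illustrative only).

    tree        : nested tuples of leaf names, e.g. (("A", "B"), "C")
    leaf_counts : dict mapping each leaf name to its observed copy number
    """
    max_state = max(leaf_counts.values())
    states = range(max_state + 1)

    def change_cost(parent, child):
        # Gains are charged gain_penalty per copy, losses 1 per copy.
        if child >= parent:
            return gain_penalty * (child - parent)
        return float(parent - child)

    def subtree_costs(node):
        # costs[s] = cheapest cost of the subtree if this node has s copies.
        if isinstance(node, str):
            return [0.0 if s == leaf_counts[node] else float("inf")
                    for s in states]
        costs = [0.0] * (max_state + 1)
        for child in node:
            child_costs = subtree_costs(child)
            for s in states:
                costs[s] += min(change_cost(s, t) + child_costs[t]
                                for t in states)
        return costs

    return min(subtree_costs(tree))


if __name__ == "__main__":
    tree = (("A", "B"), "C")
    print(wagner_cost(tree, {"A": 2, "B": 2, "C": 0}))
```

Raising `gain_penalty` makes reconstructions with an ancestral copy followed by losses cheaper than reconstructions with independent gains.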

Availability: Count is available as a stand-alone Java application, as well as an application bundle for MacOS X, at the web site It can also be launched using Java Webstart from the same site. The software is distributed under a BSD-style license. Source code is available upon request from the author.

July 26, 2010

Cassis: detection of genomic rearrangement breakpoints

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 4:55 am

Genomes undergo large structural changes that alter their organization. The chromosomal regions affected by these rearrangements are called breakpoints, while those which have not been rearranged are called synteny blocks. Lemaitre et al. presented a new method to precisely delimit rearrangement breakpoints in a genome by comparison with the genome of a related species. Receiving as input a list of one-to-one orthologous genes found in the genomes of two species, the method builds a set of reliable and non-overlapping synteny blocks and refines the regions that are not contained in them. Through the alignment of each breakpoint sequence against its specific orthologous sequences in the other species, we can look for weak similarities inside the breakpoint, thus extending the synteny blocks and narrowing the breakpoints. The identification of the narrowed breakpoints relies on a segmentation algorithm and is statistically assessed. Here, we present the package Cassis that implements this method of precise detection of genomic rearrangement breakpoints.

Availability: Perl and R scripts are freely available for download at Documentation with methodological background, technical aspects, download and setup instructions, as well as examples of applications are available together with the package. The package was tested on Linux and Mac OS environments and is distributed under the GNU GPL License

July 22, 2010

MISS: a non-linear methodology based on mutual information for genetic association studies in both population and sib-pairs analysis

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:54 am

Finding association between genetic variants and phenotypes related to disease has become an important vehicle for the study of complex disorders. In this context, multi-loci genetic association might unravel additional information when compared with single loci search. The main goal of this work is to propose a non-linear methodology based on information theory for finding combinatorial association between multi-SNPs and a given phenotype.

Results: The proposed methodology, called MISS (mutual information statistical significance), has been integrated with a feature selection algorithm and has been tested on a synthetic dataset with a controlled phenotype and in the particular case of the F7 gene. The MISS methodology has been compared with a multiple linear regression (MLR) method used for genetic association, in both a population-based study and a sib-pairs analysis, and with the maximum entropy conditional probability modelling (MECPM) method, which searches for predictive multi-locus interactions. Several sets of SNPs within the F7 gene region have been found to show a significant correlation with FVII levels in blood. The proposed multi-site approach unveils combinations of SNPs that explain more of the phenotype than their individual polymorphisms do. MISS is able to find more correlations between SNPs and the phenotype than MLR and MECPM. Most of the marked SNPs appear in the literature as functional variants with a real effect on FVII protein levels in blood.
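The core quantity behind an approach of this kind can be sketched as the mutual information between a multi-SNP genotype combination and a phenotype, with significance judged by permuting the phenotype. The code below is a generic illustration, not the MISS code: the function names are hypothetical, and it assumes the phenotype has already been discretised into categories.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def permutation_p_value(genotypes, phenotype, n_perm=1000, seed=0):
    """Observed MI and an empirical p-value from phenotype permutations.

    genotypes[i] may be a tuple of genotypes at several SNPs for sample i,
    so a multi-SNP combination is treated as a single discrete variable.
    """
    rng = random.Random(seed)
    observed = mutual_information(genotypes, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    snp_pairs = [(0, 1), (0, 1), (1, 2), (1, 2), (2, 0), (2, 0)]
    trait = ["low", "low", "high", "high", "low", "low"]
    print(permutation_p_value(snp_pairs, trait, n_perm=200))
```

Because mutual information makes no linearity assumption, it can register genotype combinations whose effect on the phenotype a linear regression would miss.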

Availability: The code is available at

May 21, 2010

ACCUSA—accurate SNP calling on draft genomes

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:47 am

Next-generation sequencing technologies facilitate genome-wide analysis of several biological processes. We are interested in whole-genome genotyping. To our knowledge, none of the existing single nucleotide polymorphism (SNP) callers consider the quality of the reference genome, a consideration that is unnecessary for high-quality assemblies of well-studied model organisms. However, most genome projects will remain in draft status with little to no genome assembly improvement due to time and financial constraints. Here, we present a simple yet elegant solution (‘ACCUSA’) that considers both the read qualities and the reference genome’s quality using a Bayesian framework. We demonstrate that ACCUSA is as good as the current SNP calling software in detecting true SNPs. More importantly, ACCUSA does not call spurious SNPs that originate from a poor reference sequence.

Availability: ACCUSA is available free of charge to academic users and may be obtained from ACCUSA is programmed in JAVA 6 and runs on any platform with JAVA support.

May 14, 2010

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:36 am

In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and to associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data; however, the ‘detectable’ sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.

Results: We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.), and thus requires significantly fewer computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing them with the insertions discovered in the same genome using various sources of sequence data.

Availability: The implementation of the NovelSeq pipeline is available at

April 2, 2010

Treephyler: fast taxonomic profiling of metagenomes

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:40 am

Assessment of phylogenetic diversity is a key element in the analysis of microbial communities. Tools are needed to handle next-generation sequencing data and to cope with the computational complexity of large-scale studies. Here, we present Treephyler, a tool for fast taxonomic profiling of metagenomes. Treephyler was evaluated on a real metagenome to assess its performance in comparison to previous approaches for taxonomic profiling. Results indicate that, in terms of speed and accuracy, Treephyler is prepared for next-generation sequencing techniques and large-scale analysis.

Availability: Treephyler is implemented in Perl; it is portable to all platforms and applicable to both nucleotide and protein input data. Treephyler is freely available for download at

March 20, 2010

BEDTools: a flexible suite of utilities for comparing genomic features

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 12:58 pm

Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner.

Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. The BEDTools utilities can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets.
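The basic operation behind comparing two sets of genomic features is finding the intervals they share. The following sketch shows a naive version of that operation on BED-like tuples, using BED's 0-based, half-open coordinates. It is not how BEDTools (written in C++) is implemented and will not scale the same way; the function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def intersect(features_a, features_b):
    """Return the overlapping segments between two feature lists (illustrative only).

    Each feature is a (chrom, start, end) tuple with 0-based, half-open
    coordinates, as in BED. Every overlapping A/B pair yields one segment.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in features_b:
        by_chrom[chrom].append((start, end))

    overlaps = []
    for chrom, a_start, a_end in features_a:
        for b_start, b_end in by_chrom.get(chrom, ()):
            # Half-open intervals overlap when each starts before the other ends.
            if a_start < b_end and b_start < a_end:
                overlaps.append((chrom, max(a_start, b_start), min(a_end, b_end)))
    return overlaps


if __name__ == "__main__":
    genes = [("chr1", 100, 200), ("chr2", 0, 50)]
    peaks = [("chr1", 150, 250), ("chr1", 200, 300), ("chr2", 60, 70)]
    print(intersect(genes, peaks))
```

Note that the feature ending at position 200 and the one starting at 200 do not overlap, because the end coordinate is exclusive.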

Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at
