Biointelligence

July 28, 2010

ACNE: a summarization method to estimate allele-specific copy numbers for Affymetrix SNP arrays

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:11 am

Current algorithms for estimating DNA copy numbers (CNs) borrow concepts from gene expression analysis methods. However, single nucleotide polymorphism (SNP) arrays have special characteristics that, if taken into account, can improve the overall performance. For example, cross hybridization between alleles occurs in SNP probe pairs. In addition, most of the current CN methods are focused on total CNs, while it has been shown that allele-specific CNs are of paramount importance for some studies. Therefore, we have developed a summarization method that estimates high-quality allele-specific CNs.

Results: The proposed method estimates allele-specific DNA CNs for all Affymetrix SNP arrays, dealing directly with the cross hybridization between probes within SNP probesets. The algorithm outperforms, or at least performs as well as, other state-of-the-art algorithms for computing DNA CNs. It discriminates better between an aberration and a normal state, and it gives more precise allele-specific CNs.

Availability: The method is available in the open-source R package ACNE, which also includes an add-on to the aroma.affymetrix framework (http://www.aroma-project.org/).

July 27, 2010

JAMIE: joint analysis of multiple ChIP-chip experiments

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:16 am

Chromatin immunoprecipitation followed by genome tiling array hybridization (ChIP-chip) is a powerful approach to identify transcription factor binding sites (TFBSs) in target genomes. When multiple related ChIP-chip datasets are available, analyzing them jointly allows one to borrow information across datasets to improve peak detection. This is particularly useful for analyzing noisy datasets.

Results: We propose a hierarchical mixture model and develop an R package, JAMIE, to perform the joint analysis. The genome is assumed to consist of background and potential binding regions (PBRs). PBRs have context-dependent probabilities of becoming bona fide binding sites in individual datasets. This model captures the correlation among datasets, which provides a basis for sharing information across experiments. Tests on real data illustrate the advantage of JAMIE over a strategy that analyzes individual datasets separately.

Availability: JAMIE is freely available from http://www.biostat.jhsph.edu/hji/jamie

July 26, 2010

Cassis: detection of genomic rearrangement breakpoints

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 4:55 am

Genomes undergo large structural changes that alter their organization. The chromosomal regions affected by these rearrangements are called breakpoints, while those that have not been rearranged are called synteny blocks. Lemaitre et al. presented a new method to precisely delimit rearrangement breakpoints in a genome by comparison with the genome of a related species. Taking as input a list of one-to-one orthologous genes found in the genomes of two species, the method builds a set of reliable, non-overlapping synteny blocks and refines the regions that are not contained in them. Aligning each breakpoint sequence against its specific orthologous sequences in the other species reveals weak similarities inside the breakpoint, which extends the synteny blocks and narrows the breakpoints. The identification of the narrowed breakpoints relies on a segmentation algorithm and is statistically assessed. Here, we present the package Cassis, which implements this method for the precise detection of genomic rearrangement breakpoints.
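The first step described above, grouping one-to-one orthologs into synteny blocks and reading the breakpoints off the gaps between them, can be sketched in a few lines of Python. This is only an illustrative simplification, not the Cassis implementation: the tuple layout, the function names and the rule "consecutive genes in both genomes, in a consistent direction" are assumptions made for the example, and the alignment-based refinement and segmentation steps are not shown.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption for this sketch):
# (chromosome in A, gene rank in A, chromosome in B, gene rank in B),
# where ranks count only genes that have a one-to-one ortholog.
OrthologPair = Tuple[str, int, str, int]


def build_synteny_blocks(pairs: List[OrthologPair]) -> List[List[OrthologPair]]:
    """Group ortholog pairs into maximal collinear runs, ordered along genome A."""
    if not pairs:
        return []
    ordered = sorted(pairs, key=lambda p: (p[0], p[1]))
    blocks = []
    current = [ordered[0]]
    direction = 0  # +1 same orientation, -1 inverted, 0 not yet known
    for prev, cur in zip(ordered, ordered[1:]):
        step = cur[3] - prev[3]
        same_chromosomes = cur[0] == prev[0] and cur[2] == prev[2]
        if same_chromosomes and abs(step) == 1 and direction in (0, step):
            direction = step
            current.append(cur)
        else:
            # Collinearity is broken: close the block and start a new one.
            blocks.append(current)
            current = [cur]
            direction = 0
    blocks.append(current)
    return blocks


def breakpoints_in_a(blocks: List[List[OrthologPair]]) -> List[Tuple[str, int, int]]:
    """Return (chromosome, last rank of left block, first rank of right block)."""
    gaps = []
    for left, right in zip(blocks, blocks[1:]):
        if left[-1][0] == right[0][0]:
            gaps.append((left[-1][0], left[-1][1], right[0][1]))
    return gaps


# Toy data: genes 3..6 of genome A are inverted in genome B.
toy_pairs = [("chr1", i, "chrX", b) for i, b in enumerate([0, 1, 2, 6, 5, 4, 3, 7, 8])]
toy_blocks = build_synteny_blocks(toy_pairs)
toy_breakpoints = breakpoints_in_a(toy_blocks)
```

On the toy data the inversion yields three blocks and two breakpoints, one on each side of the inverted segment. A real pipeline would also discard very short blocks as unreliable, which this sketch does not attempt.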

Availability: Perl and R scripts are freely available for download at http://pbil.univ-lyon1.fr/software/Cassis/. Documentation covering the methodological background, technical aspects, and download and setup instructions is distributed with the package, together with examples of applications. The package was tested on Linux and Mac OS environments and is distributed under the GNU GPL.

July 24, 2010

COMA server for protein distant homology search

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 8:07 am

Detection of distant homology is a widely used computational approach for studying protein evolution, structure and function. Here, we report a homology search web server based on sequence profile–profile comparison. The user may perform searches in one of several regularly updated profile databases using either a single sequence or a multiple sequence alignment as an input. The same profile databases can also be downloaded for local use. The capabilities of the server are illustrated with the identification of new members of the highly diverse PD-(D/E)XK nuclease superfamily.

Availability: http://www.ibt.lt/bioinformatics/coma

July 23, 2010

Structure-based variable selection for survival data

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:20 am

Variable selection is a typical approach used for molecular-signature and biomarker discovery; however, its application to survival data is often complicated by censored samples. We propose a new algorithm for variable selection suitable for the analysis of high-dimensional, right-censored data called Survival Max–Min Parents and Children (SMMPC). The algorithm is conceptually simple, scalable, based on the theory of Bayesian networks (BNs) and the Markov blanket and extends the corresponding algorithm (MMPC) for classification tasks. The selected variables have a structural interpretation: if T is the survival time (in general the time-to-event), SMMPC returns the variables adjacent to T in the BN representing the data distribution. The selected variables also have a causal interpretation that we discuss.

Results: We conduct an extensive empirical analysis of prototypical and state-of-the-art variable selection algorithms for survival data that are applicable to high-dimensional biological data. SMMPC selects, on average, the smallest variable subsets (fewer than a dozen per dataset), while statistically significantly outperforming all of the methods in the study that return a manageable number of genes that could be inspected by a human expert.

Availability: Matlab and R code are freely available from http://www.mensxmachina.org

July 22, 2010

MISS: a non-linear methodology based on mutual information for genetic association studies in both population and sib-pairs analysis

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:54 am

Finding association between genetic variants and phenotypes related to disease has become an important vehicle for the study of complex disorders. In this context, multi-loci genetic association might unravel additional information when compared with single loci search. The main goal of this work is to propose a non-linear methodology based on information theory for finding combinatorial association between multi-SNPs and a given phenotype.

Results: The proposed methodology, called MISS (mutual information statistical significance), has been integrated with a feature selection algorithm and tested on a synthetic dataset with a controlled phenotype and on the particular case of the F7 gene. MISS has been compared with a multiple linear regression (MLR) method used for genetic association in both a population-based study and a sib-pairs analysis, and with the maximum entropy conditional probability modelling (MECPM) method, which searches for predictive multi-locus interactions. Several sets of SNPs within the F7 gene region show a significant correlation with FVII levels in blood. The proposed multi-site approach unveils combinations of SNPs that carry more significant information about the phenotype than their individual polymorphisms. MISS finds more correlations between SNPs and the phenotype than MLR and MECPM. Most of the marked SNPs appear in the literature as functional variants with a real effect on FVII protein levels in blood.
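The idea that a combination of SNPs can carry information that no single SNP carries can be illustrated with a small Python sketch. It is not the MISS code: the function names are made up for the example, the phenotype is assumed to be already discretized (FVII levels are continuous in practice), and the permutation test is a swapped-in, generic way of attaching a significance level to a mutual information value, since the abstract does not say how MISS computes significance.

```python
import math
import random
from collections import Counter
from typing import Hashable, Sequence, Tuple


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def permutation_p_value(genotypes: Sequence[Hashable],
                        phenotype: Sequence[Hashable],
                        n_perm: int = 200,
                        seed: int = 0) -> Tuple[float, float]:
    """Return (observed MI, permutation p-value) by shuffling the phenotype."""
    rng = random.Random(seed)
    observed = mutual_information(genotypes, phenotype)
    shuffled = list(phenotype)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(genotypes, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Synthetic example (an assumption, not data from the study): the phenotype
# is the XOR of two binary SNPs, so neither SNP is informative on its own.
snp1 = [0, 0, 1, 1] * 10
snp2 = [0, 1, 0, 1] * 10
phenotype = [a ^ b for a, b in zip(snp1, snp2)]
joint = list(zip(snp1, snp2))  # a multi-SNP genotype is just a tuple

mi_single = mutual_information(snp1, phenotype)
mi_joint, p_joint = permutation_p_value(joint, phenotype)
```

In this constructed case the single-SNP mutual information is 0 bits while the joint genotype carries the full 1 bit of the phenotype, which is the kind of combinatorial association a linear single-locus model would miss.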

Availability: The code is available at http://sisbio.recerca.upc.edu/R/MISS_0.2.tar.gz

July 20, 2010

adephylo: new tools for investigating the phylogenetic signal in biological traits

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 8:42 am

adephylo is a package for the R software dedicated to the analysis of comparative evolutionary data. Phylogenetic comparative methods initially aimed at accounting for or removing the effects of phylogenetic signal in the analysis of biological traits. However, recent approaches have shown that considerable information can be gathered from the study of the phylogenetic signal. In particular, close examination of phylogenetic structures can unveil interesting evolutionary patterns. For this purpose, we developed the package adephylo that provides tools for quantifying and describing the phylogenetic structures of biological traits. adephylo implements tests of phylogenetic signal, phylogenetic distances and proximities, and novel methods for describing further univariate and multivariate phylogenetic structures. These tools open up new perspectives in the analysis of evolutionary comparative data.

Availability: The stable version is available from CRAN: http://cran.r-project.org/web/packages/adephylo/. The development version is hosted by R-Forge: http://r-forge.r-project.org/projects/adephylo/. Both versions can be installed directly from R. adephylo is distributed under the GNU General Public Licence (2).

July 16, 2010

SLIMS—a user-friendly sample operations and inventory management system for genotyping labs

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:02 am

We present the Sample-based Laboratory Information Management System (SLIMS), a powerful and user-friendly open source web application that provides all members of a laboratory with an interface to view, edit and create sample information. SLIMS aims to simplify common laboratory tasks with tools such as a user-friendly shopping cart for subjects, samples and containers that easily generates reports, shareable lists and plate designs for genotyping. Further key features include customizable data views, database change-logging and dynamically filled pre-formatted reports. Along with being feature-rich, SLIMS’ power comes from being able to handle longitudinal data from multiple time-points and biological sources. This type of data is increasingly common in studies searching for susceptibility genes for common complex diseases, which collect thousands of samples and generate millions of genotypes and overwhelming amounts of data. LIMSs provide an efficient way to deal with this data while increasing accessibility and reducing laboratory errors; however, professional LIMSs are often too costly to be practical. SLIMS gives labs a feasible alternative that is easily accessible, user-centrically designed and feature-rich. To facilitate customization of the system and its use by other groups, manuals have been written for users and developers.

Availability: Documentation, source code and manuals are available at http://genapha.icapture.ubc.ca/SLIMS/index.jsp. SLIMS was developed using Java 1.6.0, JSPs, Hibernate 3.3.1.GA, DB2 and MySQL, Apache Tomcat 6.0.18, NetBeans IDE 6.5, Jasper Reports 3.5.1 and JasperSoft’s iReport 3.5.1.

July 15, 2010

Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:35 am

Since database retrieval is a fundamental operation, the measurement of retrieval efficacy is critical to progress in bioinformatics. This article points out some issues with current methods of measuring retrieval efficacy and suggests some improvements. In particular, many studies have used the pooled receiver operating characteristic for n irrelevant records (ROCn) score, that is, the area under the ROC curve (AUC) of a ‘pooled’ ROC curve, truncated at n irrelevant records. Unfortunately, the pooled ROCn score does not faithfully reflect actual usage of retrieval algorithms. Additionally, a pooled ROCn score can be very sensitive to the retrieval results from as few as a single query.
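For readers unfamiliar with the ROCn score, here is a minimal Python sketch of the usual definition for one ranked list: the normalized area under the ROC curve up to the n-th irrelevant record. The function name and the handling of lists that contain fewer than n irrelevant records are assumptions made for the example. A pooled score would be obtained by merging the hits of all queries into one list, sorted by score, before calling the function.

```python
from typing import Sequence


def roc_n(ranked_relevance: Sequence[bool], total_relevant: int, n: int) -> float:
    """ROCn for one ranked list (best hit first).

    ranked_relevance[i] is True when the i-th retrieved record is relevant.
    The score is the sum, over the first n irrelevant records, of the number
    of relevant records ranked above each one, divided by n * total_relevant.
    """
    if n <= 0 or total_relevant <= 0:
        raise ValueError("n and total_relevant must be positive")
    true_positives = 0
    false_positives = 0
    area = 0
    for is_relevant in ranked_relevance:
        if is_relevant:
            true_positives += 1
        else:
            false_positives += 1
            area += true_positives
            if false_positives == n:
                break
    # Assumption: if the list ends before n irrelevant records are seen,
    # the missing ones are treated as ranked below everything retrieved.
    area += true_positives * (n - false_positives)
    return area / (n * total_relevant)


perfect = roc_n([True, True, True, False, False], total_relevant=3, n=2)
mixed = roc_n([True, True, False, True, False], total_relevant=3, n=2)
worst = roc_n([False, False, True, True, True], total_relevant=3, n=2)
```

A perfect ranking scores 1 and a ranking that places n irrelevant records first scores 0. Because pooling sorts the hits from all queries together, a single query with many high-scoring irrelevant hits can use up all n slots, which is the sensitivity the paragraph above refers to.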

Methods: To replace the pooled ROCn score, we propose the Threshold Average Precision (TAP-k), a measure closely related to the well-known average precision in information retrieval, but reflecting the usage of E-values in bioinformatics. Furthermore, in addition to conditions previously given in the literature, we introduce three new criteria that an ideal measure of retrieval efficacy should satisfy.
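As background for the measure that TAP-k builds on, the following Python sketch computes the ordinary average precision from information retrieval for one query, after cutting the hit list at an E-value threshold. It is deliberately not the TAP-k formula, which the abstract does not spell out; the function name, the input layout and the simple truncation rule are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def average_precision_at_threshold(hits: Sequence[Tuple[float, bool]],
                                   total_relevant: int,
                                   e_value_threshold: float) -> float:
    """Average precision over hits whose E-value is at or below the threshold.

    hits is a list of (e_value, is_relevant) pairs sorted by increasing
    E-value. The precision is sampled at the rank of every relevant hit that
    passes the threshold, and the sum is divided by the total number of
    relevant records, so relevant records that are missed count as zero.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    relevant_seen = 0
    precision_sum = 0.0
    for rank, (e_value, is_relevant) in enumerate(hits, start=1):
        if e_value > e_value_threshold:
            break  # the user would not look past the threshold
        if is_relevant:
            relevant_seen += 1
            precision_sum += relevant_seen / rank
    return precision_sum / total_relevant


# Hypothetical hit list for one query, with 4 relevant records in the database.
example_hits = [(1e-30, True), (1e-10, False), (1e-5, True), (0.5, True), (3.0, False)]
example_ap = average_precision_at_threshold(example_hits, total_relevant=4,
                                            e_value_threshold=1.0)
```

Unlike a pooled score, this quantity is computed per query and can then be averaged over queries, so no single query can dominate the result.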

Results: PSI-BLAST, GLOBAL, HMMER and RPS-BLAST provided examples of using the TAP-k and pooled ROCn scores to evaluate sequence retrieval algorithms. In particular, compelling examples using real data highlight the drawbacks of the pooled ROCn score, showing that it can produce evaluations skewing far from intuitive expectations. In contrast, the TAP-k satisfies most of the criteria desired in an ideal measure of retrieval efficacy.

Availability and Implementation: The TAP-k web server and downloadable Perl script are freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html.ncbi/tap/
