Biointelligence

August 17, 2010

A probabilistic framework for aligning paired-end RNA-seq data

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:56 am

The RNA-seq paired-end read (PER) protocol samples transcript fragments longer than the sequencing capability of today’s technology by sequencing just the two ends of each fragment. Deep sampling of the transcriptome using the PER protocol presents the opportunity to reconstruct the unsequenced portion of each transcript fragment using end reads from overlapping PERs, guided by the expected length of the fragment.

Methods: A probabilistic framework is described to predict the alignment to the genome of all PER transcript fragments in a PER dataset. Starting from possible exonic and spliced alignments of all end reads, our method constructs potential splicing paths connecting paired ends. An expectation maximization method assigns likelihood values to all splice junctions and assigns the most probable alignment for each transcript fragment.
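The EM step can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes each fragment has already been reduced to a list of candidate splicing paths it is compatible with, it ignores fragment-length likelihoods, and the names `em_path_weights` and `most_probable_path` are hypothetical.

```python
def em_path_weights(compatible, n_paths, n_iter=100):
    """Estimate the relative abundance of each candidate path by EM.

    compatible[i] lists the indices of the paths that fragment i could
    have come from (an assumed, pre-computed input).
    """
    weights = [1.0 / n_paths] * n_paths  # start from a uniform guess
    for _ in range(n_iter):
        expected = [0.0] * n_paths
        # E-step: split each fragment across its compatible paths in
        # proportion to the current path weights.
        for paths in compatible:
            total = sum(weights[p] for p in paths)
            for p in paths:
                expected[p] += weights[p] / total
        # M-step: the new weights are the normalized expected counts.
        weights = [e / len(compatible) for e in expected]
    return weights


def most_probable_path(paths, weights):
    """Pick the compatible path with the highest estimated weight."""
    return max(paths, key=lambda p: weights[p])


# Three fragments support only path 0, one supports only path 1,
# and two are ambiguous between the two paths.
fragments = [[0], [0], [0], [1], [0, 1], [0, 1]]
weights = em_path_weights(fragments, n_paths=2)
assignment = [most_probable_path(paths, weights) for paths in fragments]
```

In this toy input the ambiguous fragments are pulled toward the path that has more unambiguous support, which is the behaviour the abstract relies on when alternative paths exist in the splice graph.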

Results: The method was applied to 2 × 35 bp PER datasets from the cancer cell lines MCF-7 and SUM-102. PER fragment alignment increased coverage 3-fold compared with alignment of the end reads alone, and increased the accuracy of splice detection. The accuracy of the expectation maximization (EM) algorithm in the presence of alternative paths in the splice graph was validated by qRT–PCR experiments on eight exon-skipping alternative splicing events. PER fragment alignment with long-range splicing confirmed 8 out of 10 fusion events identified in the MCF-7 cell line in an earlier study (Maher et al., 2009).

Availability: Software available at http://www.netlab.uky.edu/p/bioinfo/MapSplice/PER

August 16, 2010

Bridges: a tool for identifying local similarities in long sequences

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 11:52 am

Bridges is a heuristic search tool that uses short word matches to rapidly identify local similarities between sequences. It consists of three stages: filtering input sequences, identifying local similarities and post-processing local similarities. As input sequence data are released from memory after the filtering stage, genome-scale datasets can be efficiently compared in a single run. Bridges also includes 20 parameters, which enable the user to dictate the sensitivity and specificity of a search.
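The idea of seeding a search with short word matches can be sketched as follows. This is a generic k-mer lookup written for illustration, not the Bridges code (which is in C); the function name `word_matches` and the word length used in the example are assumptions.

```python
from collections import defaultdict


def word_matches(a, b, k):
    """Return (i, j) pairs where a[i:i+k] == b[j:j+k].

    Every k-letter word of `a` is indexed once, then each word of `b`
    is looked up in that index.
    """
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    hits = []
    for j in range(len(b) - k + 1):
        # .get avoids creating empty entries for words absent from `a`.
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return hits


hits = word_matches("ACGTACGT", "TTACGTT", k=4)
```

A real tool would go on to extend and merge such hits into local similarities; only the seeding stage is shown here.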

Availability: Bridges is implemented in the C programming language and can be run on all platforms. Source code and documentation are available at http://github.com/rassis/bridges

August 11, 2010

Over-optimism in bioinformatics: an illustration

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:45 am

In statistical bioinformatics research, different optimization mechanisms potentially lead to ‘over-optimism’ in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking.

Results: We present an empirical study on over-optimism using high-dimensional classification as example. Specifically, we consider a ‘promising’ new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we ‘fish for significance’. The investigated sources of over-optimism include the optimization of datasets, of settings, of competing methods and, most importantly, of the method’s characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data.
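One source of over-optimism, reporting the best of many variants on the same test data, can be reproduced with a small simulation. This is an illustrative sketch only, not the study's R code: every "variant" here is a coin-flip classifier with a true error rate of 0.5, and the function name, sample sizes and seed are assumptions.

```python
import random


def simulate_selection_bias(n_variants=50, n_test=40, n_valid=1000, seed=0):
    """Compare the best test error among useless variants with fresh data."""
    rng = random.Random(seed)

    def error_rate(n):
        # A variant with no real signal: each prediction is wrong
        # with probability 0.5.
        return sum(rng.random() < 0.5 for _ in range(n)) / n

    test_errors = [error_rate(n_test) for _ in range(n_variants)]
    best_test_error = min(test_errors)  # the number that would be reported
    mean_test_error = sum(test_errors) / n_variants
    # Independent validation of the "selected" variant: it is still a coin flip.
    validation_error = error_rate(n_valid)
    return best_test_error, mean_test_error, validation_error


best, mean, valid = simulate_selection_bias()
```

The minimum over many variants sits below the true error rate purely by chance, while the error on independent validation data returns to roughly 0.5, which is the abstract's argument for independent validation.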

Availability: The R code and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, so that the study is completely reproducible.

August 3, 2010

LOX: inferring Level Of eXpression from diverse methods of census sequencing

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 7:13 am

We present LOX (Level Of eXpression), a tool that estimates the level of gene expression from high-throughput expressed-sequence datasets with multiple treatments or samples. Unlike most analyses, LOX incorporates a gene bias model that facilitates the integration of transcriptomic sequencing data produced using diverse experimental methodologies. LOX integrates overall sequence count tallies, normalized by total expressed sequence count, to provide expression levels for each gene relative to all treatments, together with Bayesian credible intervals.
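The normalization by total expressed sequence count can be sketched in a few lines. This is only the arithmetic of that one step under assumed inputs; it does not reproduce LOX's gene bias model or its Bayesian credible intervals, and the function name `relative_expression` and the data layout are hypothetical.

```python
def relative_expression(counts, reference):
    """Normalize tallies by library size, then express them relative
    to one reference treatment.

    counts: {treatment: {gene: tally}} (assumed layout).
    Genes with a zero tally in the reference treatment are not handled.
    """
    normalized = {}
    for treatment, tallies in counts.items():
        total = sum(tallies.values())  # total expressed sequence count
        normalized[treatment] = {g: c / total for g, c in tallies.items()}
    ref = normalized[reference]
    return {
        treatment: {g: value / ref[g] for g, value in genes.items()}
        for treatment, genes in normalized.items()
    }


counts = {
    "control": {"geneA": 10, "geneB": 90},
    "treated": {"geneA": 30, "geneB": 70},
}
levels = relative_expression(counts, reference="control")
```

Here `geneA` makes up 10% of the control tallies and 30% of the treated tallies, so its level in the treated sample is about three times the control level.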

Availability: http://www.yale.edu/townsend/software.html

July 14, 2010

geWorkbench: an open source platform for integrative genomics

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:34 am

geWorkbench (genomics Workbench) is an open source Java desktop application that provides access to an integrated suite of tools for the analysis and visualization of data from a wide range of genomics domains (gene expression, sequence, protein structure and systems biology). More than 70 distinct plug-in modules are currently available, implementing both classical analyses (several variants of clustering, classification, homology detection, etc.) and state-of-the-art algorithms for the reverse engineering of regulatory networks and for protein structure prediction, among many others. geWorkbench leverages standards-based middleware technologies to provide seamless access to remote data, annotation and computational servers, thus enabling researchers with limited local resources to benefit from available public infrastructure.

Availability: The project site (http://www.geworkbench.org) includes links to self-extracting installers for most operating system (OS) platforms as well as instructions for building the application from scratch using the source code [which is freely available from the project’s SVN (subversion) repository]. geWorkbench support is available through the end-user and developer forums of the caBIG® Molecular Analysis Tools Knowledge Center, https://cabig-kc.nci.nih.gov/Molecular/forums/

July 13, 2010

Prediction of protease substrates using sequence and structure features

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:57 am

Granzyme B (GrB) and caspases cleave specific protein substrates to induce apoptosis in virally infected and neoplastic cells. While substrates for both types of proteases have been determined experimentally, there are many more yet to be discovered in humans and other metazoans. Here, we present a bioinformatics method based on support vector machine (SVM) learning that identifies sequence and structural features important for protease recognition of substrate peptides and then uses these features to predict novel substrates. Our approach can act as a convenient hypothesis generator, guiding future experiments by high-confidence identification of peptide-protein partners.

Results: The method is benchmarked on the known substrates of both protease types, including our literature-curated GrB substrate set (GrBah). On these benchmark sets, the method outperforms a number of other methods that consider sequence only, predicting at a 0.87 true positive rate (TPR) and a 0.13 false positive rate (FPR) for caspase substrates, and a 0.79 TPR and a 0.21 FPR for GrB substrates. The method is then applied to 25 000 proteins in the human proteome to generate a ranked list of predicted substrates of each protease type. Two of these predictions, AIF-1 and SMN1, were selected for further experimental analysis, and each was validated as a GrB substrate.
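For readers unfamiliar with the two rates quoted above, the following sketch shows how TPR and FPR are computed from classifier scores at a chosen threshold. The scores, labels and threshold are made-up illustration values, not data from the study, and `tpr_fpr` is a hypothetical helper.

```python
def tpr_fpr(scores, labels, threshold):
    """Return (true positive rate, false positive rate) at a threshold.

    labels: 1 for a known substrate, 0 for a non-substrate.
    A peptide is predicted positive when its score >= threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
labels = [1, 1, 1, 0, 0, 0]
tpr, fpr = tpr_fpr(scores, labels, threshold=0.5)
```

Lowering the threshold raises both rates, so a reported TPR/FPR pair describes one operating point of the classifier rather than the classifier as a whole.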

Availability: All predictions for both protease types are publicly available at http://salilab.org/peptide. A web server at the same site allows a user to train new SVM models to make predictions for any protein that recognizes specific oligopeptide ligands.

June 7, 2010

A computational analysis of the antigenic properties of haemagglutinin in influenza A H3N2

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am

Modelling antigenic shift in influenza A H3N2 can help to predict the efficiency of vaccines. The virus is known to exhibit sudden jumps in antigenic distance, and prediction of such novel strains from amino acid sequence differences remains a challenge.

Results: From analysis of 6624 amino acid sequences of wild-type H3, we propose updates to the frequently referenced list of 131 amino acids located at or near the five identified antibody binding regions in haemagglutinin (HA). We introduce a class of predictive models based on the analysis of amino acid changes in these binding regions, and extend the principle to changes in HA1 as a whole by dividing the molecule into regional bands.

Our results show that a range of simple models based on banded changes give better predictive performance than models based on the established five canonical regions and can identify a higher proportion of vaccine escape candidates among novel strains than a current state-of-the-art model.
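The notion of counting amino acid changes per regional band can be sketched as below. The abstract does not state how the bands are defined, so the fixed band width, the toy sequences and the name `banded_changes` are all assumptions made for illustration.

```python
def banded_changes(seq_a, seq_b, band_width):
    """Count differing positions between two aligned sequences,
    grouped into consecutive bands of `band_width` residues.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_bands = (len(seq_a) + band_width - 1) // band_width  # ceiling division
    counts = [0] * n_bands
    for position, (x, y) in enumerate(zip(seq_a, seq_b)):
        if x != y:
            counts[position // band_width] += 1
    return counts


# Toy 10-residue sequences differing at positions 3 and 9.
changes = banded_changes("ACDEFGHIKL", "ACDQFGHIKV", band_width=5)
```

A predictive model along the lines described would then use such per-band counts as features; how they are weighted is not specified in the abstract and is not attempted here.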

Contact: wlees01@mail.cryst.bbk.ac.uk

June 3, 2010

DistanceScan: a tool for promoter modeling

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am

The state of the art in promoter modeling for higher eukaryotes is to predict not single transcription factor binding sites (TFBSs), but their combinations. The new tool uses a previously developed method based on distance distributions of TFBS pairs. We model the random distribution of distances and compare it with the distribution observed in the query sequences. Comparing the two profiles allows the ‘noise’ to be filtered out and the potentially functional combinations to be retained. This approach has proved useful as a filtering technique for selecting TFBS pairs for promoter modeling and is now implemented as a tool in R. As input, it can use the outputs of three different TFBS- and motif-predictive tools (Gibbs Sampler for motifs, Match™ and MEME/FIMO for PWM-based search). The output is a list of predicted pairs at overrepresented distances with assigned scores, P-values and plots showing the distribution of pairs in the input sequences.
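Comparing an observed distance distribution with a random one can be sketched as follows. This is not the DistanceScan method or its R code: the random model here is the simplest possible assumption (two sites placed independently and uniformly in a sequence of fixed length, site widths ignored), and the function names and example positions are hypothetical.

```python
from collections import Counter


def expected_distance_prob(d, length):
    """P(|a - b| = d) for two positions drawn independently and
    uniformly from 0..length-1 (the assumed random model).
    """
    if d == 0:
        return 1.0 / length
    return 2.0 * (length - d) / (length * length)


def distance_enrichment(pairs, length):
    """Ratio of observed to expected frequency for each observed distance.

    pairs: (position of site A, position of site B), one per sequence.
    """
    observed = Counter(abs(a - b) for a, b in pairs)
    n = len(pairs)
    return {
        d: (count / n) / expected_distance_prob(d, length)
        for d, count in sorted(observed.items())
    }


pairs = [(10, 15), (40, 45), (70, 65), (20, 80)]
enrichment = distance_enrichment(pairs, length=100)
```

Distances whose ratio is far above 1 are candidates for functional spacing; a real analysis would need many more pairs and a significance test, which this sketch omits.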

Availability: The tool is available at https://www.omnifung.hki-jena.de/Rpad/Distance_Scan/index.htm

May 7, 2010

Localized motif discovery in gene regulatory sequences

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:30 am

Discovery of nucleotide motifs that are localized with respect to a certain biological landmark is important in several applications, such as in regulatory sequences flanking the transcription start site, in the neighborhood of known transcription factor binding sites, and in transcription factor binding regions discovered by massively parallel sequencing (ChIP-Seq).

Results: We report an algorithm called LocalMotif to discover such localized motifs. The algorithm is based on a novel scoring function, called spatial confinement score, which can determine the exact interval of localization of a motif. This score is combined with other existing scoring measures including over-representation and relative entropy to determine the overall prominence of the motif. The approach successfully discovers biologically relevant motifs and their intervals of localization in scenarios where the motifs cannot be discovered by general motif finding tools. It is especially useful for discovering multiple co-localized motifs in a set of regulatory sequences, such as those identified by ChIP-Seq.
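Of the scoring measures mentioned, relative entropy has a standard definition that can be shown briefly. The sketch below computes it for a set of aligned motif instances against a background distribution; it does not implement the spatial confinement score, whose formula is not given in the abstract, and the uniform background and the name `relative_entropy` are assumptions.

```python
import math
from collections import Counter


def relative_entropy(sites, background=None):
    """Sum over motif positions of sum_b f(b) * log2(f(b) / q(b)),
    where f is the observed base frequency at that position and q
    is the background frequency. Result is in bits.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform
    width = len(sites[0])
    total = 0.0
    for pos in range(width):
        column = Counter(site[pos] for site in sites)
        for base, count in column.items():
            freq = count / len(sites)
            total += freq * math.log2(freq / background[base])
    return total


conserved = relative_entropy(["ACGT", "ACGT", "ACGT", "ACGT"])
uninformative = relative_entropy(["A", "C", "G", "T"])
```

A perfectly conserved position contributes 2 bits under a uniform background, while a position that matches the background contributes nothing, so the score rewards motifs that stand out from background composition.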

Availability and Implementation: The LocalMotif software is available at http://www.comp.nus.edu.sg/bioinfo/LocalMotif
