March 30, 2010

On the beta-binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 12:00 am

Spectral count data generated from label-free tandem mass spectrometry-based proteomic experiments can be used to quantify protein abundances reliably. Comparing spectral count data from different sample groups, such as control and disease, is an essential step in the statistical analysis for determining altered protein levels and discovering biomarkers. Fisher’s exact test, the G-test, the t-test and the local-pooled-error (LPE) technique are commonly used for differential analysis of spectral count data. However, our initial experiments in two cancer studies show that the current methods are unable to declare, at the 95% confidence level, a number of protein markers that have been judged to be differential on the basis of the biology of the disease and the spectral count numbers. A shortcoming of these tests is that they do not account for within- and between-sample variation together. Hence, our aim is to improve upon existing techniques by incorporating both within- and between-sample variation.

We propose to use the beta-binomial distribution to test the significance of differential protein abundances expressed in spectral counts in label-free mass spectrometry-based proteomics. The beta-binomial test naturally normalizes for total sample count. Experimental results show that the beta-binomial test performs favorably in comparison with other methods on several datasets in terms of both true detection rate and false positive rate. In addition, it can be applied for experiments with one or more replicates, and for multiple condition comparisons. Finally, we have implemented a software package for parameter estimation of two beta-binomial models and the associated statistical tests.
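The core of such a test can be sketched in a few lines. Below is a minimal illustration (not the authors' R package) of a beta-binomial likelihood-ratio test for one protein's spectral counts, using SciPy; the counts and per-sample totals are invented for illustration:

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize
from scipy.stats import chi2

def bb_loglik(params, x, n):
    """Beta-binomial log-likelihood for counts x out of totals n."""
    a, b = np.exp(params)  # optimize on the log scale to keep a, b > 0
    return np.sum(gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
                  + betaln(x + a, n - x + b) - betaln(a, b))

def fit_bb(x, n):
    """Maximize the beta-binomial likelihood; return the maximized log-likelihood."""
    res = minimize(lambda p: -bb_loglik(p, x, n), x0=[0.0, 0.0],
                   method="Nelder-Mead")
    return -res.fun

def bb_lrt(x1, n1, x2, n2):
    """Likelihood-ratio test: one common (a, b) vs. group-specific (a, b)."""
    ll_null = fit_bb(np.concatenate([x1, x2]), np.concatenate([n1, n2]))
    ll_alt = fit_bb(x1, n1) + fit_bb(x2, n2)
    stat = 2.0 * (ll_alt - ll_null)
    return chi2.sf(max(stat, 0.0), df=2)  # 2 extra parameters under the alternative

# Spectral counts for one protein; n = total spectral counts per sample,
# which is how the test normalizes for total sample count.
x_ctrl = np.array([5, 7, 6]);   n_ctrl = np.array([1000, 1200, 1100])
x_case = np.array([20, 25, 18]); n_case = np.array([1050, 1150, 1000])
p = bb_lrt(x_ctrl, n_ctrl, x_case, n_case)
```

Dividing each count by its sample total inside the binomial part is what makes the test naturally account for differing total counts across samples.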

Availability: A software package implemented in R is freely available for download at

December 24, 2009

“Omics” Technologies

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 8:42 am

“Ome” and “omics” are suffixes derived from genome (the whole collection of a person’s DNA; the term was coined by Hans Winkler as a combination of “gene” and “chromosome”) and genomics (the study of the genome). Scientists like to append these to any large-scale system (or really, just about anything complex), such as the collection of proteins in a cell or tissue (the proteome), the collection of metabolites (the metabolome), and the collection of RNA transcribed from genes (the transcriptome). High-throughput analysis is essential when considering data at the “omic” level, that is, all DNA sequences, gene expression levels, or proteins at once (or, to be slightly more precise, a significant subset of them). Without the ability to rapidly and accurately measure tens and hundreds of thousands of data points in a short period of time, there is no way to perform analyses at this level.

There are four major types of high-throughput measurements that are commonly performed: genomic SNP analysis (i.e., the large-scale genotyping of single nucleotide polymorphisms), transcriptomic measurements (i.e., the measurement of all gene expression values in a cell or tissue type simultaneously), proteomic measurements (i.e., the identification of all proteins present in a cell or tissue type), and metabolomic measurements (i.e., the identification and quantification of all metabolites present in a cell or tissue type). Each of these four is distinct and offers a different perspective on the processes underlying disease initiation and progression as well as on ways of predicting, preventing, or treating disease.

Genomic SNP genotyping measures a person’s genotypes for several hundred thousand single nucleotide polymorphisms spread throughout the genome. Other assays exist to genotype the ten thousand or so polymorphic sites that are near known genes (under the assumption that these are more likely to have some effect on those genes). The genotyping technology is quite accurate, but the SNPs themselves offer only limited information. These SNPs tend to be quite common (typically at least 5% of the population has at least one copy of the less frequent allele) and are not strictly causal of disease. Rather, SNPs can act in unison with other SNPs and with environmental variables to increase or decrease a person’s risk of a disease. This makes identifying important SNPs difficult: the variation in a trait that can be accounted for by a single SNP is fairly small relative to the total variation in the trait. Even so, because genotypes remain constant throughout life (barring mutations to individual cells), SNPs are potentially among the most useful measurements for predicting risk.
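As a concrete illustration of the single-SNP association testing described above, one can compare allele counts between cases and controls with Fisher's exact test; the counts below are hypothetical:

```python
from scipy.stats import fisher_exact

# Hypothetical allele counts at one SNP:
# rows = cases / controls, columns = minor / major allele
table = [[60, 140],   # cases:    60 minor alleles, 140 major alleles
         [40, 160]]   # controls: 40 minor alleles, 160 major alleles
odds_ratio, p_value = fisher_exact(table)
```

An odds ratio above 1 here would suggest the minor allele is more frequent among cases; in a real genome-wide scan the p-value would need correction for the hundreds of thousands of SNPs tested.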

Transcriptomic measurements (often referred to as gene expression microarrays or “gene chips”) are the oldest and most established of the high-throughput methodologies. The most common are commercially produced “oligonucleotide arrays”, which carry hundreds of thousands of small (25-base) probes, between 11 and 20 per gene. RNA extracted from cells is hybridized to the chip, and the expression levels of ~30,000 different mRNAs can be assessed simultaneously. More so than with SNP genotypes, there is the potential for a significant amount of noise in transcriptomic measurements. The source of the RNA, the preparation and purification methods, and variations in the hybridization and scanning process can all lead to differences in measured expression levels; statistical methods to normalize, quantify, and analyze these measures have been one of the hottest areas of research in the last five years. Gene expression levels influence traits more directly than SNPs do, and so significant associations are easier to detect. While transcriptomic measures are not as useful for pre-disease prediction (a person’s gene expression levels far in advance of disease initiation are unlikely to be informative, because they can change so significantly), they are very well suited for either early identification of a disease (i.e., finding people who have gene expression levels characteristic of a disease but who have not yet manifested other symptoms) or classifying patients with a disease into subgroups (by identifying gene expression levels associated with better or worse outcomes, or with higher or lower values of some disease phenotype).
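As one example of the normalization methods mentioned, quantile normalization forces every array to share the same empirical intensity distribution. A minimal sketch, using a toy matrix with genes in rows and arrays in columns:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of X (genes x arrays): replace each
    value by the mean, across arrays, of the values at its rank."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # rank of each gene within its array
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)    # mean intensity at each rank
    return mean_quantiles[ranks]                        # map ranks back to shared values

# Toy expression matrix: 4 genes measured on 3 arrays
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
```

After normalization every column contains exactly the same set of values, so between-array intensity differences (e.g. from scanning variation) no longer masquerade as expression differences.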

Proteomics is similar in character to transcriptomics; the most significant difference lies in the measurements. Unlike transcriptomics, where all gene expression levels are assessed simultaneously, protein identification is done in a rapid serial fashion. After a sample has been prepared, the proteins are separated using chromatography, two-dimensional protein gels (which separate proteins based on charge and then size) or one-dimensional protein gels (which separate based on size alone); digested, typically with trypsin (which cuts proteins after each arginine and lysine); and then run through mass spectrometry. The mass spectrometer measures the mass of each of the peptides, and the proteins can be identified by comparing the observed peptide masses with theoretical digests of all known proteins in a database. This search is the key to the technology, and a number of algorithms, both commercial and open-source, have been created for it. Unlike transcriptomic measures, the overall quantity of a protein cannot be assessed, just its presence or absence. Like transcriptomic measures, though, proteomic measures are excellent for early identification of disease or for classifying people into subgroups.
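The database-matching step can be illustrated with a toy in-silico digest: cut a sequence after each lysine (K) and arginine (R) and compute peptide masses to compare against the observed ones. The not-before-proline exception and the monoisotopic masses below are standard details added here, not spelled out in the text above:

```python
# Monoisotopic residue masses (Da); one water is added per peptide
AA_MASS = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
           'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
           'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
           'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
           'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
WATER = 18.01056

def tryptic_digest(seq):
    """Cut after each K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in 'KR' and (i + 1 == len(seq) or seq[i + 1] != 'P'):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(pep):
    """Neutral monoisotopic peptide mass: residue masses plus one water."""
    return sum(AA_MASS[a] for a in pep) + WATER

peps = tryptic_digest("MKWVTFISLLR")   # toy sequence
masses = [peptide_mass(p) for p in peps]
```

A search engine then compares such theoretical masses (and fragment spectra) against the measured ones for every protein in the database.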

Last up is metabolomics, the high-throughput measurement of the metabolites present in a cell or tissue. As with proteomics, the metabolites are measured in a very fast serial process. NMR is typically used both to identify and to quantify metabolites. This technology is newer and less frequently used than the others, but similar caveats apply. Metabolite measurements are dynamic, as are gene expression levels and proteins, and so are best suited for either early disease detection or disease subclass identification.

So, that was a brief introduction to all the “omics”. Details on each will follow in my next posts.

Till then, Happy Xmas Season!!

December 15, 2009

Descriptor-based Fold Recognition System

Filed under: Bioinformatics,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 10:43 am

Machine learning-based methods have proven powerful for developing new fold recognition tools.
DescFold (Descriptor-based Fold Recognition System) is a web server for protein fold recognition, which can predict a protein’s fold type from its amino acid sequence. The server combines six effective descriptors: a profile–sequence alignment descriptor using PSI-BLAST e-values and bit scores, a sequence–profile alignment descriptor using RPS-BLAST e-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), a descriptor based on the occurrence of PROSITE functional motifs, a descriptor based on profile–profile alignment (PPA), and a descriptor based on profile–structural-profile alignment (PSPA).

When the PPA and PSPA descriptors were introduced, the new DescFold boosted fold recognition performance substantially. Using the SCOP 1.73 40% dataset as the fold library, the DescFold web server was then built on the trained SVM models. To provide a large-scale test for the new DescFold, a stringent test set of 1,866 proteins was selected from SCOP version 1.75. At a false positive rate below 5%, the new DescFold correctly recognizes structural homologs at the fold level for nearly 46% of the test proteins. Additionally, we benchmarked the DescFold method against several well-established fold recognition algorithms on the LiveBench targets and the Lindahl dataset.
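The way such descriptors feed an SVM can be sketched as follows; the six-dimensional feature vectors and labels below are synthetic stand-ins, not DescFold's actual training data or model:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical training data: each row holds six descriptor scores
# (PSI-BLAST, RPS-BLAST, SSEA, PROSITE, PPA, PSPA) for a query-template pair;
# label 1 = same fold, 0 = different fold.
X_pos = rng.normal(1.0, 0.5, size=(50, 6))    # same-fold pairs score high
X_neg = rng.normal(-1.0, 0.5, size=(50, 6))   # different-fold pairs score low
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [0] * 50)

# Train an RBF-kernel SVM on the combined descriptors
clf = SVC(kernel="rbf").fit(X, y)

# Classify five new query-template pairs (drawn here from the positive class)
preds = clf.predict(rng.normal(1.0, 0.5, size=(5, 6)))
```

The point of combining descriptors this way is that the SVM learns how to weight weak, complementary similarity signals that no single descriptor captures alone.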

The DescFold server is freely available at:


November 30, 2009

QuPE – For Mass Spectrometry-based Quantitative Proteomics Research

Filed under: Bioinformatics,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 4:22 am

Mass spectrometry (MS) is an indispensable technique for the fast analysis of proteins and peptides in complex biological samples. One key problem with quantitative mass spectrometric analysis of peptides and proteins, however, is that the sensitivity of MS instruments is peptide-dependent, leading to an unclear relationship between the observed peak intensity and the peptide concentration in the sample. Various labeling techniques have been developed to circumvent this problem, but they are expensive and time-consuming. A reliable prediction of peptide-specific sensitivities could provide a peptide-specific correction factor, which would be valuable for label-free absolute quantitation.
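The correction-factor idea can be illustrated in a few lines; the peptide sequences, intensities and predicted sensitivities below are invented numbers for illustration, not QuPE output:

```python
# Hypothetical observed peak intensities and predicted peptide-specific
# sensitivities (instrument response per unit concentration).
intensities = {"LVNEVTEFAK": 4.2e6, "AEFVEVTK": 9.8e5, "QTALVELVK": 2.1e6}
sensitivity = {"LVNEVTEFAK": 2.0, "AEFVEVTK": 0.5, "QTALVELVK": 1.0}

# Corrected abundance estimate: divide each intensity by the predicted
# sensitivity, so peptides that "fly" poorly are scaled back up.
corrected = {pep: i / sensitivity[pep] for pep, i in intensities.items()}

# Protein-level estimate: average the corrected peptide abundances
protein_abundance = sum(corrected.values()) / len(corrected)
```

Without the correction, the three peptides of this one protein would suggest abundances differing four-fold; with it, they agree closely, which is exactly what makes label-free absolute quantitation feasible.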

QuPE is an integrated platform for the storage and analysis of quantitative proteomics data, implemented in Java. It is both a repository and an algorithmic framework to store and analyse mass spectrometry-based quantitative proteome experiments. QuPE provides an easily extensible and configurable job concept. Using XML, jobs consisting of one or more tools can be defined, where the input and output types provided by a tool's implementation determine the data a job is executed with. Through specific interfaces, tools can announce their need for interactive configuration. The job and tool concept also allows the integration of routines written in R, a programming language specifically designed for mathematical and statistical purposes. The main features of QuPE are:

– Web browser-based application using Web 2.0 technologies
– Extensive capabilities to securely store and organise experiments and complete projects (fine-grained application-based security, GPMS)
– Import of mzData as well as mzXML
– Data model adapted to suggestions made by the HUPO Proteomics Standards Initiative (PSI)
– Mascot integration, import of DTASelect results
– Framework supporting analysis of quantitative proteomics data, including:
  – Quantification of stable-isotope-labelled samples
  – Significance tests, analysis of variance
  – Principal component analysis

QuPE is hosted here:


November 9, 2009

CDD: Database for Interactive Domain Family Analysis

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 8:30 am

Protein domains may be viewed as units in the molecular evolution of proteins and can be organized into an evolutionary classification. The set of protein domains characterized so far appears to describe no more than a few thousand superfamilies, where the members of each superfamily are related to each other by common descent. Computational annotation of protein function is generally obtained via sequence similarity: once a close neighbor with known function has been identified, its annotation is copied to the sequence with unknown function. This strategy may work very well in functionally homogeneous families and when applied only to very close neighbors or suspected orthologs, but it often fails when domain or protein families are sufficiently diverse and no close neighbors with known function are available.

NCBI’s Conserved Domain Database (CDD) attempts to collate that set and to organize related domain models hierarchically, in a way meant to reflect major ancient gene duplication events and subsequent functional diversification. CDD is part of NCBI’s Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. CDD provides a strategy toward a more accurate assessment of such neighbor relationships, similar to approaches termed ‘phylogenomic inference’. CDD acknowledges that protein domain families may be very diverse and may contain sets of related subfamilies.

In CDD curation, we attempt to detect evidence for duplication and functional divergence in domain families by means of phylogenetic analysis. We record the resulting subfamily structure as a set of explicit models, but limit the analysis to ancient duplication events (several hundred million years in the past, as judged by the taxonomic distribution of protein sequences bearing particular domain subfamily footprints). CDD provides a search tool employing reverse position-specific BLAST (RPS-BLAST), in which query sequences are compared to databases of position-specific score matrices (PSSMs), and E-values are obtained in much the same way as in the widely used PSI-BLAST application.
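The core of comparing a query sequence against a PSSM can be sketched with a toy ungapped scan; the three-letter alphabet and matrix values below are made up for illustration, and a real RPS-BLAST search would also convert the raw score to an E-value using Karlin-Altschul statistics:

```python
# Toy position-specific score matrix for a 4-column domain model:
# one row per model column, one log-odds score per residue (hypothetical values).
PSSM = [
    {"A": 4, "C": -2, "D": -1},
    {"A": -1, "C": 5, "D": -2},
    {"A": 0, "C": -1, "D": 6},
    {"A": 3, "C": -2, "D": 0},
]

def best_window_score(query, pssm):
    """Ungapped scan: slide the PSSM along the query and return the
    highest-scoring (score, offset) pair."""
    L = len(pssm)
    best = (float("-inf"), -1)
    for i in range(len(query) - L + 1):
        s = sum(col[query[i + j]] for j, col in enumerate(pssm))
        best = max(best, (s, i))
    return best

score, offset = best_window_score("DACDAACD", PSSM)
```

Because the score for a residue depends on the model column it aligns to, a PSSM captures the position-specific conservation of a domain family in a way a plain substitution matrix cannot.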

CDD is hosted here:


October 12, 2009

MISTRAL: For Multiple Protein Structure Alignment

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 9:50 am

With a rapidly growing pool of known tertiary structures, the importance of protein structure comparison parallels that of sequence alignment. Detecting structural equivalences in two or more proteins is computationally demanding, as it typically entails exploring the combinatorial space of all possible amino acid pairings between the parent proteins.

A new tool, MISTRAL, has been developed for multiple protein structure alignment, based on the minimization of an energy function over the low-dimensional space of the relative rotations and translations of the molecules.
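MISTRAL's own energy function is not reproduced here, but the classic pairwise analogue of searching over rotations and translations is least-squares rigid-body superposition via the Kabsch algorithm; a sketch with a toy C-alpha trace:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Optimal rigid-body superposition of point set P onto Q (both N x 3):
    center both sets, find the rotation minimizing RMSD via SVD (Kabsch)."""
    Pc = P - P.mean(axis=0)                    # remove translation
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)        # SVD of the covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))         # guard against an improper rotation
    D = np.diag([1.0, 1.0, d])
    R = (U @ D @ Vt).T                         # optimal rotation
    diff = Pc @ R.T - Qc
    return np.sqrt((diff ** 2).sum() / len(P))

# A toy C-alpha trace and a rotated + translated copy of it
P = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0],
              [1.5, 1.5, 0.0], [3.0, 1.5, 1.0]])
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])
rmsd = kabsch_rmsd(P, Q)                       # near zero: Q is a rigid copy of P
```

For pairwise superposition with known residue correspondences this optimum is found in closed form; the hard part MISTRAL addresses is doing it for multiple structures at once, without fixed correspondences.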

An alignment of up to 20 structures in PDB format can be submitted at a time, with the length of each protein limited to 500 amino acids. MISTRAL is available both as a standalone version and online here:

September 17, 2009

ADAN: A Database for Prediction of Protein Protein Interactions

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 12:51 pm

In the last post we gave an introduction to MIPS (the Mammalian Protein-Protein Interaction Database). Most of the structures and functions of proteome globular domains are still unknown. We can use high-resolution structures of different modular domains, in combination with automatic protein design algorithms, to predict genome-wide potential interactions of a protein. Today's post introduces a database which helps in the prediction of such protein interactions.

The ADAN database is a collection of different modular protein domains (SH2, SH3, PDZ, WW, etc.). It contains 3,505 entries with extensive structural and functional information, manually integrated, curated and annotated with cross-references to other databases, biochemical and thermodynamic data, simplified coordinate files, sequence files and alignments. Prediadan, a subset of the ADAN database, offers position-specific scoring matrices for protein–protein interactions, calculated by FoldX, and predictions of optimum ligands and putative binding partners. Users can also scan a query sequence against selected matrices, or improve a ligand–domain interaction. The ADAN database can be accessed here:

September 16, 2009

Database for Protein Protein Interactions

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 1:28 pm


Proteins are organic compounds made of amino acids arranged in a linear chain and folded into a globular form. These molecules are of great importance because of the functions they perform.

Protein associations are studied from the perspectives of biochemistry, quantum chemistry, molecular dynamics, signal transduction, and other metabolic or genetic/epigenetic networks. Protein-protein interactions are at the core of the entire interactomics system of any living cell. These interactions involve not only the direct-contact association of protein molecules but also longer-range interactions through the electrolyte, the aqueous solution medium surrounding neighboring hydrated proteins, over distances from less than one nanometer to several tens of nanometers. Furthermore, such protein-protein interactions are thermodynamically linked functions of dynamically bound ions and water, which exchange rapidly with the surrounding solution compared with the molecular tumbling rate (or correlation times) of the interacting proteins.

The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated, high-quality protein-protein interaction data collected from the scientific literature by expert curators. The content is based on published experimental evidence that has been processed by human expert curators. MIPS provides the full dataset for download, as well as a flexible and powerful web interface for users with various requirements.

Click here to access MIPS:

September 14, 2009

Protein Mutant Database: An Introduction

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 7:41 am

Protein structure is one of the most important and popular research topics today. Research on protein structure, sequence and organization gives a broad view of a protein's functionality.

Compilations of protein mutant data are valuable as a basis for protein engineering. They provide information on what kinds of functional and/or structural influences are brought about by amino acid mutation at a specific position of a protein. The Protein Mutant Database (PMD), which is under construction, covers natural as well as artificial mutants, including random and site-directed ones, for all proteins except members of the globin and immunoglobulin families. The PMD is based on literature, not on proteins; that is, each entry in the database corresponds to one article, which may describe one, several or many protein mutants.

Click here to learn more about PMD:

August 3, 2009

Proteomics: Challenges and Approaches

Filed under: Bioinformatics,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 9:05 am

Proteomics is the study of the function of all expressed proteins. The term proteome was first coined to describe the set of proteins encoded by the genome. The study of the proteome, called proteomics, now evokes not only all the proteins in any given cell, but also the set of all protein isoforms and modifications, the interactions between them, the structural description of proteins and their higher-order complexes, and, for that matter, almost everything ‘post-genomic’. In this overview we will use proteomics in an overall sense to mean protein biochemistry on an unprecedented, high-throughput scale.

Proteomics complements other functional genomics approaches, including microarray-based expression profiling, systematic phenotypic profiling at the cell and organism level, systematic genetics and small-molecule-based arrays. Integration of these datasets through bioinformatics will yield a comprehensive database of gene function that will serve as a powerful reference of protein properties and functions, and as a useful tool for the individual researcher to both build and test hypotheses. Moreover, these large-scale datasets will be of utmost importance for the emerging field of systems biology.

Platforms for Proteomics 

Challenges and Approaches in Proteomics

Proteomics would not be possible without the previous achievements of genomics, which provided the ‘blueprint’ of possible gene products that are the focal point of proteomics studies. Some of the recent approaches used in the field of proteomics are:

1. Mass spectrometry-based proteomics

2. Array Based Proteomics

3. Structural Proteomics

4. Proteome informatics

5. Clinical Proteomics

To read more on this, check out: