December 24, 2009

“Omics” Technologies

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 8:42 am

“Ome” and “omics” are suffixes derived from genome (the whole collection of a person’s DNA, a term coined by Hans Winkler as a combination of “gene” and “chromosome”1) and genomics (the study of the genome). Scientists like to append these suffixes to any large-scale system (or really, just about anything complex), such as the collection of proteins in a cell or tissue (the proteome), the collection of metabolites (the metabolome), and the collection of RNA that has been transcribed from genes (the transcriptome). High-throughput analysis is essential when considering data at the “omic” level, that is to say, considering all DNA sequences, gene expression levels, or proteins at once (or, to be slightly more precise, a significant subset of them). Without the ability to rapidly and accurately measure tens or hundreds of thousands of data points in a short period of time, there is no way to perform analyses at this level.

There are four major types of high-throughput measurements that are commonly performed: genomic SNP analysis (i.e., the large-scale genotyping of single nucleotide polymorphisms), transcriptomic measurements (i.e., the measurement of all gene expression values in a cell or tissue type simultaneously), proteomic measurements (i.e., the identification of all proteins present in a cell or tissue type), and metabolomic measurements (i.e., the identification and quantification of all metabolites present in a cell or tissue type). Each of these four is distinct and offers a different perspective on the processes underlying disease initiation and progression as well as on ways of predicting, preventing, or treating disease.

Genomic SNP genotyping measures a person’s genotypes for several hundred thousand single nucleotide polymorphisms spread throughout the genome. Other assays exist to genotype ten thousand or so polymorphic sites that are near known genes (under the assumption that these are more likely to have some effect on those genes). The genotyping technology is quite accurate, but the SNPs themselves offer only limited information. These SNPs tend to be quite common (with typically at least 5% of the population having at least one copy of the less frequent allele) and are not strictly causal of disease. Rather, SNPs can act in unison with other SNPs and with environmental variables to increase or decrease a person’s risk of a disease. This makes identifying important SNPs difficult; the variation in a trait that can be accounted for by a single SNP is fairly small relative to the total variation in the trait. Even so, because genotypes remain constant throughout life (barring mutations to individual cells), SNPs are potentially among the most useful measurements for predicting risk.
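To make the idea of a single-SNP association concrete, here is a minimal sketch of the textbook allele-count chi-square test; this is my own illustration of the general statistic, not the output of any particular genotyping platform.

```python
def allelic_chi2(case_risk, case_other, ctrl_risk, ctrl_other):
    """Chi-square statistic for a 2x2 allele-count table
    (rows: cases/controls; columns: risk allele / other allele)."""
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    n = a + b + c + d
    # shortcut formula for a 2x2 contingency table (1 degree of freedom)
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# 100 case and 100 control chromosomes; risk allele seen on 60% of case chromosomes
print(allelic_chi2(60, 40, 40, 60))  # -> 8.0, well above the 3.84 cutoff for p < 0.05
```

A statistic this large for a single SNP is the exception rather than the rule, which is exactly why genome-wide scans need stringent multiple-testing corrections.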

Transcriptomic measurements (often referred to as gene expression microarrays or “gene chips”) are the oldest and most established of the high-throughput methodologies. The most common are commercially produced “oligonucleotide arrays”, which have hundreds of thousands of small (25-base) probes, between 11 and 20 per gene. RNA that has been extracted from cells is then hybridized to the chip, and the expression levels of ~30,000 different mRNAs can be assessed simultaneously. More so than with SNP genotypes, there is the potential for a significant amount of noise in transcriptomic measurements. The source of the RNA, the preparation and purification methods, and variations in the hybridization and scanning process can all lead to differences in measured expression levels; statistical methods to normalize, quantify, and analyze these measures have been one of the hottest areas of research in the last five years. Gene expression levels influence traits more directly than SNPs, and so significant associations are easier to detect. While transcriptomic measures are not as useful for pre-disease prediction (because gene expression levels measured far in advance of disease initiation are unlikely to be informative, given how significantly they can change), they are very well-suited for either early identification of a disease (i.e., finding people who have gene expression levels characteristic of a disease but who have not yet manifested other symptoms) or classifying patients with a disease into subgroups (by identifying gene expression levels that are associated with either better or worse outcomes or with higher or lower values of some disease phenotype).
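As an illustration of the normalization work mentioned above, here is a minimal pure-Python sketch of quantile normalization, one common approach for making arrays comparable; it assumes no tied values and is not the code of any specific microarray package.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length expression vectors:
    every sample is forced onto the same empirical distribution,
    namely the mean of the sorted values across samples."""
    n = len(samples[0])
    # reference distribution: mean of the i-th smallest value across samples
    ref = [sum(sorted(s)[i] for s in samples) / len(samples) for i in range(n)]
    normalized = []
    for s in samples:
        # indices of this sample's values, smallest first (assumes no ties)
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = ref[rank]
        normalized.append(out)
    return normalized

arrays = [[2.0, 6.0, 4.0], [1.0, 3.0, 5.0]]
print(quantile_normalize(arrays))  # both samples now share the distribution [1.5, 3.5, 5.5]
```

After normalization, a gene's rank within its own array is preserved while array-to-array intensity differences are removed, which is the point of the procedure.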

Proteomics is similar in character to transcriptomics. The most significant difference is in how the measurements are made. Unlike transcriptomics, where all gene expression levels are assessed simultaneously, protein identification is done in a rapid serial fashion. After a sample has been prepared, the proteins are separated using chromatography, two-dimensional protein gels (which separate proteins based on charge and then size), or one-dimensional protein gels (which separate based on size alone); digested, typically with trypsin (which cuts proteins after each arginine and lysine); and then run through mass spectrometry. The mass spectrometer identifies the size of each of the peptides, and the proteins can be identified by comparing the sizes of the peptides created with the theoretical digests of all known proteins in a database. This searching is the key to the technology, and a number of algorithms, both commercial and open-source, have been created for it. Unlike transcriptomic measures, the overall quantity of a protein cannot be assessed, just its presence or absence. Like transcriptomic measures, though, proteomic measures are excellent for early identification of disease or classifying people into subgroups.
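The digest-and-match idea can be sketched in a few lines. Below is a hypothetical in-silico trypsin digest using the classic cleavage rule (cut after arginine or lysine unless the next residue is proline); in a real pipeline, the masses of the resulting peptides would then be matched against theoretical digests of a protein database.

```python
import re

def tryptic_digest(sequence):
    """In-silico trypsin digest: cleave after K or R unless the next
    residue is P (the classic cleavage rule)."""
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in peptides if p]  # drop the empty trailing piece

print(tryptic_digest("MKTAYIAKQR"))  # -> ['MK', 'TAYIAK', 'QR']
print(tryptic_digest("AKPLR"))       # -> ['AKPLR'] (KP blocks cleavage)
```

The toy sequences here are invented for the example; real digests also have to model missed cleavages, which multiplies the number of candidate peptides.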

Last up is metabolomics, the high-throughput measurement of the metabolites present in a cell or tissue. As with proteomics, the metabolites are measured in a very fast serial process. NMR is typically used to both identify and quantify metabolites. This technology is newer and less frequently used than the others, but similar caveats apply. Metabolite levels are dynamic, as are gene expression levels and proteins, and so are best suited for either early disease detection or disease subclass identification.

So, that was a brief introduction to all the “omics”. I will include details on each in my next posts.

Till then Happy Xmas Season !!

December 15, 2009

Descriptor-based Fold Recognition System

Filed under: Bioinformatics,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 10:43 am

Machine learning-based methods have proven powerful for developing new fold recognition tools.
DescFold (Descriptor-based Fold Recognition System) is a web server for protein fold recognition, which can predict a protein’s fold type from its amino acid sequence. The server combines six effective descriptors: a profile-sequence-alignment-based descriptor using PSI-BLAST e-values and bit scores, a sequence-profile-alignment-based descriptor using RPS-BLAST e-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), a descriptor based on the occurrence of PROSITE functional motifs, a descriptor based on profile-profile alignment (PPA), and a descriptor based on profile-structural-profile alignment (PSPA).

When the PPA and PSPA descriptors were introduced, the new DescFold boosted fold recognition performance substantially. Using the SCOP 1.73 40% dataset as the fold library, the DescFold web server was then constructed from the trained SVM models. To provide a large-scale test of the new DescFold, a stringent test set of 1,866 proteins was selected from SCOP version 1.75. With the false positive rate controlled below 5%, the new DescFold correctly recognizes structural homologs at the fold level for nearly 46% of the test proteins. Additionally, the authors benchmarked the DescFold method against several well-established fold recognition algorithms using the LiveBench targets and the Lindahl dataset.
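To make the “46% at a less than 5% false positive rate” style of figure concrete, here is a hedged sketch of how such a number can be computed from raw decision scores; this illustrates the general procedure, not the DescFold evaluation code, and the scores are invented.

```python
def recall_at_fpr(pos_scores, neg_scores, max_fpr=0.05):
    """Fraction of true (positive) pairs recognized when the score
    threshold is set so that at most max_fpr of negatives pass."""
    neg_sorted = sorted(neg_scores, reverse=True)
    allowed = int(len(neg_sorted) * max_fpr)  # negatives permitted above threshold
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else float("-inf")
    # count positives scoring strictly above the chosen threshold
    return sum(s > threshold for s in pos_scores) / len(pos_scores)

negatives = list(range(100))   # hypothetical decoy (wrong-fold) scores 0..99
positives = [90, 95, 96, 97]   # hypothetical true-fold scores
print(recall_at_fpr(positives, negatives))  # -> 0.75
```

Controlling the false positive rate first and only then reading off the recognition rate is what makes figures from different methods comparable.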

The DescFold server is freely available at:


December 14, 2009

Applications of Systems Biology in Drug Discovery

Filed under: Bioinformatics,Chemoinformatics,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 4:33 am

To date we have made a lot of posts on Systems Biology, its applications, and its scope. Indeed, Systems Biology has brought a big revolution in cell biology and pathway analysis. It proves even more handy when seen in combination with the treatment of diseases and drug discovery. Here we discuss Systems Biology in combination with drug discovery.

The goal of modern systems biology is to understand physiology and disease from the level of molecular pathways, regulatory networks, cells, tissues, organs and ultimately the whole organism. As currently employed, the term ‘systems biology’ encompasses many different approaches and models for probing and understanding biological complexity, and studies of many organisms from bacteria to man. Much of the academic focus is on developing the fundamental computational and informatics tools required to integrate large amounts of reductionist data (global gene expression, proteomic and metabolomic data) into models of regulatory networks and cell behavior. This is a formidable task, because biological complexity is an exponential function of the number of system components and the interactions between them, and escalates at each additional level of organization.

There are basically three advances in the practical applications of systems biology to drug discovery. These are:

1. Informatic integration of ‘omics’ data sets (a bottom-up approach)

Omics approaches to systems biology focus on the building blocks of complex systems (genes, proteins and metabolites). These approaches have been adopted wholeheartedly by the drug industry to complement traditional approaches to target identification and validation, for generating hypotheses and for experimental analysis in traditional hypothesis-based methods.

2. Computer modeling of disease or organ-system physiology from cell- and organ-level response information available in the literature (a top-down approach to target selection, clinical indication and clinical trial design).
The goal of modeling in systems biology is to provide a framework for hypothesis generation and prediction based on in silico simulation of human disease biology across the multiple distance and time scales of an organism. More detailed understanding of the systems behavior of intercellular signaling pathways, such as the identification of key nodes or regulatory points in networks or better understanding of crosstalk between pathways, can also help predict drug target effects and their translation to organ and organism level physiology.
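As a toy illustration of “key node” identification, the sketch below ranks the nodes of a hypothetical signaling network by degree; real analyses use curated pathway databases and richer centrality measures, and the edge list here is invented purely for the example.

```python
def hub_nodes(edges, top=2):
    """Rank nodes of an undirected interaction network by degree
    (number of partners), breaking ties alphabetically."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sorted(degree, key=lambda n: (-degree[n], n))[:top]

# hypothetical pathway edges; 'RAF' is wired as the hub in this toy network
pathway = [("EGFR", "RAF"), ("RAF", "MEK"), ("MEK", "ERK"),
           ("RAF", "AKT"), ("RAF", "ERK")]
print(hub_nodes(pathway))  # -> ['RAF', 'ERK']
```

Highly connected nodes like these are candidate regulatory points: perturbing them is predicted to propagate widely, which cuts both ways for drug targeting (efficacy versus side effects).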

3.  The use of complex human cell systems themselves to interpret and predict the biological activities of drugs and gene targets (a direct experimental approach to cataloguing complex disease-relevant biological responses).

Pathway modeling as yet remains too disconnected from systemic disease biology to have a significant impact on drug discovery. Top-down modeling at the cell-to-organ and organism scale shows promise, but is extremely dependent on contextual cell response data. Moreover, to bridge the gap between omics and modeling, we need to collect a different type of cell biology data—data that incorporate the complexity and emergent properties of cell regulatory systems and yet ideally are reproducible and amenable to storing in databases, sharing and quantitative analysis.

This is how Systems Biology has aided drug discovery research and paved the way toward treating many serious diseases.

Read our other posts on Systems Biology –

December 8, 2009

Bioinformatics Education—Perspectives and Challenges

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 4:59 am

Here is another post, similar to the one I posted yesterday. This article mainly talks about bioinformatics education and the challenges involved in it.
Read it below:

Education in bioinformatics has undergone a sea change, from informal workshops and training courses to structured certificate, diploma, and degree programs, spanning casual self-enrichment courses all the way to doctoral programs. The evolution of curricula, instructional methodologies, and initiatives supporting the dissemination of bioinformatics is presented here.

Building on the early applications of informatics (computer science) to the field of biology, bioinformatics research draws on the diverse disciplines of mathematics and statistics, physics and chemistry, and medicine and pharmacology. Providing education in bioinformatics is challenging from this multidisciplinary perspective, and involves short- and long-term efforts directed at casual and dedicated learners in academic and industrial environments.

Training remains the oldest and most important rapid-induction approach to learning bioinformatics skills. Both formal (short-term courses) and informal training (on-demand “how-to” procedures) have remained the mainstay of on-the-job programs. This reminds me to post a link to a site that provides bioinformatics online training. Do visit it !!

After almost a decade of short-term training and retraining of students, faculty, and scientists in discrete aspects of bioinformatics, the impetus to formalize bioinformatics education came in 1998, with a wish list of topics for an ideal bioinformatics educational program at the master's and PhD levels. Given the multidisciplinary nature of bioinformatics and the need to design cross-faculty courses, by 2001 only a handful of universities had successfully commenced formal education in bioinformatics, with others waiting and watching.

This wait-and-watch attitude steadily faded, and a number of courses, including professional and corporate courses in Bioinformatics, were introduced, helping to enhance knowledge in Bioinformatics Research.

Hope this field grows further and becomes a first-choice career for research scholars !!


December 7, 2009

Career in Bioinformatics

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 6:55 am

For many students, bioinformatics is still a puzzle. What comes before bioinformatics, what is bioinformatics, and what comes after bioinformatics? These are some of the most common questions people want answered. While browsing through the latest articles on PubMed Central, a paper authored by Shoba Ranganathan caught our attention, titled “Towards a career in bioinformatics“. Wide-eyed, I started reading the article and no doubt found it informative, interesting and useful. Below is a small summary of the article.

Science is itself a quest for truth and honesty in scientific endeavours is the keystone to a successful career. Scientific integrity in presenting research results and honesty in dealing with colleagues are invaluable to a scientific career, especially one that deals with large datasets. In this context, acknowledging the prior work of other scientists is important.

Domain knowledge is the key to a successful career in bioinformatics. “Computational biology” is not merely the sum of its parts, viz. computer science/informatics and biology. It also requires knowledge of mathematics, statistics and biochemistry, and sometimes a nodding acquaintance with physics, chemistry and the medical sciences. A career in bioinformatics requires problem solving. Here, you need to show persistence in following your hypothesis, even if others think that you are wrong. At the same time, be prepared to modify your hypothesis if the data suggest otherwise. Reaching your ultimate goal is of principal importance, no matter which path you follow.

Many graduate students simply see their bioinformatics Ph.D. as the goal in itself. For a career, you must make plans for the next year, the next three years, and maybe even the next five years. Graduate school, your first job, your next job, and your publication profile can all be planned as projects using project management tools. Without plans, you are adrift on the internet, without a specific search in mind.

Among the numerous areas of bioinformatics endeavour, traditional avenues such as sequence analysis, genetic and population analysis, structural bioinformatics, text mining and ontologies are represented in this supplement, while chemoinformatics and biodiversity informatics embody emerging bioinformatics themes. In order to carry out bioinformatics research, innovative teaching is a prerequisite. Improvement in bioinformatics learning is evident from the case study using e-learning tools.

This paper covers many areas of bioinformatics which might prove useful for graduates and post graduates. Here is the link to the full article:

Have a promising career in Bioinformatics !!

December 2, 2009

PRGdb: The Plant Resistance Genes Database

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 4:26 am

Plant disease resistance genes (R-genes) play a key role in recognizing proteins expressed by specific avirulence (Avr) genes of pathogens. R-genes originate from a phylogenetically ancient form of immunity that is common to plants and animals. However, the rapid evolution of plant immunity systems has led to enormous gene diversification. Although little is known about these agriculturally important genes, some fundamental genomic features have already been described. It has been recently shown that proteins encoded by resistance genes display modular domain structures and require several dynamic interactions between specific domains to perform their function. Some of these domains also seem necessary for proper interaction with Avr proteins and in the formation of signalling complexes that activate an innate immune response which arrests the proliferation of the invading pathogen.

PRGdb is a web-accessible, open-source database that represents the first bioinformatic resource providing a comprehensive overview of resistance genes (R-genes) in plants. PRGdb holds more than 16,000 known and putative R-genes belonging to 192 plant species challenged by 115 different pathogens, linked with useful biological information. The complete database includes a set of 73 manually curated reference R-genes, 6,308 putative R-genes collected from NCBI and 10,463 computationally predicted putative R-genes.

The Plant Resistance Genes (PRG) database is intended to serve as a research tool to identify and study genes involved in the disease resistance process in all plants. Data from a variety of on-line resources and literature are stored in several sections to create a unified knowledge resource with emphasis on R-gene characterization and classification. The database is designed to allow easy integration with other data types and with existing and future databases. For each cloned R-gene (reference gene), a fine-grained locus annotation is provided, also reporting homologous sequences and related disease sequences. Moreover, cross-links with pathogen and disease information are built, to give a complete view of the plant-gene-pathogen interaction system.

PRGdb is freely available here:

December 1, 2009

Machine Learning in Bioinformatics: A Review

Filed under: Bioinformatics,Computational Biology,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 12:12 pm

Due to continued research, the amount of biological data available is growing exponentially. This growth raises two problems:

1. Efficient storage and management of these data.

2. The development of tools and methods capable of transforming all these heterogeneous data into biological knowledge about the underlying mechanisms.

There are various biological domains where machine learning techniques are applied for knowledge extraction from data. The figure below shows the main areas of biology, such as genomics, proteomics, microarrays, evolution and text mining, where computational methods are being applied.


In addition to all the above applications, computational techniques are used to solve other problems, such as efficient primer design for PCR, biological image analysis and backtranslation of proteins (which, given the degeneracy of the genetic code, is a complex combinatorial problem). Machine learning consists of programming computers to optimize a performance criterion using example data or past experience. The optimized criterion can be the accuracy provided by a predictive model (in a modelling problem) or the value of a fitness or evaluation function (in an optimization problem). Machine learning uses statistical theory when building computational models, since the objective is to make inferences from a sample. The two main steps in this process are:

1. To induce the model by processing the huge amount of data.

2. To represent the model and make inferences efficiently.
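As an aside, the backtranslation problem mentioned earlier is combinatorial precisely because of codon degeneracy; the short sketch below, using the codon counts of the standard genetic code, shows how quickly the number of candidate DNA sequences grows.

```python
# number of codons per amino acid in the standard genetic code
CODON_COUNTS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}

def backtranslation_count(peptide):
    """Number of distinct DNA sequences that encode the given peptide."""
    total = 1
    for residue in peptide:
        total *= CODON_COUNTS[residue]
    return total

print(backtranslation_count("MW"))   # -> 1 (Met and Trp each have one codon)
print(backtranslation_count("LSR"))  # -> 216 (6 * 6 * 6)
```

For a typical 300-residue protein the count is astronomical, which is why backtranslation has to be treated as an optimization problem rather than an enumeration.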

The process of transforming data into knowledge is both iterative and interactive. The iterative phase consists of several steps:

1. Integrate and merge the different sources of information into a single format. Data warehouse techniques help detect and resolve outliers and inconsistencies.

2. Select, clean and transform the data. This means eliminating or correcting bad records and deciding on a strategy for imputing missing data. This step also selects the relevant and non-redundant variables; the selection can likewise be applied to the instances.

3. Data mining proper: taking the objectives of the study into account, choose the most appropriate analysis for the data. Here the paradigm for supervised or unsupervised classification is selected, and the model is induced from the data.

Once the model is obtained, it should be evaluated and interpreted, from both statistical and biological points of view, and, if necessary, we should return to the previous steps for a new iteration. This includes resolving conflicts with the current knowledge in the domain. The satisfactorily checked model, together with the newly discovered knowledge, is then used to solve the problem.
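The impute-and-classify steps described above can be sketched in a few lines, here as mean imputation followed by a nearest-centroid classifier; this is a schematic illustration with invented numbers, not any specific paper's pipeline.

```python
def impute_means(rows):
    """Replace None entries with the column mean (the imputation step)."""
    cols = list(zip(*rows))
    means = [sum(v for v in col if v is not None) /
             sum(1 for v in col if v is not None) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def nearest_centroid(train, labels, x):
    """Predict the label whose class centroid is closest to x (the mining step)."""
    centroids = {}
    for lab in set(labels):
        members = [r for r, l in zip(train, labels) if l == lab]
        centroids[lab] = [sum(col) / len(members) for col in zip(*members)]
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda lab: dist(centroids[lab], x))

# toy expression profiles with missing values, then a prediction for a new sample
data = impute_means([[1.0, None], [2.0, 2.0], [8.0, 8.0], [9.0, None]])
print(nearest_centroid(data, ["low", "low", "high", "high"], [1.5, 2.0]))  # -> low
```

Evaluation and interpretation of the induced model (the final step above) would then decide whether another iteration through cleaning and mining is needed.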

An article published in the journal ‘Briefings in Bioinformatics’ gives an overview of the various machine learning techniques used in Bioinformatics. It also throws light on major techniques such as Bayesian classifiers, logistic regression, discriminant analysis, classification trees, nearest neighbour, neural networks, support vector machines, clustering, hidden Markov models and much more.

 The article can be found here: