March 31, 2010

QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects

Filed under: Bioinformatics,Computational Biology,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 12:02 am

QDD is an open access program providing a user-friendly tool for microsatellite detection and primer design from large sets of DNA sequences. The program is designed to deal with all steps of treatment of raw sequences obtained from pyrosequencing of enriched DNA libraries, but it is also applicable to data obtained through other sequencing methods, using FASTA files as input. The following tasks are completed by QDD: tag sorting, adapter/vector removal, elimination of redundant sequences, detection of possible genomic multicopies (duplicated loci or transposable elements), stringent selection of target microsatellites and customizable primer design. It can treat up to one million sequences of a few hundred base pairs in the tag-sorting step, and up to 50 000 sequences in a single input file for the steps involving estimation of sequence similarity.
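To make the microsatellite detection step more concrete, here is a minimal Python sketch that finds perfect tandem repeats with a regular expression. It is not QDD's actual algorithm; the function name `find_microsatellites` and the thresholds (motifs of 2 to 6 bases, at least 5 repeats) are assumptions chosen only for illustration.

```python
import re

def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    Illustrative only: thresholds are assumed, not taken from QDD.
    """
    seq = seq.upper()
    # A motif of min_unit..max_unit bases (shortest first), followed by
    # at least (min_repeats - 1) further copies of the same motif.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits

# Toy sequence: eight copies of "AC" flanked by unrelated bases.
print(find_microsatellites("GGTT" + "AC" * 8 + "GGTT"))
```

A real pipeline would also have to handle imperfect repeats and check that enough flanking sequence remains for primer design, which this sketch ignores.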

Availability: QDD is freely available under the GPL licence for Windows and Linux from the following web site:

March 29, 2010

webMGR: an online tool for the multiple genome rearrangement problem

Filed under: Bioinformatics,Computational Biology,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 12:00 am

The algorithm MGR enables the reconstruction of rearrangement phylogenies based on gene or synteny block order in multiple genomes. Although MGR has been successfully applied to study the evolution of different sets of species, its utilization has been hampered by the prohibitive running time for some applications. In the current work, we have designed new heuristics that significantly speed up the tool without compromising its accuracy. Moreover, we have developed a web server (webMGR) that includes elaborate web output to facilitate navigation through the results.
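As a rough illustration of what comparing gene order between genomes can mean, the Python sketch below counts breakpoints, that is, neighbouring gene pairs in one genome that are no longer neighbours in another. This is a simplified, unsigned measure written for this post and is not the MGR algorithm or its heuristics; the function name `count_breakpoints` is hypothetical.

```python
def count_breakpoints(order_a, order_b):
    """Count adjacencies of order_a that are not adjacent in order_b.

    Gene orientation is ignored (an assumption made to keep the sketch short).
    """
    adjacent_in_b = set()
    for x, y in zip(order_b, order_b[1:]):
        adjacent_in_b.add((x, y))
        adjacent_in_b.add((y, x))  # adjacency is treated as unordered
    return sum(
        1 for x, y in zip(order_a, order_a[1:]) if (x, y) not in adjacent_in_b
    )

# Toy genomes: genome_b is genome_a with the block 2..4 reversed.
genome_a = [1, 2, 3, 4, 5]
genome_b = [1, 4, 3, 2, 5]
print(count_breakpoints(genome_a, genome_b))
```

A single reversal creates at most two breakpoints, which is why breakpoint counts are often used as a quick proxy for how many rearrangements separate two genomes.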

webMGR can be accessed via

January 13, 2010

Database of human Protein-DNA Interactions – hPDI

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 6:05 am

The characterization of protein-DNA interactions usually requires three levels of analysis:

1. Genetic: Determination of the nucleotide sequence of the protein-binding region and the identification of sequence changes that confer a mutated phenotype.

2. Biochemical: Identification of potential protein-DNA contacts using a variety of footprinting and protection experiments such as DNaseI or hydroxyl radical footprinting, and methylation protection or ethylation protection experiments.

3. Physical: Analysis of specific interactions in protein-target sequence fragment co-crystals.

The hPDI database holds experimental protein-DNA interaction data for humans identified by protein microarray assays. The current release of hPDI contains 17,718 protein-DNA interactions for 1,013 human DNA-binding proteins. These DNA-binding proteins include 493 human transcription factors (TFs) and 520 unconventional DNA-binding proteins (uDBPs). The database is freely accessible for academic purposes.

hPDI can be accessed from here:

January 12, 2010

Comparative Genetic Maps visualisation: CMap3D

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 5:52 am

Chromosome mapping by counting the number of recombinants produces a genetic map of the chromosome. But all the genes on the chromosome are incorporated in a single molecule of DNA. Genes are simply portions of the molecule (open reading frames or ORFs) encoding products that create the observed trait (phenotype). The rapid progress in DNA sequencing has produced complete genomes for hundreds of microbes and several eukaryotes.
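The idea of mapping by counting recombinants can be sketched in a few lines of Python. The sketch assumes the usual short-distance approximation that 1% recombinant offspring corresponds to roughly 1 centimorgan (cM); the function name and the offspring counts are made up for illustration.

```python
def map_distance_cm(recombinants, total_offspring):
    """Approximate map distance in centimorgans between two loci.

    Uses the simple rule 1% recombinants ~ 1 cM, which only holds for
    closely linked loci (double crossovers are ignored).
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must be between 0 and total_offspring")
    return 100.0 * recombinants / total_offspring

# Hypothetical cross: 12 recombinant offspring out of 200 scored.
print(map_distance_cm(12, 200))
```

For loci that are far apart, mapping functions such as Haldane's or Kosambi's are normally used instead of this linear approximation.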

Genetic linkage mapping enables the study of genome organisation and the association of heritable traits with regions of sequenced genomes and underlying genome sequence variation. Comparative genetic mapping is particularly powerful as it allows a translation of information between related genomes and gives an insight into genome evolution.

CMap3D is a powerful tool for graphically comparing multiple genetic maps in three-dimensional space. It is a stand-alone application available for Windows, OS X, and Linux, and it is free for academic users.

For more information, refer to this link:

December 24, 2009

“Omics” Technologies

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 8:42 am

“Ome” and “omics” are suffixes derived from genome (the whole collection of a person’s DNA, a term coined by Hans Winkler as a combination of “gene” and “chromosome”) and genomics (the study of the genome). Scientists like to append these to any large-scale system (or really, just about anything complex), such as the collection of proteins in a cell or tissue (the proteome), the collection of metabolites (the metabolome), and the collection of RNA that’s been transcribed from genes (the transcriptome). High-throughput analysis is essential when considering data at the “omic” level, that is to say, considering all DNA sequences, gene expression levels, or proteins at once (or, to be slightly more precise, a significant subset of them). Without the ability to rapidly and accurately measure tens or hundreds of thousands of data points in a short period of time, there is no way to perform analyses at this level.

There are four major types of high-throughput measurements that are commonly performed: genomic SNP analysis (i.e., the large-scale genotyping of single nucleotide polymorphisms), transcriptomic measurements (i.e., the measurement of all gene expression values in a cell or tissue type simultaneously), proteomic measurements (i.e., the identification of all proteins present in a cell or tissue type), and metabolomic measurements (i.e., the identification and quantification of all metabolites present in a cell or tissue type). Each of these four is distinct and offers a different perspective on the processes underlying disease initiation and progression as well as on ways of predicting, preventing, or treating disease.

Genomic SNP genotyping measures a person’s genotypes for several hundred thousand single nucleotide polymorphisms spread throughout the genome. Other assays exist to genotype ten thousand or so polymorphic sites that are near known genes (under the assumption that these are more likely to have some effect on these genes). The genotyping technology is quite accurate, but the SNPs themselves offer only limited information. These SNPs tend to be quite common (with typically at least 5% of the population having at least one copy of the less frequent allele), and are not strictly causal of the disease. Rather, SNPs can act in unison with other SNPs and with environmental variables to increase or decrease a person’s risk of a disease. This makes identifying important SNPs difficult; the variation in a trait that can be accounted for by a single SNP is fairly small relative to the total variation in the trait. Even so, because genotypes remain constant (barring mutations to individual cells) throughout life, SNPs are potentially among the most useful measurements for predicting risk.

Transcriptomic measurements (often referred to as gene expression microarrays or “gene chips”) are the oldest and most established of the high-throughput methodologies. The most common are commercially produced “oligonucleotide arrays”, which have hundreds of thousands of small (25 bases) probes, between 11 and 20 per gene. RNA that has been extracted from cells is then hybridized to the chip, and the expression level of ~30,000 different mRNAs can be assessed simultaneously. More so than with SNP genotypes, there is the potential for a significant amount of noise in transcriptomic measurements. The source of the RNA, the preparation and purification methods, and variations in the hybridization and scanning process can lead to differences in expression levels; statistical methods to normalize, quantify, and analyze these measures have been one of the hottest areas of research in the last five years. Gene expression levels influence traits more directly than SNPs, and so significant associations are easier to detect. Transcriptomic measures are not as useful for pre-disease prediction, because a person’s gene expression levels can change so significantly that values measured very far in advance of disease initiation are not likely to be informative. They are, however, very well suited for either early identification of a disease (i.e., finding people who have gene expression levels characteristic of a disease but who have not yet manifested other symptoms) or classifying patients with a disease into subgroups (by identifying gene expression levels that are associated with either better or worse outcomes or with higher or lower values of some disease phenotype).

Proteomics is similar in character to transcriptomics. The most significant difference is in the measurements. Unlike transcriptomics, where the gene expression levels are assessed simultaneously, protein identification is done in a rapid serial fashion. After a sample has been prepared, the proteins are separated using chromatography, two-dimensional protein gels (which separate proteins based on charge and then size) or one-dimensional protein gels (which separate based on size alone), digested, typically with trypsin (which cuts proteins after each arginine and lysine), and then run through mass spectrometry. The mass spec identifies the mass of each of the peptides, and the proteins can be identified by comparing the masses of the peptides created with the theoretical digests of all known proteins in a database. This searching is the key to the technology, and a number of algorithms, both commercial and open-source, have been created for it. Unlike transcriptomic measures, the overall quantity of a protein cannot be assessed, just its presence or absence. Like transcriptomic measures, though, proteomic measures are excellent for early identification of disease or classifying people into subgroups.
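The "theoretical digest" used in the database search can be illustrated with a small Python sketch. It applies only the simple rule stated above (cut after every lysine, K, and arginine, R); real search engines usually add refinements such as not cutting before proline and allowing missed cleavages, which are left out here. The function name and the example sequence are hypothetical.

```python
def tryptic_digest(protein):
    """Split a protein sequence into peptides, cutting after each K or R.

    Simplified rule only: the common "no cleavage before proline"
    exception and missed cleavages are not modelled.
    """
    peptides = []
    current = []
    for residue in protein.upper():
        current.append(residue)
        if residue in "KR":
            peptides.append("".join(current))
            current = []
    if current:  # trailing peptide that does not end in K or R
        peptides.append("".join(current))
    return peptides

# Made-up sequence used purely as an example.
print(tryptic_digest("MKTAYRAKQG"))
```

In an actual search, the mass of each theoretical peptide would then be computed and compared with the masses observed by the instrument.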

Last up is metabolomics, the high-throughput measure of the metabolites present in a cell or tissue. As with proteomics, the metabolites are measured in a very fast serial process. NMR is typically used to both identify and quantify metabolites. This technology is newer and less frequently used than the other technologies, but similar caveats apply. Measurements of metabolites are dynamic as are gene expression levels and proteins, and so are best suited for either early disease detection or disease subclass identification.

So, that was a brief introduction to the “omics”. I will include details on each in my next posts.

Till then Happy Xmas Season !!

December 8, 2009

Bioinformatics Education—Perspectives and Challenges

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 4:59 am

Here is another post similar to the one I posted yesterday. This article mainly talks about bioinformatics education and the challenges involved in it.
Read it below:

Education in bioinformatics has undergone a sea change, from informal workshops and training courses to structured certificate, diploma, and degree programs—spanning casual self-enriching courses all the way to doctorate programs. The evolution of curriculum, instructional methodologies, and initiatives supporting the dissemination of bioinformatics is presented here.

Building on the early applications of informatics (computer science) to the field of biology, bioinformatics research entails input from the diverse disciplines of mathematics and statistics, physics and chemistry, and medicine and pharmacology. Providing education in bioinformatics is challenging from this multidisciplinary perspective, and represents short- and long-term efforts directed at casual and dedicated learners in academic and industrial environments.

Training in bioinformatics remains the oldest and most important rapid induction approach to learning bioinformatics skills. Both formal training (short-term courses) and informal training (on-demand “how-to” procedures) have remained the mainstay of on-the-job programs. This reminds me to share a link that provides online bioinformatics training. Do visit it !!

After almost a decade of short-term training, and retraining students, faculty, and scientists in discrete aspects of bioinformatics, the impetus to formalize bioinformatics education came in 1998, with a wish list of topics for an ideal bioinformatics educational program at the masters and PhD levels. Given the multidisciplinary nature of bioinformatics and the need for designing cross-faculty courses, by 2001 only a handful of universities had successfully commenced formal education in bioinformatics, with others waiting and watching.

This wait-and-watch attitude steadily faded, and a number of courses, including professional and corporate courses in bioinformatics, were introduced, which helped enhance knowledge in bioinformatics research.

Hope this knowledge grows further and becomes the first choice of career for research scholars !!


December 7, 2009

Career in Bioinformatics

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 6:55 am

For many students, bioinformatics is still a puzzle. What comes before bioinformatics, what is bioinformatics, and what comes after bioinformatics? These are some of the most common questions people ask. While browsing through the latest articles on PubMed Central, a paper authored by Shoba Ranganathan caught my attention. It is titled “Towards a career in bioinformatics”. Wide-eyed, I started reading the article and found it informative, interesting and useful. Below is a short summary of the article.

Science is itself a quest for truth, and honesty in scientific endeavours is the keystone to a successful career. Scientific integrity in presenting research results and honesty in dealing with colleagues are invaluable to a scientific career, especially one that deals with large datasets. In this context, acknowledging the prior work of other scientists is important.

Domain knowledge is the key to a successful career in bioinformatics. “Computational biology” is not merely a sum of its parts, viz. computer science/informatics and biology. It also requires knowledge of mathematics, statistics, biochemistry and sometimes a nodding acquaintance with physics, chemistry and medical sciences. A career in bioinformatics requires problem solving. Here, you need to show persistence in following your hypothesis, even if others think that you are wrong. At the same time, be prepared to modify your hypothesis if the data suggest otherwise. Reaching your ultimate goal is of principal importance, no matter which path you follow.

Many graduate students simply see their bioinformatics Ph.D. as a goal. For a career, you must make plans for the next year, next three years and maybe even the next five years. Graduate school, your first job, your next job, your publication profile can all be planned as projects using project management tools. Without plans, you are drifting on the internet, without a specific search in mind.

Among the numerous areas of bioinformatics endeavour, traditional avenues such as sequence analysis, genetic and population analysis, structural bioinformatics, text mining and ontologies are represented in this supplement, while chemoinformatics and biodiversity informatics embody emerging bioinformatics themes. In order to carry out bioinformatics research, innovative teaching is a prerequisite. Improvement in bioinformatics learning is evident from the case study using e-learning tools.

This paper covers many areas of bioinformatics which might prove useful for graduates and post graduates. Here is the link to the full article:

Have a promising career in Bioinformatics !!

December 1, 2009

Machine Learning in Bioinformatics: A Review

Filed under: Bioinformatics,Computational Biology,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 12:12 pm

Continued research has led to steady growth in the amount of biological data available. This exponential growth raises two problems:

1. Efficient information storage and management.

2. The extraction of useful information from these data, which requires the development of tools and methods capable of transforming all these heterogeneous data into biological knowledge about the underlying mechanism.

There are various biological domains where machine learning techniques are applied for knowledge extraction from data. The figure below shows the main areas of biology, such as genomics, proteomics, microarrays, evolution and text mining, where computational methods are being applied.


In addition to the applications above, computational techniques are used to solve other problems, such as efficient primer design for PCR, biological image analysis and backtranslation of proteins (which is, given the degeneracy of the genetic code, a complex combinatorial problem). Machine learning consists of programming computers to optimize a performance criterion by using example data or past experience. The optimized criterion can be the accuracy provided by a predictive model (in a modelling problem) or the value of a fitness or evaluation function (in an optimization problem). Machine learning uses statistical theory when building computational models, since the objective is to make inferences from a sample. The two main steps in this process are:

1. To induce the model by processing the huge amount of data.

2. To represent the model and make inferences efficiently.
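To give a feel for why the backtranslation of proteins mentioned above is a combinatorial problem, the Python sketch below counts how many different DNA sequences could encode a given protein under the standard genetic code. The codon counts per amino acid are those of the standard code; the function name and the example peptides are chosen only for illustration.

```python
# Number of codons encoding each amino acid in the standard genetic code
# (61 sense codons in total; stop codons are excluded).
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def count_backtranslations(protein):
    """Return the number of distinct coding sequences for a protein."""
    total = 1
    for amino_acid in protein.upper():
        total *= CODONS_PER_AMINO_ACID[amino_acid]
    return total

# Three residues with six codons each already give 6 * 6 * 6 candidates.
print(count_backtranslations("LSR"))
```

Because the count is a product over all residues, it grows exponentially with protein length, so any practical method has to search or score candidates rather than enumerate them.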

The process of transforming data into knowledge is both iterative and interactive. The iterative phase consists of several steps. In the first step, we need to integrate and merge the different sources of information into a single format. Data warehouse techniques are used here to detect and resolve outliers and inconsistencies. In the second step, it is necessary to select, clean and transform the data. To carry out this step, we need to eliminate or correct the incorrect data, as well as decide on a strategy to impute missing data. This step also selects the relevant and non-redundant variables; this selection could also be done with respect to the instances. In the third step, called data mining, we take the objectives of the study into account in order to choose the most appropriate analysis for the data. In this step, the type of paradigm for supervised or unsupervised classification should be selected and the model will be induced from the data. Once the model is obtained, it should be evaluated and interpreted, from both statistical and biological points of view, and, if necessary, we should return to the previous steps for a new iteration. This includes the resolution of conflicts with the current knowledge in the domain. The satisfactorily checked model, and the new knowledge discovered, are then used to solve the problem.
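The "induce a model, then evaluate it" part of this process can be sketched with one of the simplest supervised classifiers, nearest neighbour. The Python example below is only a toy: the data points and labels are invented, the function names are hypothetical, and leave-one-out evaluation is just one of several reasonable ways to estimate accuracy.

```python
import math

def nearest_neighbour_predict(train, query):
    """Return the label of the training example closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(data):
    """Hold out each example in turn and predict it from the rest."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if nearest_neighbour_predict(rest, features) == label:
            correct += 1
    return correct / len(data)

# Invented two-feature "expression" values for two classes.
toy_data = [
    ((1.0, 1.0), "low"), ((1.2, 0.9), "low"), ((0.8, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.8), "high"), ((4.9, 5.1), "high"),
]
print(leave_one_out_accuracy(toy_data))
print(nearest_neighbour_predict(toy_data, (1.1, 1.0)))
```

If the estimated accuracy were poor, the iterative process described above would send us back to the data selection and cleaning steps before inducing a new model.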

An article published in the journal ‘Briefings in Bioinformatics’ gives an insight into the various machine learning techniques used in bioinformatics. It also sheds light on some major techniques such as Bayesian classifiers, logistic regression, discriminant analysis, classification trees, nearest neighbour, neural networks, support vector machines, clustering, hidden Markov models and much more.

The article can be found here:


November 11, 2009

PLAST: Parallel Local Alignment Search Tool for Database Comparison

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 7:17 am

Genomic sequence comparison is a central task in computational biology for identifying closely related protein or DNA sequences. Similarities between sequences are commonly used, for instance, to identify functionality of new genes or to annotate new genomes. Algorithms designed to identify such similarities have long been available and still represent an active research domain, since this task remains critical for many bioinformatics studies. Two avenues of research are generally explored to improve these algorithms, depending on the target application.

1. The first aims to increase sensitivity.
2. The second seeks to minimize computation time.

With next-generation sequencing technology, the challenge is not only to develop new algorithms capable of managing large amounts of sequences, but also to imagine new methods for processing this mass of data as quickly as possible.

The PLAST program is a pure software implementation designed to exploit the internal parallel features of modern microprocessors. The sequence comparison algorithm has been structured to group together the most time-consuming parts inside small critical sections that have good properties for parallelism. The resulting code is well suited to both fine-grained parallelization (SIMD programming model) and medium-grained parallelization (multithreaded programming model). The first level of parallelism is supported by SSE instructions. The second is exploited with the multicore architecture of the microprocessors.

PLAST has been primarily designed to compare large protein or DNA banks. Unlike BLAST, it is not optimized to perform large database scanning. It is intended more for use in intensive comparison processes such as bioinformatics workflows, for example, to annotate newly sequenced genomes. Different versions have been developed based on the BLAST family model: PLASTP for comparing two protein banks, TPLASTN for comparing one protein bank with one translated DNA bank (or genome) and PLASTX for comparing one translated DNA bank with one protein bank. The input format is the well-known FASTA format. No pre-processing (such as formatdb) is required. Like BLAST, the PLAST algorithm detects alignments using a seed heuristic method, but does so in a slightly different way. Consequently, it does not provide the same alignments, especially when there is little similarity between two sequences: some alignments are found by PLAST and not by BLAST, others are found by BLAST and not by PLAST. Nonetheless, comparable selectivity and sensitivity were measured using ROC curves, coverage versus error plots, and missed alignments.
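The general idea of a seed heuristic can be sketched in Python: index the short words (k-mers) of one sequence, then look up the words of the other to find positions worth extending into alignments. This is a generic exact-match sketch, not the seeding scheme used by PLAST or BLAST; the function names and the word length `k=3` are assumptions, and the extension and scoring stages are omitted.

```python
from collections import defaultdict

def index_kmers(seq, k):
    """Map each k-mer of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs that share an exact k-mer."""
    subject_index = index_kmers(subject, k)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in subject_index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds

# Toy sequences sharing the single word "ACG".
print(find_seeds("ACGTT", "TTACGA"))
```

Indexing one bank once and reusing the index for every query is what makes this kind of approach attractive for bank-versus-bank comparison; differences in how seeds are chosen are one reason two tools can report different alignments.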

PLAST can be downloaded from here:






November 9, 2009

CDD: Database for Interactive Domain Family Analysis

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 8:30 am

Protein domains may be viewed as units in the molecular evolution of proteins and can be organized into an evolutionary classification. The set of protein domains characterized so far appears to describe no more than a few thousand superfamilies, where members of each superfamily are related to each other by common descent. Computational annotation of protein function is generally obtained via sequence similarity: once a close neighbor with known function has been identified, its annotation is copied to the sequence with unknown function. This strategy may work very well in functionally homogeneous families and when applied only for very close neighbors or suspected orthologs, but it is doomed to fail often when domain or protein families are sufficiently diverse and when no close neighbors with known function are available.

NCBI’s conserved domain database (CDD) attempts to collate the set of known protein domains and to organize related domain models in a hierarchical fashion, meant to reflect major ancient gene duplication events and subsequent functional diversification. CDD is part of NCBI’s Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. CDD provides a strategy toward a more accurate assessment of such neighbor relationships, similar to approaches termed ‘phylogenomic inference’. CDD acknowledges that protein domain families may be very diverse and that they may contain sets of related subfamilies.

In CDD curation, we attempt to detect evidence for duplication and functional divergence in domain families by means of phylogenetic analysis. We record the resulting subfamily structure as a set of explicit models, but limit the analysis to ancient duplication events—several hundred million years in the past, as judged by the taxonomic distribution of protein sequences with particular domain subfamily footprints. CDD provides a search tool employing reverse position-specific BLAST (RPS–BLAST), where query sequences are compared to databases of position-specific score matrices (PSSMs), and E-values are obtained in much the same way as in the widely used PSI-BLAST application.
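To show what comparing a query sequence to a position-specific score matrix involves, here is a minimal Python sketch that slides a tiny PSSM along a sequence and reports the best-scoring ungapped window. It is not RPS-BLAST: the matrix values are invented, residues missing from a column are given a score of 0 as a simplifying assumption, and no E-value is computed.

```python
def best_pssm_hit(sequence, pssm):
    """Return (start, score) of the best ungapped match of pssm in sequence.

    pssm is a list of dicts, one per column, mapping residue -> score.
    Returns (None, None) if the sequence is shorter than the matrix.
    """
    width = len(pssm)
    best_pos, best_score = None, None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues absent from a column score 0 (assumption for this sketch).
        score = sum(col.get(res, 0) for col, res in zip(pssm, window))
        if best_score is None or score > best_score:
            best_pos, best_score = start, score
    return best_pos, best_score

# Invented three-column matrix favouring the pattern G-K-S.
toy_pssm = [
    {"G": 3, "A": -1},
    {"K": 2, "R": 1},
    {"S": 3, "T": 2},
]
print(best_pssm_hit("MAGKSL", toy_pssm))
```

A real PSSM has a score for all twenty amino acids in every column, and the search tool converts raw scores into E-values so that hits against different domain models can be compared.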

CDD is hosted here:




