November 30, 2009

QuPE – For Mass Spectrometry-based Quantitative Proteomics Research

Filed under: Bioinformatics,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 4:22 am

Mass spectrometry (MS) is an indispensable technique for the fast analysis of proteins and peptides in complex biological samples. One key problem with the quantitative mass spectrometric analysis of peptides and proteins, however, is that the sensitivity of MS instruments is peptide-dependent, leading to an unclear relationship between the observed peak intensity and the peptide concentration in the sample. Various labeling techniques have been developed to circumvent this problem, but they are expensive and time-consuming. A reliable prediction of peptide-specific sensitivities could provide a peptide-specific correction factor, which would be valuable for label-free absolute quantitation.
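
The arithmetic behind such a correction is simple. The sketch below is illustrative only, with made-up intensities and response factors: each observed peak intensity is divided by a predicted peptide-specific factor to obtain a concentration-proportional value.

```python
def corrected_intensity(intensity, response_factor):
    """Divide an observed peak intensity by the predicted
    peptide-specific response factor; the result is proportional
    to peptide concentration rather than to instrument sensitivity."""
    return intensity / response_factor

# Hypothetical peptides: raw intensities differ four-fold, but the
# corrected values agree because the factors absorb the difference.
print(corrected_intensity(4.0e6, 2.0))  # 2000000.0
print(corrected_intensity(1.0e6, 0.5))  # 2000000.0
```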

QuPE is an integrated platform for the storage and analysis of quantitative proteomics data, implemented in Java. It is both a repository and an algorithmic framework to store and analyse mass spectrometry-based quantitative proteome experiments. QuPE provides an easily extensible and configurable job concept: using XML, jobs consisting of one or more tools can be defined, where the input and output types provided by the implementation of a tool determine the data a job is executed with. Through dedicated interfaces, tools can announce their need for an interactive configuration. The job and tool concept also allows the integration of routines written in R, a programming language specifically designed for mathematical and statistical purposes. The main features of QuPE are listed below:

– Web browser-based application using Web 2.0 technologies
– Extensive capabilities to securely store and organise experiments and complete projects (fine-grained application-based security, GPMS)
– Import of mzData as well as mzXML
– Data model adapted to suggestions made by the HUPO proteomics standards initiative (PSI)
– Mascot integration, Import of DTASelect results
– Framework supporting analysis of quantitative proteomics data, including:
  – Quantification of stable-isotope-labelled samples
  – Significance tests, analysis of variance
  – Principal component analysis
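
As a flavour of the statistical layer, a significance test on quantification values can be as simple as a two-sample t statistic. The sketch below is a minimal stand-alone illustration with invented data, not QuPE's actual R-based implementation: it computes Welch's t for one protein measured in two conditions.

```python
import statistics
from math import sqrt

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples, e.g.
    log-transformed abundances of one protein in two conditions."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    return (mean_a - mean_b) / sqrt(var_a / len(sample_a) + var_b / len(sample_b))

# Hypothetical log2 abundances from four replicates per condition
control   = [1.0, 1.2, 0.9, 1.1]
treatment = [2.0, 2.1, 1.9, 2.2]
print(welch_t(control, treatment))  # strongly negative: treatment is higher
```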

QuPE is hosted here:


November 17, 2009

OrChem: A Chemistry Search Engine for Oracle

Filed under: Bioinformatics,Chemoinformatics — Biointelligence: Education,Training & Consultancy Services @ 10:46 am

Registration, indexing and searching of chemical structures in relational databases is one of the core areas of cheminformatics. Research on the topic goes back to the 1960s, and probably earlier. However, little detail has been published on the inner workings of search engines, and developments have been mostly closed-source. As a result, despite more than thirty years of research and publications, very little open reference code is available for use and study. The cheminformatics open source community has been working since the mid 1990s to overcome this problematic situation.

OrChem is an extension for the Oracle 11g database that adds registration and indexing of chemical structures to support fast substructure and similarity searching. The cheminformatics functionality is provided by the Chemistry Development Kit. OrChem offers similarity searching with response times in the order of seconds for databases with millions of compounds, depending on the chosen similarity cut-off. For substructure searching, it can make use of multiple processor cores on today’s powerful database servers to provide fast response times in equally large data sets.

OrChem is built on top of the Chemistry Development Kit (CDK) and depends on this Java library in numerous ways. For example, compounds are represented internally as CDK molecule objects, the CDK’s I/O package is used to retrieve compound data, and its subgraph isomorphism algorithms are used for substructure validation. OrChem adds its own Java layer on top of the CDK to implement fast database storage and retrieval. With the CDK loaded into Oracle, a large cheminformatics library becomes readily available to PL/SQL. With little effort developers can build database functions around the CDK and so quickly implement chemistry extensions for Oracle. OrChem works in the same way, using the CDK where possible.

It uses chemical fingerprints to find compounds by substructure or similarity criteria. Fingerprints are bitsets in which each bit indicates the presence or absence of a particular chemical aspect. During a similarity search the fingerprints are used to calculate a Tanimoto measure. A Tanimoto similarity measure between two binary fingerprints is defined by the ratio of the number of common bits set to one to the total number of bits set to one in the two fingerprints. For substructure searching the fingerprint has a different function: it is used to screen possible candidates before a computationally more expensive isomorphism test.
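
Both ideas are easy to state in code. In the sketch below (a simplified illustration, not OrChem's PL/SQL implementation), fingerprints are held as plain Python integers whose bits mark the presence of chemical features:

```python
def bit_count(fp):
    """Number of bits set to one in an integer fingerprint."""
    return bin(fp).count("1")

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: common on-bits divided by the total
    number of distinct on-bits across both fingerprints."""
    union = bit_count(fp_a | fp_b)
    return bit_count(fp_a & fp_b) / union if union else 0.0

def may_contain(query_fp, candidate_fp):
    """Substructure prescreen: every feature bit of the query must
    also appear in the candidate; if not, the candidate is rejected
    without running the expensive isomorphism test."""
    return query_fp & candidate_fp == query_fp
```

For example, tanimoto(0b1101, 0b1011) is 0.5 (two common bits out of four distinct ones), while may_contain(0b0101, 0b1001) is False, so that candidate would be screened out before any subgraph isomorphism test.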

The OrChem search engine should prove beneficial to the cheminformatics community. More can be read on OrChem here:

November 12, 2009

KEGGConverter: Tool for modelling Metabolic Networks

Filed under: Bioinformatics,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 7:47 am

The Kyoto Encyclopedia of Genes and Genomes (KEGG) PATHWAY database is a valuable comprehensive collection of manually curated pathway maps for metabolism, genetic information processing and other functions. It is an integrated database resource consisting of 16 main databases, broadly categorized into systems information, genomic information, and chemical information. Genomic and chemical information represents the molecular building blocks of life in the genomic and chemical spaces, respectively, while systems information represents functional aspects of the biological systems, such as the cell and the organism, that are built from those building blocks. KEGG has been widely used as a reference knowledge base for biological interpretation of large-scale datasets generated by sequencing and other high-throughput experimental technologies.

The KEGG PATHWAY database is a valuable collection of metabolic pathway maps. Nevertheless, producing simulation-capable metabolic networks from KEGG PATHWAY data remains challenging, despite the tools already developed for this purpose. Although originally intended for illustration, KEGG pathways, through KGML (KEGG Markup Language) files, can provide complete reaction sets and introduce species versioning, which offers advantages for cellular metabolism simulation modelling.

To construct such metabolic pathways, KEGGconverter has been implemented. It is a tool written in Java, capable of producing integrated analogues of metabolic pathways appropriate for simulation tasks from KGML files alone. The web application acts as a user-friendly shell which transparently enables automated, biochemically correct pathway merging, conversion to SBML format, proper renaming of the species, and insertion of default kinetic properties for the pertaining reactions. It also permits the inclusion of additional reactions in the resulting model, representing flux cross-talk with neighbouring pathways and thereby improving simulation accuracy.
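
The merging step can be pictured with a toy data structure. The sketch below is purely illustrative (KEGGconverter operates on full KGML and SBML documents, and the reaction dictionaries here are invented): it combines reaction sets from two pathway maps while disambiguating clashing reaction identifiers.

```python
def merge_pathways(pathways):
    """Merge per-pathway reaction dicts {reaction_id: (substrates, products)}
    into one model. Shared compound names are kept as-is so the merged
    network stays connected; a reaction id seen before is re-keyed with
    its pathway id rather than silently overwritten."""
    merged = {}
    for pathway_id, reactions in pathways.items():
        for reaction_id, equation in reactions.items():
            key = reaction_id if reaction_id not in merged else f"{pathway_id}:{reaction_id}"
            merged[key] = equation
    return merged

glycolysis = {"R00200": (["PEP", "ADP"], ["Pyruvate", "ATP"])}
tca_cycle = {"R00200": (["PEP", "ADP"], ["Pyruvate", "ATP"]),
             "R00351": (["Oxaloacetate", "Acetyl-CoA"], ["Citrate"])}
model = merge_pathways({"map00010": glycolysis, "map00020": tca_cycle})
print(sorted(model))  # ['R00200', 'R00351', 'map00020:R00200']
```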
KEGGconverter is available here:

November 11, 2009

PLAST: Parallel Local Alignment Search Tool for Database Comparison

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 7:17 am

Genomic sequence comparison is a central task in computational biology for identifying closely related protein or DNA sequences. Similarities between sequences are commonly used, for instance, to identify functionality of new genes or to annotate new genomes. Algorithms designed to identify such similarities have long been available and still represent an active research domain, since this task remains critical for many bioinformatics studies. Two avenues of research are generally explored to improve these algorithms, depending on the target application.

1. The first aims to increase sensitivity.
2. The second seeks to minimize computation time.

With next-generation sequencing technology, the challenge is not only to develop new algorithms capable of managing large amounts of sequences, but also to devise new methods for processing this mass of data as quickly as possible.

The PLAST program is a pure software implementation designed to exploit the internal parallel features of modern microprocessors. The sequence comparison algorithm has been structured to group together the most time consuming parts inside small critical sections that have good properties for parallelism. The resulting code is both well-suited for fine-grained (SIMD programming model) and medium-grained parallelization (multithreaded programming model). The first level of parallelism is supported by SSE instructions. The second is exploited with the multicore architecture of the microprocessors.
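
The medium-grained level can be sketched in a few lines. The example below is a schematic stand-in, not PLAST's actual code: the scoring kernel is a trivial placeholder for the SSE-vectorised critical section, and only the split of a subject bank across threads is shown.

```python
from concurrent.futures import ThreadPoolExecutor

def score(query, subject):
    """Placeholder comparison kernel: counts matching positions.
    In PLAST this inner loop is the SIMD-accelerated critical section."""
    return sum(a == b for a, b in zip(query, subject))

def compare_bank(query, bank, workers=4):
    """Medium-grained parallelism: each subject sequence is scored by a
    thread pool, mirroring PLAST's use of multiple processor cores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda subject: score(query, subject), bank))

print(compare_bank("ACGT", ["ACGT", "ACGA", "TTTA"]))  # [4, 3, 0]
```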

PLAST has been primarily designed to compare large protein or DNA banks. Unlike BLAST, it is not optimized for large database scanning; it is intended more for intensive comparison processes such as bioinformatics workflows, for example to annotate newly sequenced genomes. Different versions have been developed based on the BLAST family model: PLASTP for comparing two protein banks, TPLASTN for comparing one protein bank with one translated DNA bank (or genome), and PLASTX for comparing one translated DNA bank with one protein bank. The input format is the well-known FASTA format, and no pre-processing (such as formatdb) is required. Like BLAST, the PLAST algorithm detects alignments using a seed heuristic method, but does so in a slightly different way. Consequently, it does not report exactly the same alignments, especially when there is little similarity between two sequences: some alignments are found by PLAST and not by BLAST, and vice versa. Nonetheless, comparable selectivity and sensitivity were measured using ROC curves, coverage-versus-error plots, and counts of missed alignments.
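
The common starting point of both tools, seed detection, can be illustrated with a toy sketch (real implementations differ: PLAST and BLAST use more elaborate seed definitions and then extend each seed into an alignment):

```python
def find_seeds(query, subject, k=3):
    """Report (query_pos, subject_pos) pairs where the two sequences
    share an exact k-mer; each such seed is a candidate anchor that a
    seed-and-extend aligner would try to grow into a full alignment."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return [(i, j)
            for j in range(len(subject) - k + 1)
            for i in index.get(subject[j:j + k], [])]

print(find_seeds("ACGTAC", "TACGT"))  # [(3, 0), (0, 1), (1, 2)]
```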

PLAST can be downloaded from here:






November 10, 2009

BASE – A software for Microarray Data Management and Analysis

Filed under: Bioinformatics,Microarray — Biointelligence: Education,Training & Consultancy Services @ 4:27 am

Microarray techniques produce large amounts of data in many different formats, and experiment sizes are growing as more samples are analysed in each experiment. Samples are collected over long periods, and microarray analysis is performed asynchronously and repeated as more samples are hybridised. Systematic use of collected data requires tracking of biomaterials, array information, raw data, and assembly of annotations. Particularly for microarray service facilities, where researchers deposit samples for experimentation, information tracking becomes vital for the subsequent delivery of data back to the researchers. To meet the information tracking and data analysis challenges involved in microarray experiments, BASE has been implemented.

BASE (Bioarray Software Environment) is a comprehensive free web-based database solution for the massive amounts of data generated by microarray analysis. It is a MIAME (Minimum Information About a Microarray Experiment guidelines) compliant application designed for microarray laboratories looking for a single point of storage for all information related to their microarray experimentation. BASE is a multi-user local data repository that features a web browser user interface, laboratory information management system (LIMS) for biomaterials and array production, annotations, hierarchical overview of analysis, and integrates tools like MultiExperiment Viewer (MEV) and GenePattern.

BASE is an annotatable microarray data repository and analysis application providing researchers with efficient information management and analysis. It stores all microarray experiment-related data, biomaterial information, and annotations, regardless of whether analysis tools for specific techniques or data formats are readily available. As new techniques become available, software applications should be extensible and modifiable to support changing needs. Moreover, BASE is open-source software and is freely available.

BASE website:

November 9, 2009

CDD: Database for Interactive Domain Family Analysis

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 8:30 am

Protein domains may be viewed as units in the molecular evolution of proteins and can be organized into an evolutionary classification. The set of protein domains characterized so far appears to describe no more than a few thousand superfamilies, where members of each superfamily are related to each other by common descent. Computational annotation of protein function is generally obtained via sequence similarity: once a close neighbor with known function has been identified, its annotation is copied to the sequence with unknown function. This strategy may work very well in functionally homogeneous families and when applied only for very close neighbors or suspected orthologs, but it is doomed to fail often when domain or protein families are sufficiently diverse and when no close neighbors with known function are available.

NCBI’s Conserved Domain Database (CDD) attempts to collate that set and to organize related domain models in a hierarchical fashion, meant to reflect major ancient gene duplication events and subsequent functional diversification. CDD is part of NCBI’s Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. It provides a strategy toward a more accurate assessment of such neighbor relationships, similar to approaches termed ‘phylogenomic inference’. CDD acknowledges that protein domain families may be very diverse and that they may contain sets of related subfamilies.

In CDD curation, we attempt to detect evidence for duplication and functional divergence in domain families by means of phylogenetic analysis. We record the resulting subfamily structure as a set of explicit models, but limit the analysis to ancient duplication events—several hundred million years in the past, as judged by the taxonomic distribution of protein sequences with particular domain subfamily footprints. CDD provides a search tool employing reverse position-specific BLAST (RPS–BLAST), where query sequences are compared to databases of position-specific score matrices (PSSMs), and E-values are obtained in much the same way as in the widely used PSI-BLAST application.
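
The scoring idea behind RPS-BLAST can be shown with a toy matrix. The sketch below is illustrative only: a real PSSM covers the twenty amino acids, and RPS-BLAST adds gapped extension and E-value statistics. Here, a small matrix of per-position residue scores over a two-letter alphabet is slid along a sequence.

```python
def pssm_score(pssm, window):
    """Sum the position-specific score of each residue in the window;
    a PSSM rewards residues conserved at that position in the domain."""
    return sum(position[residue] for position, residue in zip(pssm, window))

def best_hit(pssm, sequence):
    """Slide the matrix along the sequence and return the start index
    and score of the best-scoring window."""
    n = len(pssm)
    scores = [pssm_score(pssm, sequence[i:i + n])
              for i in range(len(sequence) - n + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Two-position toy matrix favouring the motif "AG"
pssm = [{"A": 2, "G": -1}, {"A": -1, "G": 2}]
print(best_hit(pssm, "GAAGA"))  # (2, 4)
```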

CDD is hosted here: