July 16, 2010

SLIMS—a user-friendly sample operations and inventory management system for genotyping labs

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 6:02 am

We present the Sample-based Laboratory Information Management System (SLIMS), a powerful and user-friendly open-source web application that provides all members of a laboratory with an interface to view, edit and create sample information. SLIMS aims to simplify common laboratory tasks with tools such as a user-friendly shopping cart for subjects, samples and containers that easily generates reports, shareable lists and plate designs for genotyping. Further key features include customizable data views, database change-logging and dynamically filled pre-formatted reports. Beyond being feature-rich, SLIMS draws its power from its ability to handle longitudinal data from multiple time points and biological sources. Such data are increasingly common in studies searching for susceptibility genes for common complex diseases, which collect thousands of samples, generating millions of genotypes and overwhelming amounts of data. LIMSs provide an efficient way to manage these data while increasing accessibility and reducing laboratory errors; however, professional LIMSs are often too costly to be practical. SLIMS gives labs a feasible alternative that is easily accessible, user-centred in design and feature-rich. To facilitate system customization and use by other groups, manuals have been written for both users and developers.

Availability: Documentation, source code and manuals are available at . SLIMS was developed using Java 1.6.0, JSPs, Hibernate 3.3.1.GA, DB2 and MySQL, Apache Tomcat 6.0.18, NetBeans IDE 6.5, JasperReports 3.5.1 and JasperSoft's iReport 3.5.1.

July 2, 2010

DARNED: a DAtabase of RNa EDiting in humans

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:06 am

RNA editing is a phenomenon responsible for the alteration of particular nucleotides in RNA sequences relative to their genomic templates. Recently, a large number of RNA editing instances in humans have been identified using bioinformatic screens and high-throughput experimental investigations based on next-generation sequencing technologies. However, the available data on RNA editing are not uniform and are difficult to access.

Results: Here, we describe a new database, DARNED (DAtabase of RNa EDiting), that provides centralized access to available published data related to RNA editing. RNA editing locations are mapped onto the reference human genome. The current release of the database contains information on approximately 42,000 human genome coordinates corresponding to RNA locations that undergo editing, mostly involving adenosine-to-inosine (A-to-I) substitutions. The data can be queried by genomic coordinate ranges, by the corresponding functional localization in RNA molecules [exons, introns, coding sequence (CDS) and untranslated regions (UTRs)] and by the tissue/organ/cell sources in which RNA editing has been observed. It is also possible to obtain RNA editing information for a specific gene or RNA molecule using the corresponding accession numbers. Search results report the number of expressed sequence tags (ESTs) supporting the edited and genomic bases, the functional localization of the editing site and any known overlapping single nucleotide polymorphisms (SNPs). Editing data can also be explored in the UCSC and Ensembl genome browsers, in conjunction with the additional annotation these popular browsers provide. DARNED is designed both for researchers seeking information on RNA editing and for developers of novel algorithms for its prediction.
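
The coordinate-range queries described above can be pictured with a tiny sketch. Everything here is illustrative: the record layout, field names and `query_edits` helper are assumptions made for the example, not DARNED's actual schema or API.

```python
# Hypothetical editing records: (chromosome, position, genomic base,
# edited base, functional region, tissue source). Illustrative only.
EDITS = [
    ("chr1", 1045, "A", "I", "CDS", "brain"),
    ("chr1", 2200, "A", "I", "UTR", "liver"),
    ("chr2",  310, "C", "U", "Intron", "brain"),
]

def query_edits(chrom, start, end, region=None):
    """Return editing records on `chrom` within [start, end], optionally
    restricted to a functional localization (Exon/Intron/CDS/UTR)."""
    hits = []
    for rec in EDITS:
        c, pos, gbase, ebase, reg, tissue = rec
        if c == chrom and start <= pos <= end:
            if region is None or reg == region:
                hits.append(rec)
    return hits

print(query_edits("chr1", 1000, 3000, region="CDS"))
```

A real database would back this with an index rather than a linear scan, but the query semantics are the same.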

Availability: DARNED is accessible at

June 28, 2010

Automated analysis of protein subcellular location in time series images

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:03 am

Image analysis, machine learning and statistical modeling have become well established for the automatic recognition and comparison of the subcellular locations of proteins in microscope images. By using a comprehensive set of features describing static images, major subcellular patterns can be distinguished with near-perfect accuracy. We now extend this work to time series images, which contain both spatial and temporal information. The goal is to use temporal features to improve the recognition of protein patterns that are not fully distinguishable by their static features alone.

Results: We have adopted and designed five sets of features for capturing temporal behavior in 2D time series images, based on object tracking, temporal texture, normal flow, Fourier transforms and autoregression. Classification accuracy on an image collection for 12 fluorescently tagged proteins was increased when temporal features were used in addition to static features. Temporal texture, normal flow and Fourier transform features were most effective at increasing classification accuracy. We therefore extended these three feature sets to 3D time series images, but observed no significant improvement over results for 2D images. The methods for 2D and 3D temporal pattern analysis do not require segmentation of images into single cell regions, and are suitable for automated high-throughput microscopy applications.
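
Of the five feature sets above, the autoregression idea is the easiest to sketch: fit an autoregressive model to an intensity time series and use the fitted coefficients as temporal features. The helper below is a minimal illustration of that idea under assumed details (least-squares fit, coefficients as the feature vector), not the paper's actual feature definition.

```python
import numpy as np

def ar_features(signal, order=2):
    """Fit an AR model x[t] ~ sum_k a[k] * x[t-k] by least squares and
    return the coefficients as temporal features.
    Illustrative sketch; the published features may be defined differently."""
    x = np.asarray(signal, dtype=float)
    # Lagged design matrix: row for time t holds x[t-1], ..., x[t-order].
    X = np.column_stack([x[order - k - 1:len(x) - k - 1] for k in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

series = [1.0]
for _ in range(20):
    series.append(0.5 * series[-1])   # an exact AR(1) process
print(ar_features(series, order=1))   # recovers a coefficient near 0.5
```

In practice one would extract such series per object or per region of the image and concatenate the coefficients with the static features.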

Availability: Images, source code and results will be available upon publication at

June 25, 2010

Classification of DNA sequences using Bloom filters

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 4:47 am

New-generation sequencing technologies, which produce increasingly complex datasets, demand new, efficient and specialized sequence analysis algorithms. Often, only the ‘novel’ sequences in a complex dataset are of interest, and the superfluous sequences need to be removed.

Results: A novel algorithm, fast and accurate classification of sequences (FACS), is introduced that can accurately and rapidly classify sequences as belonging or not belonging to a reference sequence. FACS was first optimized and validated using a synthetic metagenome dataset. An experimental metagenome dataset was then used to show that FACS achieves accuracy comparable to that of BLAT and SSAHA2 while being at least 21 times faster at classifying sequences.
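
The core idea behind a Bloom-filter sequence classifier of this kind can be sketched briefly: index the reference's k-mers in a Bloom filter, then call a read "reference" when enough of its k-mers hit the filter. This is a didactic sketch, not the published FACS implementation; the class, k-mer length and match threshold are illustrative choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a compact bit array with multiple hashes.
    Didactic sketch, not the FACS implementation."""
    def __init__(self, size=10000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # May yield false positives, never false negatives.
        return all(self.bits[p] for p in self._positions(item))

def classify(read, reference_filter, k=5, threshold=0.5):
    """Call a read 'reference' if enough of its k-mers hit the filter."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    hits = sum(1 for km in kmers if km in reference_filter)
    return hits / len(kmers) >= threshold

# Index a toy reference's 5-mers, then classify a read drawn from it.
reference = "ACGTACGTGGCCTTAA"
bf = BloomFilter()
for i in range(len(reference) - 4):
    bf.add(reference[i:i + 5])
print(classify("ACGTACGTGG", bf))   # → True
```

The speed advantage over alignment-based tools comes from each k-mer lookup being a handful of hash computations and bit tests, independent of reference size.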

Availability: Source code for FACS, the Bloom filters and the MetaSim dataset used is available at . The Bloom::Faster 1.6 Perl module can be downloaded from CPAN at .

June 23, 2010

Pathgroups, a dynamic data structure for genome reconstruction problems

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 5:56 am

Ancestral gene order reconstruction problems, including the median problem, quartet construction, small phylogeny, guided genome halving and genome aliquoting, are NP-hard. Available heuristics dedicated to each of these problems are computationally costly even for small instances.

Results: We present a data structure enabling rapid heuristic solutions to all of these ancestral genome reconstruction problems. A generic greedy algorithm with look-ahead, based on an automatically generated priority system, suffices for all the problems using this data structure. The efficiency of the algorithm is due to fast updating of the structure at run time and to the simplicity of the priority scheme. We illustrate with the first rapid algorithm for quartet construction and apply it to a set of yeast genomes to corroborate a recent gene sequence-based phylogeny.
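
The "fast updating" requirement above is the crux: a greedy heuristic repeatedly pops the highest-priority operation while priorities change as the structure evolves. One common way to get cheap updates is a priority queue with lazy deletion, sketched below; this is a generic pattern for illustration, not the paper's Pathgroups structure.

```python
import heapq

class DynamicPriorityQueue:
    """Max-priority queue with O(log n) updates via lazy deletion:
    stale heap entries are skipped at pop time instead of being removed.
    An illustrative pattern, not the Pathgroups data structure itself."""
    def __init__(self):
        self.heap = []
        self.current = {}   # item -> latest priority

    def push(self, item, priority):
        self.current[item] = priority
        heapq.heappush(self.heap, (-priority, item))  # negate for max-first

    def pop(self):
        while self.heap:
            neg, item = heapq.heappop(self.heap)
            if self.current.get(item) == -neg:   # skip stale entries
                del self.current[item]
                return item, -neg
        return None

pq = DynamicPriorityQueue()
pq.push("op_a", 3)
pq.push("op_b", 5)
pq.push("op_a", 7)    # re-push updates op_a's priority in O(log n)
print(pq.pop())       # → ('op_a', 7)
print(pq.pop())       # → ('op_b', 5)
```

A greedy reconstruction loop would pop the best operation, apply it, and re-push only the handful of operations whose priorities the change affected.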


June 19, 2010

Modeling RNA loops using sequence homology and geometric constraints

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am

RNA loop regions are essential structural elements of RNA molecules influencing both their structural and functional properties. We developed RLooM, a web application for homology-based modeling of RNA loops utilizing template structures extracted from the PDB. RLooM allows the insertion and replacement of loop structures of a desired sequence into an existing RNA structure. Furthermore, a comprehensive database of loops in RNA structures can be accessed through the web interface.

Availability and Implementation: The application was implemented in Python, MySQL and Apache. A web interface to the database and loop modeling application is freely available at

June 17, 2010

TEAM: efficient two-locus epistasis tests in human genome-wide association study

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am

As a promising tool for identifying genetic markers underlying phenotypic differences, the genome-wide association study (GWAS) has been extensively investigated in recent years. In GWAS, detecting epistasis (gene–gene interaction) is preferable to single-locus analysis, since many diseases are known to be complex traits. A brute-force search for epistasis at the genome-wide scale is infeasible because of the intensive computational burden. Existing epistasis detection algorithms are designed for datasets consisting of homozygous markers and small sample sizes. In human studies, however, genotypes may be heterozygous and the number of individuals can reach the thousands, so existing methods are not readily applicable to human datasets.

In this article, we propose an efficient algorithm, TEAM, which significantly speeds up epistasis detection for human GWAS. Our algorithm is exhaustive, i.e. it does not ignore any epistatic interaction. Using a minimum spanning tree structure, the algorithm incrementally updates the contingency tables for epistasis tests without scanning all individuals. The algorithm has broader applicability and is more efficient than existing methods for large-sample studies. It supports any statistical test based on contingency tables, and enables control of both the family-wise error rate and the false discovery rate. Extensive experiments show that the algorithm needs to examine only a small portion of the individuals to update the contingency tables, achieving at least an order-of-magnitude speedup over the brute-force approach.
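
The contingency tables at the heart of such tests can be made concrete with a toy sketch: tabulate (genotype pair, phenotype) cell counts for a pair of loci and compute a Pearson chi-square from them. This only illustrates the statistic being computed; TEAM's contribution is updating such tables incrementally via the minimum spanning tree rather than rebuilding them from all individuals, which this sketch does not implement.

```python
from collections import Counter

def two_locus_table(geno_a, geno_b, phenotype):
    """Cell counts indexed by (genotype at locus A, genotype at locus B,
    phenotype). Genotypes coded 0/1/2 (heterozygous = 1); phenotype
    1 = case, 0 = control. Illustrative toy, not TEAM's incremental update."""
    table = Counter()
    for a, b, p in zip(geno_a, geno_b, phenotype):
        table[(a, b, p)] += 1
    return table

def chi_square(table):
    """Pearson chi-square of (genotype pair) x (case/control) cell counts."""
    rows = sorted({(a, b) for a, b, _ in table})
    cols = [0, 1]
    n = sum(table.values())
    stat = 0.0
    for r in rows:
        row_total = sum(table[(*r, c)] for c in cols)
        for c in cols:
            col_total = sum(table[(*rr, c)] for rr in rows)
            expected = row_total * col_total / n
            if expected:
                stat += (table[(*r, c)] - expected) ** 2 / expected
    return stat

# Toy genotype vectors for 8 individuals at two loci.
geno_a = [0, 0, 1, 1, 2, 2, 0, 1]
geno_b = [0, 1, 0, 1, 2, 2, 0, 1]
cases  = [0, 0, 0, 1, 1, 1, 0, 1]
t = two_locus_table(geno_a, geno_b, cases)
print(chi_square(t))   # → 8.0
```

A genome-wide scan evaluates this statistic for every pair of loci, which is exactly why avoiding a full rebuild of the table per pair matters.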

June 14, 2010

SUPERTRIPLETS: a triplet-based supertree approach to phylogenomics

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 1:30 am

Phylogenetic tree-building methods use molecular data to represent the evolutionary history of genes and taxa. A recurrent problem is to reconcile the various phylogenies built from different genomic sequences into a single one. This task is generally conducted by a two-step approach whereby a binary representation of the initial trees is first inferred and then a maximum parsimony (MP) analysis is performed on it. This binary representation uses a decomposition of all source trees that is usually based on clades, but that can also be based on triplets or quartets. The relative performances of these representations have been discussed but are difficult to assess since both are limited to relatively small datasets.

Results: This article focuses on the triplet-based representation of source trees. We first recall how, using this representation, the parsimony analysis is related to the notion of a median tree. We then introduce SUPERTRIPLETS, a new algorithm specially designed to optimize this alternative formulation of the MP criterion. The method avoids several practical limitations of the triplet-based binary matrix representation, making it practical for large datasets. When the correct resolution of every triplet appears more often than the incorrect ones in the source trees, SUPERTRIPLETS is guaranteed to reconstruct the correct phylogeny. Both simulations and a case study on mammalian phylogenomics confirm the advantages of this approach. In both cases, SUPERTRIPLETS tends to propose less resolved but more reliable supertrees than those inferred using Matrix Representation with Parsimony (MRP).
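
The triplet decomposition of a source tree can be illustrated with a small sketch: for each 3-leaf subset, the pair contained in some proper clade that excludes the third leaf forms the resolved rooted triplet ab|c. The nested-tuple tree encoding and helper names below are illustrative assumptions, not SUPERTRIPLETS code.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of all internal-node leaf sets) of a rooted
    tree encoded as nested tuples, e.g. ((("A", "B"), "C"), "D")."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    out = []
    for child in tree:
        l, c = clades(child)
        leaves |= l
        out.extend(c)
    out.append(leaves)
    return leaves, out

def triplets(tree):
    """Resolved rooted triplets ((a, b), c) implied by the tree: for each
    3-leaf subset, the pair inside some proper clade excluding the third."""
    leaves, clade_list = clades(tree)
    result = set()
    for a, b, c in combinations(sorted(leaves), 3):
        for clade in clade_list:
            if clade == leaves:
                continue            # the root clade resolves nothing
            inside = {x for x in (a, b, c) if x in clade}
            if len(inside) == 2:
                pair = tuple(sorted(inside))
                outsider = ({a, b, c} - inside).pop()
                result.add((pair, outsider))
                break
    return result

t1 = ((("A", "B"), "C"), "D")
print(sorted(triplets(t1)))   # all four 3-leaf subsets are resolved
```

A supertree method in this spirit would pool such triplets over all source trees, weight each resolution by its frequency, and search for the tree agreeing with as much triplet weight as possible.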

Availability: Online and JAVA standalone versions of SUPERTRIPLETS are available at

June 11, 2010

A soft kinetic data structure for lesion border detection

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 10:30 am

Medical imaging and image processing techniques, ranging from the microscopic to the macroscopic, have become one of the main components of diagnostic procedures, assisting dermatologists in their medical decision-making. Computer-aided segmentation and border detection in dermoscopic images are core components of diagnostic procedures and therapeutic interventions for skin cancer. Automated assessment tools for dermoscopic images have become an important research field, mainly because of inter- and intra-observer variation in human interpretation. In this study, a novel approach to automatic border detection in dermoscopic images, the graph spanner, is proposed. The approach builds a proximity graph representation of a dermoscopic image in order to detect regions and borders in the skin lesion.

Results: The graph spanner approach was evaluated on a set of 100 dermoscopic images whose borders, manually drawn by a dermatologist, were used as the ground truth. Error rates, false positives and false negatives, along with true positives and true negatives, were quantified by digitally comparing the results with the manually determined borders. The results show that the highest precision and recall rates obtained in determining lesion boundaries are 100%, while accuracy averages 97.72% and the mean border error is 2.28% over the whole dataset.
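
The reported figures follow from the standard definitions of precision, recall and accuracy over pixel-level true/false positives and negatives. A minimal sketch, with made-up counts chosen only to show the arithmetic:

```python
def border_metrics(tp, fp, fn, tn):
    """Precision, recall and pixel accuracy from comparing a detected
    lesion region against a manually drawn ground-truth border.
    Generic definitions, shown to make the reported figures concrete."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical pixel counts for one image (illustrative, not the paper's data).
p, r, a = border_metrics(tp=950, fp=20, fn=30, tn=9000)
print(round(p, 3), round(r, 3), round(a, 3))   # → 0.979 0.969 0.995
```

Averaging the accuracy over 100 such images is how a dataset-wide figure like 97.72% would be obtained.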

June 7, 2010

A computational analysis of the antigenic properties of haemagglutinin in influenza A H3N2

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 9:00 am

Modelling antigenic shift in influenza A H3N2 can help to predict the efficiency of vaccines. The virus is known to exhibit sudden jumps in antigenic distance, and prediction of such novel strains from amino acid sequence differences remains a challenge.

Results: From analysis of 6624 amino acid sequences of wild-type H3, we propose updates to the frequently referenced list of 131 amino acids located at or near the five identified antibody binding regions in haemagglutinin (HA). We introduce a class of predictive models based on the analysis of amino acid changes in these binding regions, and extend the principle to changes in HA1 as a whole by dividing the molecule into regional bands.

Our results show that a range of simple models based on banded changes give better predictive performance than models based on the established five canonical regions and can identify a higher proportion of vaccine escape candidates among novel strains than a current state-of-the-art model.
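
The banded-changes idea can be sketched directly: take two aligned HA1 sequences and count amino acid substitutions within consecutive positional bands. The band width and toy sequences below are arbitrary illustrative choices, not the bands used in the study.

```python
def banded_changes(seq_a, seq_b, band_size=55):
    """Count amino acid differences between two aligned sequences in
    consecutive positional bands; the vector of per-band counts is the
    model input. Band width here is an illustrative assumption."""
    assert len(seq_a) == len(seq_b), "sequences must be pre-aligned"
    counts = []
    for start in range(0, len(seq_a), band_size):
        a = seq_a[start:start + band_size]
        b = seq_b[start:start + band_size]
        counts.append(sum(x != y for x, y in zip(a, b)))
    return counts

# Toy 10-residue fragments: one substitution in each 5-residue band.
print(banded_changes("MKTIIALSYI", "MKTLIALSHI", band_size=5))   # → [1, 1]
```

A predictive model would then relate this count vector for a candidate strain, relative to the current vaccine strain, to the probability of vaccine escape.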

