Biointelligence

January 13, 2010

Database of human Protein-DNA Interactions – hPDI

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 6:05 am

The characterization of the protein-DNA interactions usually requires three levels of analysis:

1. Genetic: Determination of the nucleotide sequence of the protein-binding region and the identification of sequence changes that confer a mutated phenotype.

2. Biochemical: Identification of potential protein-DNA contacts using a variety of footprinting and protection experiments such as DNaseI or hydroxyl radical footprinting, and methylation protection or ethylation protection experiments.

3. Physical: Analysis of specific interactions in protein-target sequence fragment co-crystals.

The hPDI database holds experimental protein-DNA interaction data for humans identified by protein microarray assays. The current release of hPDI contains 17,718 protein-DNA interactions for 1,013 human DNA-binding proteins. These DNA-binding proteins include 493 human transcription factors (TFs) and 520 unconventional DNA-binding proteins (uDBPs). The database is freely accessible for academic purposes.

hPDI can be accessed from here: http://www.xiezhi.org/hpdi-database

December 2, 2009

PRGdb: The Plant Resistance Genes Database

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 4:26 am

Plant disease resistance genes (R-genes) play a key role in recognizing proteins expressed by specific avirulence (Avr) genes of pathogens. R-genes originate from a phylogenetically ancient form of immunity that is common to plants and animals. However, the rapid evolution of plant immunity systems has led to enormous gene diversification. Although little is known about these agriculturally important genes, some fundamental genomic features have already been described. It has been recently shown that proteins encoded by resistance genes display modular domain structures and require several dynamic interactions between specific domains to perform their function. Some of these domains also seem necessary for proper interaction with Avr proteins and in the formation of signalling complexes that activate an innate immune response which arrests the proliferation of the invading pathogen.

PRGdb is a web accessible open-source (http://www.prgdb.org) database that represents the first bioinformatic resource providing a comprehensive overview of resistance genes (R-genes) in plants. PRGdb holds more than 16 000 known and putative R-genes belonging to 192 plant species challenged by 115 different pathogens and linked with useful biological information. The complete database includes a set of 73 manually curated reference R-genes, 6308 putative R-genes collected from NCBI and 10463 computationally predicted putative R-genes.

The Plant Resistance Genes (PRG) database is intended to serve as a research tool to identify and study genes involved in the disease resistance process in all plants. Data from a variety of online resources and the literature are stored in several sections to create a unified knowledge resource with emphasis on R-gene characterization and classification. The database is designed to allow easy integration with other data types and with existing and future databases. For each cloned R-gene (reference gene), a fine locus annotation is provided, together with homologous sequences and related disease sequences. Moreover, cross-links to pathogen and disease information are built to give a complete view of the plant-gene-pathogen interaction system.

PRGdb is freely available here: http://prgdb.cbm.fvg.it/

November 12, 2009

KEGGConverter: Tool for modelling Metabolic Networks

Filed under: Bioinformatics,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 7:47 am

The Kyoto Encyclopedia of Genes and Genomes (KEGG) PATHWAY database is a valuable, comprehensive collection of manually curated pathway maps for metabolism, genetic information processing and other functions. KEGG itself is an integrated database resource consisting of 16 main databases, broadly categorized into systems information, genomic information, and chemical information. Genomic and chemical information represent the molecular building blocks of life in the genomic and chemical spaces, respectively, and systems information represents functional aspects of biological systems, such as the cell and the organism, that are built from those building blocks. KEGG has been widely used as a reference knowledge base for biological interpretation of large-scale datasets generated by sequencing and other high-throughput experimental technologies.

The KEGG Pathway database is a valuable collection of metabolic pathway maps. Nevertheless, producing simulation-capable metabolic networks from KEGG Pathway data remains challenging and complicated, despite the tools already developed for this purpose. Although originally used for illustration, KEGG Pathways can, through their KGML (KEGG Markup Language) files, provide complete reaction sets and introduce species versioning, both of which are advantages when building simulation models of cellular metabolism.

KEGGconverter has been developed to construct such metabolic networks. It is a tool implemented in Java that produces integrated analogues of metabolic pathways appropriate for simulation tasks, taking only KGML files as input. The web application acts as a user-friendly shell that transparently automates biochemically correct pathway merging, conversion to SBML format, proper renaming of the species, and insertion of default kinetic properties for the corresponding reactions. It also permits the inclusion of additional reactions in the resulting model that represent flux cross-talk with neighbouring pathways, which improves the accuracy of the simulation.
KEGGconverter is available here: http://www.grissom.gr/keggconverter/
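
To make the idea of pulling a reaction set out of a KGML file more concrete, here is a minimal sketch in Python. It is not how KEGGconverter itself works (that tool is written in Java and does far more, such as merging pathways and writing SBML). The XML fragment is hand-written for illustration and the identifiers in it are made up; the element and attribute names (reaction, substrate, product, type) follow the KGML layout as I understand it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written KGML-style fragment (identifiers are invented).
KGML_SAMPLE = """
<pathway name="path:ex00001" org="ex" number="00001">
  <reaction id="1" name="rn:R_A" type="irreversible">
    <substrate id="10" name="cpd:C_X"/>
    <product id="11" name="cpd:C_Y"/>
  </reaction>
  <reaction id="2" name="rn:R_B" type="reversible">
    <substrate id="11" name="cpd:C_Y"/>
    <product id="12" name="cpd:C_Z"/>
  </reaction>
</pathway>
"""


def extract_reactions(kgml_text):
    """Return (reaction name, reversible?, substrates, products) tuples."""
    root = ET.fromstring(kgml_text.strip())
    reactions = []
    for rxn in root.iter("reaction"):
        substrates = [s.get("name") for s in rxn.findall("substrate")]
        products = [p.get("name") for p in rxn.findall("product")]
        reversible = rxn.get("type") == "reversible"
        reactions.append((rxn.get("name"), reversible, substrates, products))
    return reactions


reactions = extract_reactions(KGML_SAMPLE)
for name, reversible, subs, prods in reactions:
    arrow = "<=>" if reversible else "->"
    print(name, ":", " + ".join(subs), arrow, " + ".join(prods))
```

A reaction list like this is only the starting point; a simulation model still needs kinetic laws and consistent species names, which is the part the tool automates.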

November 9, 2009

CDD: Database for Interactive Domain Family Analysis

Filed under: Bioinformatics,Computational Biology,Proteomics — Biointelligence: Education,Training & Consultancy Services @ 8:30 am

Protein domains may be viewed as units in the molecular evolution of proteins and can be organized into an evolutionary classification. The set of protein domains characterized so far appears to describe no more than a few thousand superfamilies, where members of each superfamily are related to each other by common descent. Computational annotation of protein function is generally obtained via sequence similarity: once a close neighbor with known function has been identified, its annotation is copied to the sequence with unknown function. This strategy may work very well in functionally homogeneous families and when applied only for very close neighbors or suspected orthologs, but it is doomed to fail often when domain or protein families are sufficiently diverse and when no close neighbors with known function are available.

NCBI’s conserved domain database (CDD) attempts to collate that set and to organize related domain models in a hierarchical fashion, meant to reflect major ancient gene duplication events and subsequent functional diversification. CDD is part of NCBI’s Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. CDD provides a strategy toward a more accurate assessment of such neighbor relationships, similar to approaches termed ‘phylogenomic inference’. CDD acknowledges that protein domain families may be very diverse and that they may contain sets of related subfamilies.

In CDD curation, we attempt to detect evidence for duplication and functional divergence in domain families by means of phylogenetic analysis. We record the resulting subfamily structure as a set of explicit models, but limit the analysis to ancient duplication events, several hundred million years in the past, as judged by the taxonomic distribution of protein sequences with particular domain subfamily footprints. CDD provides a search tool employing reverse position-specific BLAST (RPS-BLAST), where query sequences are compared to databases of position-specific score matrices (PSSMs), and E-values are obtained in much the same way as in the widely used PSI-BLAST application.
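
As a rough illustration of what "comparing a query to a position-specific score matrix" means, the sketch below slides a tiny PSSM along a protein sequence and reports the best-scoring window. This is a toy, not RPS-BLAST: the real tool uses BLAST heuristics, gapped alignment and E-value statistics. The matrix values, the default penalty and the query sequence are all invented for the example.

```python
# Toy PSSM: one dict of residue scores per column (values are invented).
PSSM = [
    {"A": 3, "G": 1},
    {"C": 4},
    {"D": 2, "E": 2},
]
DEFAULT_SCORE = -1  # assumed penalty for residues not listed in a column


def score_window(window, pssm):
    """Sum the column scores for an ungapped window as wide as the PSSM."""
    return sum(col.get(res, DEFAULT_SCORE) for res, col in zip(window, pssm))


def best_hit(query, pssm):
    """Return (score, offset) of the best ungapped match in the query."""
    width = len(pssm)
    scored = [
        (score_window(query[i:i + width], pssm), i)
        for i in range(len(query) - width + 1)
    ]
    return max(scored)


score, offset = best_hit("MKACDLGCE", PSSM)
print("best score", score, "at offset", offset)
```

The point of a PSSM is visible even at this scale: the same residue can score differently depending on the column it is matched against, which a single substitution matrix cannot express.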

CDD is hosted here: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

September 21, 2009

A New miRNA Knowledgebase: miRò

Filed under: Bioinformatics — Biointelligence: Education,Training & Consultancy Services @ 1:59 am

Post Transcriptional Gene Silencing (PTGS) is a highly conserved mechanism of gene expression regulation and microRNAs (miRNAs) are its main actors. These little RNA molecules are able to bind to specific sites located in the 3′ untranslated regions (UTRs) of target transcripts, inhibiting their translation or promoting their degradation. Although much effort has demonstrated their crucial role in several physiological and pathological processes, their mechanisms of action still remain unclear.

miRò is a web-based knowledge base that provides users with miRNA–phenotype associations in humans. It integrates data from various online sources, such as databases of miRNAs, ontologies, diseases and targets, into a unified database equipped with an intuitive and flexible query interface and data mining facilities. The main goal of miRò is to establish a knowledge base that allows non-trivial analysis through sophisticated mining techniques and introduces a new layer of associations between genes and phenotypes, inferred from miRNA annotations. Furthermore, a specificity function applied to validated data highlights the most significant associations.
The miRò web site is available at: http://ferrolab.dmi.unict.it/miro.

September 13, 2009

AN INTRODUCTION TO HAPMAP

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 1:32 pm

THE INTERNATIONAL HAPMAP PROJECT: http://www.hapmap.org/

The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.

The DNA in our cells contains long chains of four chemical building blocks: adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. (See http://www.dnaftb.org/dnaftb/ for basic information about genetics.) These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.

The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct “spelling” of a chromosomal region is called an allele, and a collection of alleles in a person’s chromosomes is known as a genotype.

Differences in individual bases are by far the most common type of genetic variation. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced “snips”). By identifying most of the approximately 10 million SNPs estimated to occur commonly in the human genome, the International HapMap Project is identifying the basis for a large fraction of the genetic diversity in the human species.
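
A single-base difference is easy to picture in code. The sketch below compares two already-aligned sequences of equal length and lists the positions where they differ. The two sequences are invented, and real SNP discovery works on many individuals and has to deal with alignment and sequencing error, none of which is modelled here.

```python
def snp_positions(seq_a, seq_b):
    """List (position, base_a, base_b) where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical stretch of the same chromosomal region in two people.
person_1 = "ACGTTAGCCA"
person_2 = "ACGTTGGCCA"

diffs = snp_positions(person_1, person_2)
print(diffs)
```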

For geneticists, SNPs act as markers to locate genes in DNA sequences. Say that a spelling change in a gene increases the risk of suffering from high blood pressure, but researchers do not know where in our chromosomes that gene is located. They could compare the SNPs in people who have high blood pressure with the SNPs of people who do not. If a particular SNP is more common among people with hypertension, that SNP could be used as a pointer to locate and identify the gene involved in the disease.
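
The comparison described above can be sketched as a simple allele count in two groups. The genotypes below are invented, and the group sizes are far too small for the chi-square statistic to mean anything in practice; the code only shows the shape of the calculation, not a recommended analysis.

```python
def count_allele(genotypes, allele):
    """Count how many copies of an allele appear across diploid genotypes."""
    return sum(g.count(allele) for g in genotypes)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical genotypes at one SNP (alleles A and G).
cases = ["AA", "AG", "AA", "AG", "AA"]      # people with high blood pressure
controls = ["GG", "AG", "GG", "GG", "AG"]   # people without

case_a, case_g = count_allele(cases, "A"), count_allele(cases, "G")
ctrl_a, ctrl_g = count_allele(controls, "A"), count_allele(controls, "G")

chi2 = chi_square_2x2(case_a, case_g, ctrl_a, ctrl_g)
print("A allele frequency in cases:", case_a / (case_a + case_g))
print("A allele frequency in controls:", ctrl_a / (ctrl_a + ctrl_g))
print("chi-square:", chi2)
```

A large statistic suggests the allele is unevenly distributed between the two groups, which is what makes the SNP useful as a pointer to a nearby gene.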

However, testing all of the 10 million common SNPs in a person’s chromosomes would be extremely expensive. The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A. These regions of linked variants are known as haplotypes.

In many parts of our chromosomes, just a handful of haplotypes are found in humans. [See The Origins of Haplotypes.] In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It also is identifying “tag” SNPs that uniquely identify these haplotypes. By testing an individual’s tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person’s DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs.
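
The idea of a tag SNP can be shown with a brute-force search: given a handful of haplotypes, find the smallest set of SNP positions whose alleles are enough to tell every haplotype apart. The haplotypes here are invented, and this is not how the HapMap Project selects tag SNPs; as far as I know, real methods work from population-level correlation between SNPs rather than exhaustive search.

```python
from itertools import combinations

# Hypothetical haplotypes over four SNP sites in one chromosomal region.
HAPLOTYPES = ["ACGT", "ATGA", "GCCT"]


def find_tag_snps(haplotypes):
    """Return the smallest tuple of site indices that distinguishes all haplotypes."""
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return sites
    return tuple(range(n_sites))  # fall back to every site


tags = find_tag_snps(HAPLOTYPES)
print("tag SNP sites:", tags)
```

In this toy case no single site separates all three haplotypes, but two sites do, so only those two would need to be genotyped.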

Once the information on tag SNPs from the HapMap is available, researchers will be able to use them to locate genes involved in medically important traits. Consider the researcher trying to find genetic variants associated with high blood pressure. Instead of determining the identity of all SNPs in a person’s DNA, the researcher would genotype a much smaller number of tag SNPs to determine the collection of haplotypes present in each subject. The researcher could focus on specific candidate genes that may be associated with a disease, or even look across the entire genome to find chromosomal regions that may be associated with a disease. If people with high blood pressure tend to share a particular haplotype, variants contributing to the disease might be somewhere within or near that haplotype.

August 9, 2009

Bioinformatics Tools: NCBI Tools for Data Mining – Part I

Filed under: Bioinformatics,Computational Biology — Biointelligence: Education,Training & Consultancy Services @ 11:41 am

Here is a list of tools hosted by NCBI for data mining:

Tools for Nucleotide Sequence Analysis

BLAST:

The Basic Local Alignment Search Tool compares gene and protein sequences against others in public databases. It now comes in several types, including PSI-BLAST, PHI-BLAST, and BLAST 2 sequences. Specialized BLASTs are also available for human, microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.

Electronic PCR:

It allows you to search your DNA sequence for sequence tagged sites (STSs) that have been used as landmarks in various types of genomic maps. It compares the query sequence against data in NCBI’s UniSTS, a unified, non-redundant view of STSs from a wide range of sources.

Entrez Gene:

Each Entrez Gene record encapsulates a wide range of information for a given gene and organism. When possible, the information includes results of analyses that have been done on the sequence data. The amount and type of information presented depend on what is available for a particular gene and organism and can include: (1) graphic summary of the genomic context, intron/exon structure, and flanking genes, (2) link to a graphic view of the mRNA sequence, which in turn shows biological features such as CDS, SNPs, etc., (3) links to gene ontology and phenotypic information, (4) links to corresponding protein sequence data and conserved domains, (5) links to related resources, such as mutation databases. Entrez Gene is a successor to LocusLink.

Model Maker:

Model Maker allows you to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to assembled genomic sequence to build a gene model, and to edit the model by selecting or removing putative exons. You can then view the mRNA sequence and potential ORFs for the edited model and save the mRNA sequence data for use in other programs. Model Maker is accessible from sequence maps that were analyzed at NCBI and displayed in Map Viewer.

ORF Finder:

ORF Finder identifies all possible ORFs in a DNA sequence by locating the standard and alternative stop and start codons. The deduced amino acid sequences can then be used to BLAST against GenBank. ORF Finder is also packaged in the sequence submission software Sequin.
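
The basic scan behind an ORF search can be sketched in a few lines of Python. This is a simplified stand-in, not NCBI's ORF Finder: it only looks at the three forward reading frames, uses only the standard ATG start and the three standard stop codons, and ignores alternative genetic codes. The function name, the min_codons parameter and the test sequence are all my own choices for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find ATG...stop ORFs in the three forward frames.

    Returns (start, end, sequence) tuples, with end exclusive and the
    stop codon included. min_codons counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGGCTTAAGG")
print(orfs)
```

A fuller version would also scan the reverse complement to cover the other three reading frames.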

SAGEmap:

It is a tool for performing statistical tests designed specifically for differential-type analyses of SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP), which have been submitted to Gene Expression Omnibus (GEO). Gene expression profiles that compare the expression in different SAGE libraries are also available on the Entrez GEO Profiles pages. It is possible to enter a query sequence in the SAGEmap resource to determine what SAGE tags are in the sequence, then map to associated SAGEtag records and view the expression of those tags in different CGAP SAGE libraries.

Spidey:

It aligns one or more mRNA sequences to a single genomic sequence. Spidey will try to determine the exon/intron structure, returning one or more models of the genomic structure, including the genomic/mRNA alignments for each exon.

VecScreen:

It is a tool for identifying segments of a nucleic acid sequence that may be of vector, linker, or adapter origin prior to sequence analysis or submission. VecScreen was developed to combat the problem of vector contamination in public sequence databases.

Part II of NCBI Tools in the next post… Keep Visiting !!!!

August 5, 2009

BioGRID: A repository useful for Systems Biology

Filed under: Bioinformatics,Systems Biology — Biointelligence: Education,Training & Consultancy Services @ 6:04 am

Systems Biology is emerging as one of the biggest research trends these days. Pathways, metabolomics, cell cycles and interactions are everyday topics in this field.

While reading about interaction datasets, I came across “BioGRID”. Here is a short post about it.

BioGRID stands for the Biological General Repository for Interaction Datasets. It distributes collections of protein and genetic interactions from major model organism species. BioGRID currently contains over 198 000 interactions from six different species, as derived from both high-throughput studies and conventional focused studies.

BioGRID interactions are recorded as relationships between two proteins or genes (i.e. they are binary relationships) with an evidence code that supports the interaction and a publication reference. The term “interaction” covers not only direct physical binding of two proteins but also co-existence in a stable complex and genetic interaction. It should not be assumed that an interaction reported in BioGRID is direct and physical in nature; the experimental system definitions indicate the nature of the supporting evidence for an interaction between the two biological entities. It should also be noted that interactions in BioGRID have varying levels of evidential support. BioGRID simply curates the result of the experiment from the publication and does not guarantee that any individual interaction is true, well-established or the current consensus view of the community. Curating all available evidence supporting an interaction enables orthogonal data from various sources to be collated, allowing users of the database to judge their confidence in the existence and/or physiological relevance of that interaction.
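
The record structure described above (two interactors, an evidence code, a publication) maps naturally onto a small data class. The sketch below is only an illustration of that idea and of collating evidence per pair; it is not BioGRID's actual schema or file format. The gene names and publication ids are placeholders, and the evidence labels are meant as examples of experimental system names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    gene_a: str
    gene_b: str
    evidence: str      # experimental system supporting the interaction
    publication: str   # placeholder reference id


# Hypothetical records; none of these are real BioGRID entries.
RECORDS = [
    Interaction("GENE1", "GENE2", "Two-hybrid", "PUB_1"),
    Interaction("GENE2", "GENE1", "Affinity Capture-MS", "PUB_2"),
    Interaction("GENE1", "GENE3", "Synthetic Lethality", "PUB_3"),
]


def collate_evidence(records):
    """Group (evidence, publication) pairs by unordered interactor pair."""
    support = defaultdict(set)
    for r in records:
        pair = frozenset((r.gene_a, r.gene_b))  # A-B and B-A are the same pair
        support[pair].add((r.evidence, r.publication))
    return support


support = collate_evidence(RECORDS)
for pair, evidence in support.items():
    print(sorted(pair), "supported by", len(evidence), "record(s)")
```

Counting independent records per pair is one simple way a user might start to weigh how much support an interaction has, in the spirit of the last sentence above.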

More information on BioGRID can be found at: http://www.thebiogrid.org