Ongoing research keeps increasing the amount of biological data available. This exponential growth raises two problems:

1. Efficient storage and management of the information.

2. Extraction of useful information from these data, which requires the development of tools and methods capable of transforming all these heterogeneous data into biological knowledge about the underlying mechanisms.

Machine learning techniques are applied to extract knowledge from data in various biological domains. The figure below shows the main areas of biology where computational methods are being applied, including genomics, proteomics, microarrays, evolution and text mining.

In addition to all the above applications, computational techniques are used to solve other problems, such as efficient primer design for PCR, biological image analysis and backtranslation of proteins (which is, given the degeneracy of the genetic code, a complex combinatorial problem).

Machine learning consists of programming computers to optimize a performance criterion using example data or past experience. The optimized criterion can be the accuracy of a predictive model (in a modelling problem) or the value of a fitness or evaluation function (in an optimization problem). Machine learning uses statistical theory when building computational models, since the objective is to make inferences from a sample. The two main steps in this process are:

1. Inducing the model by processing the huge amount of data.

2. Representing the model and making inferences efficiently.
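The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-centroid classifier, the function names and the toy expression-like values are all hypothetical and are not taken from the article discussed here. The sketch induces a model from labelled examples, uses it for inference, and reports accuracy as the performance criterion.

```python
from statistics import mean

def induce_model(samples, labels):
    """Step 1: induce a model (one centroid per class) from example data."""
    by_class = {}
    for features, label in zip(samples, labels):
        by_class.setdefault(label, []).append(features)
    # Each centroid is the per-feature mean of the samples in that class.
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_class.items()}

def predict(model, features):
    """Step 2: make an inference by picking the class with the closest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def accuracy(model, samples, labels):
    """Performance criterion: fraction of samples classified correctly."""
    hits = sum(predict(model, s) == y for s, y in zip(samples, labels))
    return hits / len(labels)

# Hypothetical toy data: two expression-like features per sample.
train_x = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.2], [3.1, 2.9]]
train_y = ["control", "control", "tumour", "tumour"]
test_x = [[0.9, 1.1], [2.8, 3.0]]
test_y = ["control", "tumour"]

model = induce_model(train_x, train_y)
test_accuracy = accuracy(model, test_x, test_y)
print(test_accuracy)
```

Accuracy is measured on samples that were not used to induce the model, which is what makes it an estimate of how well the model generalises from the sample.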

The process of transforming data into knowledge is both iterative and interactive. The iterative phase consists of several steps.

In the first step, the different sources of information are integrated and merged into a single format. Data warehouse techniques are used to detect and resolve outliers and inconsistencies.

In the second step, the data are selected, cleaned and transformed. This means eliminating or correcting incorrect data and deciding on a strategy for imputing missing data. This step also selects the relevant and non-redundant variables; the same kind of selection can be applied to the instances.

In the third step, called data mining, the objectives of the study are taken into account in order to choose the most appropriate analysis for the data. Here the type of paradigm, for supervised or unsupervised classification, is selected and the model is induced from the data.

Once the model is obtained, it should be evaluated and interpreted, from both statistical and biological points of view, and, if necessary, we return to the previous steps for a new iteration. This includes resolving conflicts with the current knowledge in the domain. The satisfactorily checked model, and the new knowledge discovered, are then used to solve the problem.
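The iterative steps described above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the record layout, the gene names, the mean-imputation strategy, the zero-variance filter and the nearest-class-mean model are hypothetical choices, not the method of any particular study.

```python
from statistics import mean, pvariance

def integrate(*sources):
    """Step 1: merge several sources into one list of records."""
    return [dict(row) for source in sources for row in source]

def impute_mean(rows, names):
    """Step 2a: replace missing values (None) with the observed mean."""
    for name in names:
        fill = mean(r[name] for r in rows if r[name] is not None)
        for r in rows:
            if r[name] is None:
                r[name] = fill
    return rows

def select_variables(rows, names):
    """Step 2b: keep only variables that actually vary across samples."""
    return [n for n in names if pvariance([r[n] for r in rows]) > 0]

def induce(rows, names):
    """Step 3: induce a nearest-class-mean model from the data."""
    classes = {r["label"] for r in rows}
    return {c: [mean(r[n] for r in rows if r["label"] == c) for n in names]
            for c in classes}

def classify(model, row, names):
    """Assign the class whose mean is closest to the sample."""
    def sq_dist(centre):
        return sum((row[n] - m) ** 2 for n, m in zip(names, centre))
    return min(model, key=lambda c: sq_dist(model[c]))

def leave_one_out_accuracy(rows, names):
    """Evaluation: hold out each sample in turn and classify it."""
    hits = 0
    for i, held_out in enumerate(rows):
        model = induce(rows[:i] + rows[i + 1:], names)
        hits += classify(model, held_out, names) == held_out["label"]
    return hits / len(rows)

# Hypothetical data from two sources; gene_y is constant, one gene_x is missing.
source_a = [{"id": "s1", "gene_x": 2.0, "gene_y": 5.0, "label": "case"},
            {"id": "s2", "gene_x": 2.2, "gene_y": 5.0, "label": "case"}]
source_b = [{"id": "s3", "gene_x": None, "gene_y": 5.0, "label": "control"},
            {"id": "s4", "gene_x": 0.4, "gene_y": 5.0, "label": "control"},
            {"id": "s5", "gene_x": 0.6, "gene_y": 5.0, "label": "control"},
            {"id": "s6", "gene_x": 0.5, "gene_y": 5.0, "label": "control"}]

records = impute_mean(integrate(source_a, source_b), ["gene_x", "gene_y"])
selected = select_variables(records, ["gene_x", "gene_y"])
loo_accuracy = leave_one_out_accuracy(records, selected)
print(selected, loo_accuracy)
```

For brevity the imputation is done once, before evaluation; in a real study it would normally be repeated inside each evaluation fold so that the held-out sample does not influence the fill value. If the evaluation were unsatisfactory, the earlier steps would be revisited, which is the iterative part of the process.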

An article published in the journal ‘Briefings in Bioinformatics’ gives an insight into the various machine learning techniques used in bioinformatics. It covers major techniques such as Bayesian classifiers, logistic regression, discriminant analysis, classification trees, nearest neighbour, neural networks, support vector machines, clustering and hidden Markov models, among others.

The article can be found here: http://bib.oxfordjournals.org/cgi/content/full/7/1/86?maxtoshow=&HITS=&hits=&RESULTFORMAT=&fulltext=bioinformatics&andorexactfulltext=and&searchid=1&FIRSTINDEX=0&resourcetype=HWCIT