November 17, 2009

OrChem: A Chemistry Search Engine for Oracle

Registration, indexing and searching of chemical structures in relational databases is one of the core areas of cheminformatics. Research on the topic goes back to the 1960s and probably earlier. However, little detail has been published on the inner workings of search engines, and development has been mostly closed-source. As a result, despite more than thirty years of research and publications, very little open reference code is available for use and study. The cheminformatics open source community has been working since the mid-1990s to overcome this problem.

OrChem is an extension for the Oracle 11g database that adds registration and indexing of chemical structures to support fast substructure and similarity searching. The cheminformatics functionality is provided by the Chemistry Development Kit. OrChem provides similarity searching with response times on the order of seconds for databases with millions of compounds, depending on the chosen similarity cut-off. For substructure searching, it can use the multiple processor cores of today’s database servers to provide fast response times on equally large data sets.

OrChem is built on top of the Chemistry Development Kit (CDK) and depends on this Java library in numerous ways. For example, compounds are represented internally as CDK molecule objects, the CDK’s I/O package is used to retrieve compound data, and its subgraph isomorphism algorithms are used for substructure validation. OrChem adds its own Java layer on top of the CDK to implement fast database storage and retrieval. With the CDK loaded into Oracle, a large cheminformatics library becomes readily available to PL/SQL. With little effort developers can build database functions around the CDK and so quickly implement chemistry extensions for Oracle. OrChem works in the same way, using the CDK where possible.

OrChem uses chemical fingerprints to find compounds by substructure or similarity criteria. Fingerprints are bitsets in which each bit indicates the presence or absence of a particular chemical feature. During a similarity search the fingerprints are used to calculate a Tanimoto measure. The Tanimoto similarity between two binary fingerprints is the number of bits set to one in both fingerprints divided by the number of bits set to one in at least one of them. For substructure searching the fingerprint has a different function: it is used to screen out candidates that cannot match before the computationally more expensive isomorphism test is run.
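
The two uses of a fingerprint described above can be illustrated with a minimal sketch built on `java.util.BitSet`. This is not OrChem's actual code: the class name `FingerprintSketch`, its method names, and the bit positions are hypothetical, and returning 0.0 for two empty fingerprints is an assumed convention.

```java
import java.util.BitSet;

// Illustrative sketch only; names and conventions here are assumptions,
// not taken from OrChem or the CDK.
final class FingerprintSketch {

    /** Tanimoto similarity: bits set in both fingerprints / bits set in either. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);                       // intersection of the two fingerprints
        int common = both.cardinality();
        int either = a.cardinality() + b.cardinality() - common;
        // Assumed convention: two empty fingerprints have similarity 0.0.
        return either == 0 ? 0.0 : (double) common / either;
    }

    /**
     * Substructure screen: a candidate can only contain the query as a
     * substructure if every bit set in the query is also set in the candidate.
     * Passing the screen does not prove a match; an isomorphism test must follow.
     */
    static boolean passesScreen(BitSet query, BitSet candidate) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(candidate);         // query bits absent from the candidate
        return missing.isEmpty();
    }

    /** Helper that builds a fingerprint from a list of set bit positions. */
    static BitSet bits(int... positions) {
        BitSet fp = new BitSet();
        for (int p : positions) {
            fp.set(p);
        }
        return fp;
    }

    public static void main(String[] args) {
        BitSet query = bits(1, 4, 7);
        BitSet candidate = bits(1, 2, 4, 7, 9);
        System.out.println(tanimoto(query, candidate));     // 3 common / 5 total = 0.6
        System.out.println(passesScreen(query, candidate)); // true
        System.out.println(passesScreen(candidate, query)); // false
    }
}
```

The screen can rule a candidate out but never rule it in, which is why the expensive isomorphism test only needs to run on the compounds that pass it.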

The OrChem search engine would definitely prove beneficial to the cheminformatics community.


