An exploration of OS cheminformatics tools

In November I’m starting a new project in collaboration with Cambridge and Leeds Universities to link the information on the MCM website with the IUPAC Chemical Kinetics database. To achieve our aims we are going to need to improve the cheminformatics tools underlying these sites, so I have been reviewing our options for using open-source products.

Casting about for promising OS projects I was led to the Blue Obelisk website, a gathering of chemical informaticians working on various projects, two of which seem to stand out from the crowd: OpenBabel and the Chemistry Development Kit (CDK).

These two projects clearly have a thriving developer community and a lot going for them. In my experimenting I had cause to contact developers from both projects and was delighted to get quick constructive feedback (more on this later).

OpenBabel is written in C++ with SWIG-based bindings to Python, Perl and Ruby. As its name suggests, it concentrates on translating between chemical file formats, but the API also provides a good general cheminformatics toolbox. CDK, by contrast, is written in Java and aims to provide a toolkit to underpin interactive tools such as JChemPaint and Jmol. The range of features covered by CDK is much broader than OpenBabel’s, but quite a lot of them appear to be work in progress.

The common field between the MCM and IUPAC is atmospheric chemical kinetics and therefore central to our requirements is the representation of radical species. Perhaps not surprisingly in a field dominated by pharmaceuticals and biochemistry, cheminformatics tools have tended to do a poor job of representing radicals. As a starting point I wanted to see how OpenBabel and CDK handled radical input, particularly in SMILES and MDL Molfile formats since the MCM already uses these formats.

SMILES is a particular problem here because it has never officially supported radicals. The MCM has used the Accord Excel & Access plugins which use an extension where “[C.]” signifies a carbon radical centre. So how did the two toolkits match up?
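As a rough illustration of what that extension looks like to a program, here is a minimal Python sketch that picks out atoms written in the “[C.]” style. The pattern and the function name `radical_centres` are my own for illustration, not part of Accord or any toolkit, and the sketch assumes the dot directly follows the element symbol inside the brackets.

```python
import re

# Matches a bracket atom carrying the radical dot, e.g. "[C.]" or "[O.]".
# Assumption: nothing but the element symbol and the dot sits inside the brackets.
RADICAL_ATOM = re.compile(r"\[([A-Z][a-z]?)\.\]")

def radical_centres(smiles):
    """Return the element symbols of atoms flagged with the '[X.]' extension."""
    return RADICAL_ATOM.findall(smiles)

print(radical_centres("CC[C.]"))   # ['C']
print(radical_centres("CCO[O.]"))  # ['O']
print(radical_centres("CCOO"))     # []
```

This only spots the notation; it does not parse SMILES or validate the structure.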

OpenBabel

Browsing the OpenBabel wiki turned up this page showing that the OpenBabel developers are very aware of the problem. Although OpenBabel supports “[C.]” etc. as input, it always generates SMILES radicals using explicit hydrogens. E.g.

$ echo "CC[C.]" | babel -ismi -osmi
CC[CH2]
$ echo "CC[CH]C" | babel -ismi -osmi
CC[CH]C

This form is unambiguous and was also recognised by a couple of non-opensource toolkits with academic licences (Marvin and CACTVS). OpenBabel handled the “RAD” property of MDL molfiles successfully, so I thought it had everything we needed. There were a couple of hitches yet to uncover, but before I go into them let’s turn to CDK.
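The arithmetic behind the explicit-hydrogen form can be sketched in a few lines of Python: the hydrogens left on a radical centre are its usual valence, minus the bonds it already has, minus the unpaired electron. The valence table and function names below are assumptions made for this sketch; real toolkits use fuller valence models.

```python
# Usual neutral valences for a few elements (an assumption for this sketch).
NORMAL_VALENCE = {"C": 4, "N": 3, "O": 2}

def radical_hydrogens(element, bond_order_sum, unpaired=1):
    """Hydrogens on a radical centre: valence minus bonds minus unpaired electrons."""
    count = NORMAL_VALENCE[element] - bond_order_sum - unpaired
    if count < 0:
        raise ValueError("more bonds and unpaired electrons than the valence allows")
    return count

def bracket_atom(element, bond_order_sum, unpaired=1):
    """Write the centre as a bracket atom with its hydrogen count spelled out."""
    h = radical_hydrogens(element, bond_order_sum, unpaired)
    if h == 0:
        return "[" + element + "]"
    if h == 1:
        return "[" + element + "H]"
    return "[" + element + "H" + str(h) + "]"

print(bracket_atom("C", 1))  # terminal carbon radical, as in CC[CH2]
print(bracket_atom("C", 2))  # mid-chain carbon radical, as in CC[CH]C
```

Because the hydrogen count is written out, a reader of the SMILES does not need to know about the dot extension to see that the atom is short of its normal valence.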

CDK

I’m not a Java developer, but with the language’s dominance in many areas I’m always happy to find an opportunity to learn it properly. For my test I used JPype as a familiar Python shell to play with the CDK API. Even though I’d already taken a liking to OpenBabel, I wanted to give CDK a good test because it has one feature that OpenBabel doesn’t: molecule depiction, that is, generating 2D coordinates and a resulting image for chemical structures.

Unfortunately CDK’s radical support is less complete. It rejected the “[C.]” SMILES extension and added hydrogens rather than radicals when fed hydrogen-deficient SMILES atoms. It did recognise radicals in MDL molfiles and the CML spinMultiplicity attribute, producing a reasonably rendered image of a cyclic radical.

Since SMILES radical support is so important to us I submitted a bug report on SourceForge and got a quick response pointing out that SMILES doesn’t support radicals at all. This exposes SMILES’ great weakness. SMILES is a proprietary format of the Daylight Corporation and to my knowledge has never had a published standard, so different implementers have extended SMILES to support radicals in different ways. You can’t blame CDK for concentrating on compliance with the Daylight toolkit, but I hope opensource tools can converge on this issue. To my mind OpenBabel’s approach of using hydrogen deficiency is the way forward.

It looks like we would be able to use CDK to depict species by passing molfiles or CML to it. In this way we could use OpenBabel and CDK together to do what we want.
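Since molfiles would be the hand-off format, it may help to see where the radical information sits in one. As I understand the V2000 layout, radicals are carried on `M  RAD` property lines as a count followed by atom-number and value pairs, with 1, 2 and 3 meaning singlet, doublet and triplet. The Python sketch below reads such lines; the function name and the sample line are made up for illustration, and it uses whitespace splitting rather than strict fixed-width parsing.

```python
# Radical codes used on "M  RAD" lines (V2000 convention, as far as I know).
RADICAL_NAMES = {0: "none", 1: "singlet", 2: "doublet", 3: "triplet"}

def parse_rad_lines(molfile_text):
    """Map atom number -> radical type for every 'M  RAD' line in the text."""
    radicals = {}
    for line in molfile_text.splitlines():
        if not line.startswith("M  RAD"):
            continue
        fields = line[6:].split()
        count = int(fields[0])               # number of atom/value pairs on this line
        values = fields[1:1 + 2 * count]
        for atom, code in zip(values[0::2], values[1::2]):
            radicals[int(atom)] = RADICAL_NAMES[int(code)]
    return radicals

# Hypothetical property block: atom 3 is a doublet (monoradical) centre.
sample = "M  RAD  1   3   2\nM  END"
print(parse_rad_lines(sample))  # {3: 'doublet'}
```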

Returning to OpenBabel

I was having a lot of success using the Python bindings to OpenBabel until I tried a peroxy radical. It’s easiest to illustrate the problem on the command-line:

$ echo "CCO[O]" | babel -ismi -osmi
CCOO

Here OpenBabel doesn’t appear to convert the hydrogen-deficient oxygen atom into a radical. The problem is exposed further by feeding babel’s output back into babel:

$ echo "CCO[O.]" | babel -ismi -osmi | babel -ismi -osmi
CCOO

Disaster! babel forgets the radical centre on successive conversions. However, there is a happy ending. I have learned never to criticise an opensource project until you have tested the bleeding-edge version, so I checked out the development version from the SourceForge SVN. With relief I discovered the bug had been fixed:

$ echo "CCO[O.]" | babel -ismi -osmi | babel -ismi -osmi
CCO[O]
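A round-trip check like this is easy to automate. The Python sketch below is one way it might look: `babel_smiles` wraps the same `babel -ismi -osmi` command via `subprocess`, and `is_stable` reports whether repeated passes keep returning the same string. All the function names are my own, and the two dictionary-based converters are stand-ins that mimic the behaviour described in this post so the example runs without babel installed; the first-pass result for the old release is inferred rather than shown above.

```python
import subprocess

def babel_smiles(smiles):
    """One SMILES-to-SMILES pass through babel (requires babel on the PATH)."""
    result = subprocess.run(
        ["babel", "-ismi", "-osmi"],
        input=smiles + "\n", capture_output=True, text=True, check=True,
    )
    # Keep only the first whitespace-separated token, the SMILES itself.
    return result.stdout.split()[0]

def round_trip(smiles, convert, passes=2):
    """Apply convert repeatedly and return the input plus every intermediate result."""
    history = [smiles]
    for _ in range(passes):
        history.append(convert(history[-1]))
    return history

def is_stable(smiles, convert, passes=2):
    """True if every pass produces the same output string."""
    return len(set(round_trip(smiles, convert, passes)[1:])) == 1

# Stand-in converters (not real babel calls) mimicking the two versions.
old_release = {"CCO[O.]": "CCO[O]", "CCO[O]": "CCOO", "CCOO": "CCOO"}.get
svn_version = {"CCO[O.]": "CCO[O]", "CCO[O]": "CCO[O]"}.get

print(is_stable("CCO[O.]", old_release))  # False
print(is_stable("CCO[O.]", svn_version))  # True
```

To run it against a real installation, pass `babel_smiles` as the `convert` argument in place of the stand-ins.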

The sting in the tail

OpenBabel was to deal me one last surprise. The Python bindings on the SVN version failed to compile. I sent a message to openbabel-scripting and got a response within minutes (some of the developers must be in Europe). This is a very good sign for anyone thinking of starting critical development work with an opensource codebase. A small patch and everything was fine. By the time you read this it may well be in SVN.

