CML and document vs. database
In this second post in a series laying out my ideas for the MCM/IUPAC kinetic database integration, I’m going to take a look at a suitable data format.
We are dealing with a mixture of symbolic information (chemical structures and reactions), numeric or algebraic information (rate coefficients, temperature-dependent rate expressions, branching ratios) and textual information (references, explanations of experimental details, etc.). For the chemistry-specific symbolic stuff there is, in my view, only one game in town: CML.
I’ve been keeping half an eye on CML for some time but I am a long way from knowing everything about it, so the following opinions come with the usual caveat of my fallibility. CML was born well before the world and his aunt jumped on the XML bandwagon. In keeping with the original vision of XML it is of the “XML as document” rather than the “XML as data type” school, aiming to enable more semantically rich electronic publishing. In fact CML offers much more than just a representation of species and reactions. Through its various modular extensions (STMML, CMLReact, CMLComp and CMLCM, to mention just the most mature ones) CML can capture a lot of the structure of a typical chemistry publication.
This is great news for us, and particularly for IUPAC, who currently publish their data as a set of PDF files. If desired, I expect it would be relatively straightforward to translate all the IUPAC datasheets, in their entirety, into a mixture of CMLCore, CMLReact and STMML.
CML can achieve this by being unashamedly liberal about semantics. The schemas (CMLCore, STMML, CMLReact) are sprinkled with elements documented as “deliberately very general” (cml:substance), “content model is deliberately lax” (cml:identifier) and “no controlled semantics” (cml:observation). This might be worrying if one had to write software to handle all possible constructions, but that isn’t our requirement, and there is a clear mechanism for being more restrictive: many elements have a convention attribute (again with no controlled vocabulary).
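To make the convention idea concrete, here is a minimal Python sketch of how software could stay lax by default and only apply stricter checks to elements that declare a convention it knows about. The element names follow CMLReact as I understand it, but the fragment, the ids and the `mcm:kinetics` convention value are made up for illustration, and the check itself is just an assumption about what "more restrictive" might mean for us.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
NS = {"cml": CML_NS}

# Illustrative fragment only: the ids and the "mcm:kinetics" convention
# value are hypothetical, not taken from any real MCM or IUPAC document.
FRAGMENT = f"""
<reaction xmlns="{CML_NS}" id="r1" convention="mcm:kinetics">
  <reactantList>
    <reactant><molecule id="CH4"/></reactant>
    <reactant><molecule id="OH"/></reactant>
  </reactantList>
  <productList>
    <product><molecule id="CH3"/></product>
    <product><molecule id="H2O"/></product>
  </productList>
</reaction>
"""


def species_ids(reaction, list_tag):
    """Return the molecule ids found anywhere under the named list element."""
    return [m.get("id") for m in reaction.findall(f"cml:{list_tag}//cml:molecule", NS)]


def check_reaction(reaction):
    """Accept anything by default; be strict only for our own convention."""
    if reaction.get("convention") != "mcm:kinetics":
        return True  # no known convention declared, so stay lax
    # Under the hypothetical convention, require both reactants and products.
    return bool(species_ids(reaction, "reactantList")) and bool(
        species_ids(reaction, "productList")
    )


reaction = ET.fromstring(FRAGMENT)
print(species_ids(reaction, "reactantList"), check_reaction(reaction))
```

The point of the design is that a reader which does not recognise the convention can still process the document as plain CML, while one that does can rely on the tighter structure.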
It’s worth noting here that there is a difference between the communication format and the underlying model. The MCM and IUPAC needn’t agree on how they store and serve the information, provided the format and communication medium are agreed (what software engineers would call the interface). CML would be a good fit for communicating information from IUPAC to the MCM. It could also be used as the underlying data model of the IUPAC database, where it would have the advantage of being similar to what they have now. However, I believe CML alone would be quite a restrictive model for the MCM, for here we have an example of where document-centric information falls down.
One could describe the overall structure of the MCM as a forest of trees starting from a relatively small set of root species (those species thought to be representative of primary VOC emissions), branching rapidly on each reaction but also overlapping as intermediate species are formed from different pathways. Finally everything ends up as CO2 + water.
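The overlapping-trees picture can be sketched as a small directed graph in Python. The species names and links below are a toy example I have invented, not real MCM chemistry; the point is only that the sets of species reachable from two roots share intermediates, so the "trees" cannot be cut apart cleanly.

```python
from collections import deque

# Toy mechanism: each species maps to the species it can form.
# All names and links are illustrative, not taken from the MCM.
PRODUCTS = {
    "VOC_A": ["INT_1", "INT_2"],
    "VOC_B": ["INT_2", "INT_3"],
    "INT_1": ["CO2", "H2O"],
    "INT_2": ["CO2", "H2O"],
    "INT_3": ["INT_2"],
}


def reachable(root):
    """Breadth-first walk returning every species downstream of root."""
    seen = {root}
    queue = deque([root])
    while queue:
        species = queue.popleft()
        for product in PRODUCTS.get(species, []):
            if product not in seen:
                seen.add(product)
                queue.append(product)
    return seen


# Species that appear in the degradation schemes of both roots.
shared = reachable("VOC_A") & reachable("VOC_B")
print(sorted(shared))
```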
There are, therefore, only two alternatives if you want to systematically divide the MCM into discrete documents (well, three if you include one document per reaction). Either put all 4500 species and 12600 reactions in one document, or have 4500 documents, one for each reactant. Although the former is technically feasible, in practice people want to browse and extract subsets of the mechanism. A relational model is well suited to this, as it doesn’t force document boundaries on the data. This is what is done at the moment with the MCM website’s MySQL backend.
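As a rough illustration of why the relational model suits subset extraction, here is a sketch using Python's built-in sqlite3 module. The single `reaction` table is a deliberately simplified, hypothetical schema (one reactant and one product per row) and has nothing to do with the actual layout of the MCM's MySQL database; the data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified schema: one reactant and one product per row.
    CREATE TABLE reaction (id INTEGER PRIMARY KEY, reactant TEXT, product TEXT);
    INSERT INTO reaction (reactant, product) VALUES
      ('VOC_A', 'INT_1'), ('VOC_A', 'INT_2'),
      ('VOC_B', 'INT_2'), ('VOC_B', 'INT_3'),
      ('INT_1', 'CO2'),   ('INT_2', 'CO2'),
      ('INT_3', 'INT_2');
    """
)

# Recursive query: collect everything downstream of a root species, then
# return every reaction whose reactant is in that set.
SUBSET_SQL = """
WITH RECURSIVE downstream(species) AS (
    SELECT ?
    UNION
    SELECT r.product
    FROM reaction AS r JOIN downstream AS d ON r.reactant = d.species
)
SELECT r.id, r.reactant, r.product
FROM reaction AS r
WHERE r.reactant IN (SELECT species FROM downstream)
ORDER BY r.id
"""


def subset(root):
    """Return the reactions making up the degradation scheme of root."""
    return conn.execute(SUBSET_SQL, (root,)).fetchall()


for row in subset("VOC_A"):
    print(row)
```

The same query works for any root, and the intermediates shared between roots are stored only once, which is exactly what a one-document-per-species split struggles to offer.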
However, at the BADC there is a consensus that dataset-specific databases are bad news for data curation. Unlike a document or file, you can’t pass a database around (without SQL compatibility headaches), they pose subtle citation challenges, and each instance has a different structure, requiring its own expert to maintain it.
Therefore, for deposition at the BADC at least, we will need a document-centric format for the entire MCM, and the challenge is to try and keep all the nice features you get with a database. Maybe this could be achieved with one big CML document and XQuery? I know next to nothing about the technology. I would favour taking the 4500 CML documents and annotating them with RDF.
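To give a feel for what the RDF annotation might buy us, here is a toy sketch in plain Python, with tuples standing in for RDF triples and a tiny pattern matcher standing in for a real RDF query language. The file names and the `ex:formsProduct` predicate are hypothetical; a real solution would use a proper RDF library and an agreed vocabulary.

```python
# Each triple is (subject, predicate, object). The document names and the
# "ex:formsProduct" predicate are invented for illustration.
TRIPLES = [
    ("species/VOC_A.cml", "ex:formsProduct", "species/INT_2.cml"),
    ("species/VOC_B.cml", "ex:formsProduct", "species/INT_2.cml"),
    ("species/INT_2.cml", "ex:formsProduct", "species/CO2.cml"),
]


def match(pattern, triples=TRIPLES):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


# Which documents describe species that lead directly to INT_2?
precursors = sorted(
    subject for subject, _, _ in match((None, "ex:formsProduct", "species/INT_2.cml"))
)
print(precursors)
```

The links live outside the individual documents, so each CML file stays a self-contained, citable unit while the cross-document relationships can still be queried.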
About this entry
You’re currently reading “ CML and document vs. database ,” an entry on lirico
- Published: 10.31.06 / 10pm
- Category: Uncategorized, Cheminformatics, MCM, RDF