IUPAC workshop on XML and IChI

I. David Brown idbrown at mcmail.cis.mcmaster.ca
Wed Nov 19 16:02:24 GMT 2003


Dear Colleague,

	I have just returned from a workshop dealing with chemistry XML
and the IUPAC Chemical Identifier (IChI).  I have appended below a report
on those aspects of the workshop that are likely to be of interest to
members of IUCr committees.

	I apologize to those of you who receive more than one copy of this
email.  I am circulating it to four groups who might be interested and
several of you will be members of more than one of these.

	I will be following up this report with further suggestions for
discussion by the coreCIFchem and phaseID groups, but those of you who
belong to other groups may find this report interesting.

			Best wishes

				David

Keep scrolling - More below
*****************************************************
Dr.I.David Brown,  Professor Emeritus
Brockhouse Institute for Materials Research,
McMaster University, Hamilton, Ontario, Canada
Tel: 1-(905)-525-9140 ext 24710
Fax: 1-(905)-521-2773
idbrown at mcmaster.ca
*****************************************************

Report on the workshop on Chemical XML and the IUPAC Chemical
Identifier (IChI) held at NIST 12-14 Nov. 2003.

I.D.Brown

Summary.
--------
There is currently no organization coordinating the XML
ontologies being developed for the various branches of chemistry,
even though several chemical specialties are developing detailed
ontologies in their own disciplines.  However, a project to
develop an IUPAC Chemical Identifier (IChI) in the form of an
electronic character string that uniquely identifies a compound,
is well advanced and shows promise as a search key.

Introduction
------------
IUPAC has appointed a Committee on Printed and Electronic
Publication (CPEP) which in turn has a subcommittee on Electronic
Data Standards (EDS).  The latter has two projects that were the
subject of a workshop held at NIST, Gaithersburg in November
2003.  The first is the development of a Chemical XML dictionary
and the second the development of an IUPAC Chemical Identifier
(IChI).  This document reports on this workshop for the benefit
of interested groups in the International Union of
Crystallography.

Chemical XML
------------
Although the EDS would appear to be the IUPAC equivalent of
COMCIFS, the two committees have very different mandates.  The
primary role of EDS is to define XML schema or dictionaries that
would allow IUPAC to produce web versions of its Gold Book
(definitions of chemical terms) and Green Book (mathematical
relations used in analytical chemistry).  This is equivalent to
producing web versions of International Tables for
Crystallography.  EDS is therefore interested in reproducing
text, mathematical equations and chemical structure diagrams on
the web using XML versions of the printed Gold and Green Books.
EDS is explicitly not interested in recommending or coordinating
(or believes it does not have the authority to recommend or
coordinate) electronic ontologies for chemistry as a whole,
including defining such items as
chemical formulae that might be expected to appear in many
different chemistry XML schema.  In its more limited role, EDS is
proposing to express mathematical formulae using the existing
MathML (a general mark-up language prepared by mathematicians),
units using the similarly general UnitsML, and chemical diagrams
in a form that would allow them to be printed using SVG.

Even though the scope of EDS is limited, the workshop received
reports from several groups developing ontologies for specialist
branches of chemistry (including the report on CIF that I gave).
There was a general appreciation that the most important task is
to define the ontologies (the contents of the dictionaries) and
that one should not worry too much about the language in which
they are expressed.  XML is the current flavour of the year, but
XML might well be superseded by a different (better?) system in
five or ten years' time.  A well designed ontology could easily
migrate from one delivery system to another.

Among the 8 to 10 groups working on specialized chemical
ontologies in the form of XML schema, ThermoML and SpectroML stood
out as being well advanced.  Their schema (schemae?) are more
directly comparable with CIF, in that they are designed to
capture the results of experimental measurements in their
respective disciplines.  ThermoML has been adopted by five of the
leading thermodynamic journals (representing three different
publishers), but rather than requiring authors to submit papers
in ThermoML, the journals will continue to accept papers in
traditional formats (90% are submitted in MSWord).  The mark-up
into XML will be carried out by the publishers and XML versions
of the results will be submitted to a thermodynamic database.
Another group is producing a schema (a schemum?) for analytical
measurements (AnIML) and a group in Prague is working on a
Mark-up Language for chemical structures based on Graph Theory (GTML).
Most of these projects are closely related to particular
experimental techniques where the concepts are specialized.
There is no group, either existing or proposed, that is charged
with coordinating these efforts to ensure that the definitions do
not conflict.

From the crystallographer's point of view the most interesting
project is Peter Murray-Rust's Chemical Mark-up Language (CML)
which aims to capture the chemical structures that are at the
heart of any description of chemistry, specifically organic
chemistry.  Peter has been working on this project for many years
and his schema are well thought out and tested using software he
has written.  A number of publishers and the European Patent
Office have expressed interest in CML, and Peter has been working
closely with the chemical modelling community to develop a
version of CML for them.  The schema in CML are very general,
specifying only that molecules are composed of atoms which are
linked by bonds, but molecules, atoms and bonds are not defined,
leaving it to the user to decide which atoms are bonded and
therefore which atoms constitute a molecule.  One can see the
reasons for such an open-ended approach, but the philosophy is
very different from that adopted by CIF.  CML is not likely to
give us much guidance as we extend CIF to include chemical (as
opposed to crystallographic) concepts.  However, Peter has
written programs that will convert DDL1 CIF to cifML and vice
versa, cifML being a version of XML that explicitly employs CIF
datanames and ontologies.

One attractive feature of XML that we might consider incorporating into
CIF is the use of namespaces to avoid name collisions.  Two schema
(dictionaries), foo and fee, that both use the name 'bond_order', though
with different definitions, would give rise to items with names like
foo:bond_order and fee:bond_order where 'foo' and 'fee' are equivalenced
to web URLs where the respective schema can be found.  This allows two XML
files based on different schema to be concatenated, but it does not
provide precise definitions for the values of 'bond_order' in the
different schema.  They may be defined the same way or they may not.  A
search across databases would retrieve both kinds of bond_orders, but a
computer would have to assume that the quantities are unrelated.  The
resulting different dialects of chemistry would make it difficult to
synthesize information across different databases.
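
The behaviour described above can be illustrated with a minimal Python
sketch using the standard xml.etree.ElementTree module.  The namespace
URIs and the values are invented for the illustration; 'foo' and 'fee'
are the placeholder schema names used in the paragraph above and do not
refer to real dictionaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs for the placeholder schema 'foo' and 'fee'.
FOO = "http://example.org/schema/foo"
FEE = "http://example.org/schema/fee"

# Two fragments based on different schema, concatenated into one file.
doc = f"""
<record xmlns:foo="{FOO}" xmlns:fee="{FEE}">
  <foo:bond_order>2</foo:bond_order>
  <fee:bond_order>1.5</fee:bond_order>
</record>
""".strip()

root = ET.fromstring(doc)

# ElementTree expands each prefix to '{uri}localname', so the two items
# keep distinct names even though both are called 'bond_order'.
foo_values = [e.text for e in root.iter(f"{{{FOO}}}bond_order")]
fee_values = [e.text for e in root.iter(f"{{{FEE}}}bond_order")]

# A search on the local name alone retrieves both kinds of bond order,
# with nothing to say whether their definitions agree.
all_bond_orders = [
    e.text for e in root.iter() if e.tag.split("}")[-1] == "bond_order"
]
```

The namespaces keep the two items apart, but deciding whether the two
values are comparable still requires reading both definitions.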

When I asked the EDS where one could find IUPAC recommendations
for an electronic coding of widely used chemical concepts such as
the chemical formulae, everybody in the room started pointing to
someone else (the scene was reminiscent of Alice in Wonderland!),
but the eventual consensus was that IUPAC has no mechanism for
making recommendations at this level of detail, because if it
did, the recommendations would probably be ignored by the
chemical community.  This may have been the experience with IUPAC
recommendations in the past, but a consortium of groups devising
chemMLs would have a strong motivation to adopt compatible
definitions for common chemical concepts.  At present it would
appear that, apart from the sum_chemical_formula for which rules
already exist, it is unlikely that the various chemMLs will adopt
compatible definitions of key chemical concepts.  The feeling
among the members of EDS is that it will be time enough to
resolve these conflicts when they arise!

IChI (IUPAC Chemical Identifier)
--------------------------------
This inability to coordinate ontologies is perhaps why EDS set up
the IUPAC Chemical Identifier (IChI) project which aims to
recommend an identifier that would be able to locate the same
compound in different databases.  This project was the subject of
the second half of the workshop.  When the IChI group was set up,
they approached the IUCr Nomenclature Commission for advice on
how to identify different crystalline phases.  The chair of the
Commission at the time, S.C.Abrahams, asked me to set up a working
group to make recommendations that could be passed back to IChI.
Our working group, acting independently of IChI, has discussed a
number of possibilities which, fortunately, should be easy to
incorporate into the recommended IChI identifier.

A proposal for the first version of the identifier covering
mostly organic compounds is nearly ready, and the IChI working
group has given thought to how a later version might cover a
wider range of compounds.  The identifier is built up of a number
of layers.  The top (first) layer contains only the chemical
formula and will, for many compounds, be sufficient to identify
the compound uniquely.  The second layer includes the chemical
structure, i.e. a normalized description of the connectivity.
The contents of this layer are determined by computer algorithms
from a connectivity diagram supplied by the author.  Insofar as
different authors may disagree on which atoms are bonded, the
same compound may end up with different identifiers, but this
layer of the identifier is made as robust as possible by ignoring
hydrogen atoms, bond orders and charge assignments.  Hydrogen
atoms are introduced at the third layer which can be ignored if
one is not interested in a particular tautomer.  Still lower
levels contain information about stereocenters and isotopes, and
are included only if required.  Searches can be deep, returning
only compounds with the same stereochemistry and isotopic
content, or they can be restricted to higher levels if tautomers,
stereochemistry and isotopes are not of interest.  Identification
of the crystallographic phase by including, e.g., the space group
number, can easily be added as yet a further layer.
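
The layered search described above can be sketched in Python as
follows.  This is only an illustration of the idea: the layer names
follow this report, but the class, the field layout and the example
values are hypothetical and are not the actual IChI string syntax.

```python
from dataclasses import dataclass
from typing import Optional

# Layer order as described in the report, top layer first.
LAYERS = ("formula", "connectivity", "hydrogens", "stereo", "isotopes", "phase")


@dataclass(frozen=True)
class LayeredId:
    """Hypothetical container for a layered identifier (not IChI syntax)."""
    formula: str
    connectivity: Optional[str] = None
    hydrogens: Optional[str] = None
    stereo: Optional[str] = None
    isotopes: Optional[str] = None
    phase: Optional[str] = None  # e.g. a space group number

    def key(self, depth: int) -> tuple:
        """Return the first `depth` layers as a tuple usable as a search key."""
        return tuple(getattr(self, name) for name in LAYERS[:depth])


def matches(a: LayeredId, b: LayeredId, depth: int) -> bool:
    """Compare two identifiers down to the given layer depth only."""
    return a.key(depth) == b.key(depth)


# Placeholder layer strings for two enantiomers of the same compound.
l_form = LayeredId("C3H7NO2", "conn-A", "h-A", stereo="L")
d_form = LayeredId("C3H7NO2", "conn-A", "h-A", stereo="D")

shallow = matches(l_form, d_form, depth=3)  # stereochemistry ignored
deep = matches(l_form, d_form, depth=4)     # stereochemistry compared
```

A shallow search (depth 3) treats the two entries as the same compound,
while a deep search (depth 4) distinguishes them.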

Version 1 of IChI has impressed those who have been testing it.
It works well, as might be expected, for organic compounds, but
also for many inorganic and metal-organic compounds if the bonds
to the metal atoms (or cations) are not included in the second
layer.  They can be introduced in a lower layer if needed, e.g.,
to distinguish between isomers with different metal coordination.
At present the identifier is not designed to describe polymeric
structures, clusters or disordered structures but the IChI group
is interested in including these features in future versions.
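
As a rough illustration of leaving the bonds to metal atoms out of the
second layer, the following Python sketch separates a bond list into
those kept for the connectivity layer and those held back for a lower
layer.  The function name, the short list of metal symbols and the
example fragment are assumptions made for the illustration; they are
not taken from the IChI proposal.

```python
# A small illustrative set of metal symbols (an assumption, not an
# IChI-defined list).
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Co", "Ni", "Cu", "Zn", "Pt"}


def split_bonds(atoms, bonds):
    """Split bonds into (core, metal) lists.

    atoms: dict mapping atom label -> element symbol
    bonds: iterable of (label, label) pairs
    Bonds involving any metal atom go to the second list.
    """
    core, metal = [], []
    for a, b in bonds:
        if atoms[a] in METALS or atoms[b] in METALS:
            metal.append((a, b))
        else:
            core.append((a, b))
    return core, metal


# Hypothetical fragment: an acetate group bound to copper through O1.
atoms = {"Cu1": "Cu", "O1": "O", "O2": "O", "C1": "C", "C2": "C"}
bonds = [("Cu1", "O1"), ("O1", "C1"), ("C1", "O2"), ("C1", "C2")]

core_bonds, metal_bonds = split_bonds(atoms, bonds)
```

Here core_bonds would feed the connectivity layer, and metal_bonds
would be introduced only when the metal coordination matters.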

We will probably wish to incorporate IChI into CIF when the final
standard is approved.

I.D.Brown
2003-11-19



