Approval of the CIF methods Dictionary Definition Language
David Brown
idbrown at mcmaster.ca
Wed Nov 5 19:54:23 GMT 2008
I am sending this email to a couple of COMCIFS discussion lists and I
apologize if you get it twice. Its purpose is to update you on
developments in two areas and to give notice that I will be looking for
advice on matters mentioned below:
(1) the COMCIFS approval of the alpha version of the methods Dictionary
Definition Language (DDLm)
(2) the request by COMCIFS to revive the discussion on providing a CIF
description of the molecular, as opposed to the crystallographic,
structure of a crystal, a topic that was discussed earlier but put on
hold in 2005 pending the adoption of DDLm.
In the first part of this email I describe the problems of defining a
molecular structure in CIF. In the second part I analyze some of the
issues that DDLm raises for the structure of CIF dictionaries. This
email is for information only; you need take no action.
David Brown
++++++++++++++++++++++++++++++++++++++++++++++++++++++
A Chemical (Molecular) Description of a Crystal Structure
------------------------------------------------------------------------
A simple innocent question sometimes raises profound issues. A number
of years ago COMCIFS was requested by CCDC to provide a dataname and
definition for Z', the number of crystallographically independent
molecules in a crystal. Before we could provide this we had to have a
definition of a molecule. Diffraction experiments tell us about
electron density, but not about molecules. We may see the shape of a
molecule in the electron density but not everyone sees the same shape.
Although we can identify atoms in the electron density and we can
measure the distances between them, we don't all agree on which are
bonds, and even less do we agree on which atoms constitute a molecule.
As a result, a number of years ago COMCIFS established a special group
(CoreCIFchem) to recommend how we should define a molecule in the CIF
dictionary and how best to link the atoms in the molecule with the
corresponding atoms in the asymmetric unit.
We quickly discovered that the problem was far from trivial, and in 2004
we turned for help to Peter Murray-Rust who has spent much of his career
thinking about this problem. I quote here his response to our trial set
of CIF definitions.
I think you are addressing important points. There are several of them
and they are complex. They have not all been fully or even partially
solved elsewhere. The following points are taken from my 15 years'
writing software for CIF and related systems and I hope they will be
taken positively.
1. IMO a system has to be implementable. I adhere to the IETF motto
"rough consensus and running code". I tried hard in the mid-90's and
later at Syd's invite to implement the DDL system. It is far more
difficult to implement than it appears on reading. In any such system
there are lots of nuances which can be surprisingly difficult. I have
now formally implemented a complete CIF DOM using the SAX/DOM/Infoset
approach, but without dictionary control. Similarly I spent some time
looking at mmCIF and again found that the task was large. mmCIF has been
implemented, but has a considerable amount of dedicated resource. So do
you have similar resource?
So if you wish software to *read and understand* this design it will
need a lot of effort. Without prototyping the system you don't know
whether the design is complete or self-consistent. Note the questions I
have asked recently on COMCIFS - there are still several areas that I
have no definitive answers to and I am sure this is likely to happen
here. So I'm simply asking you to be aware of the size of the problem.
IMO this was the problem with MIF [the Molecular Information File
described in International Tables G] - it was a reasonable spec but
no-one really implemented it.
Also who or what will generate [the files giving the descriptions of
molecules and their relationship to the crystal structure]? I think the
analysis [of the crystal structure] will require quite a bit of
heuristic software, so presumably an author has to edit [the molecular
description]. At the least they will have to have an editing tool to
keep the referential integrity for the pointers [linking the molecular
atoms to the crystal atoms].
2. There are several concepts here that we are tackling in CML
[chemical mark-up language] and have not fully solved. I distinguish
"design" where there is a formal spec and "implement" where it is
actually shown to work or not. They include:
- multiple conformations (designed but not implemented in software)
- unique atom ids (designed and implemented, but possibly fragile)
- levels of indirection (pointers) (designed but not widely implemented)
- role of atomSets [collections of atoms forming all or a distinctive part
of a molecule] (designed and partially implemented)
- polymeric systems (not designed; far too difficult at present)
It is possible to come up with reasonable solutions but it is often
unclear how well these will stand up to the variety of examples that
they will be exposed to when the system is released. It almost certainly
will require a redesign. That has happened for CML and I predict it will
occur here.
As an example I am on an IUPAC group tackling stereochemical
representation. It is a tough problem. If/when some consensus is
reached all of that would ideally have to be implemented in CIF solely
to help describe what the substance actually is.
3. My suggestion would be to be somewhat less ambitious and to engage
the community in a structured program:
- concentrate on molecular crystals (they are easier and the
informatics/representation is better understood)
- get the authors actually to submit the chemical structures (if this
doesn't happen then the design won't be tested)
- anticipate the problems of mapping crystallographic atoms
onto chemical connection tables. These are primarily:
- symmetry
- disorder
- unreported atoms
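The pointer problem raised in this reply can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the record layout (a site label, an occupancy, a symmetry-operation string on each pointer) and every name in it are hypothetical and are not taken from any CIF or CML definition. It checks a chemical atom list against the sites of the asymmetric unit and sorts the atoms into the three problem classes listed above, plus pointers that refer to no site at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrystalSite:
    label: str               # site label in the asymmetric unit (hypothetical layout)
    occupancy: float = 1.0   # occupancy below 1.0 is treated here as disorder


@dataclass(frozen=True)
class MoleculeAtom:
    atom_id: str                       # id in the chemical connection table
    element: str
    site_label: Optional[str] = None   # pointer to a crystal site; None = unreported atom
    symop: str = "x,y,z"               # symmetry operation applied to the site


def check_mapping(sites, atoms):
    """Classify molecule atoms by how they map onto crystallographic sites."""
    known = {site.label: site for site in sites}
    report = {"dangling": [], "unreported": [], "disordered": [], "symmetry_generated": []}
    for atom in atoms:
        if atom.site_label is None:
            report["unreported"].append(atom.atom_id)
        elif atom.site_label not in known:
            report["dangling"].append(atom.atom_id)   # broken referential integrity
        else:
            if known[atom.site_label].occupancy < 1.0:
                report["disordered"].append(atom.atom_id)
            if atom.symop != "x,y,z":
                report["symmetry_generated"].append(atom.atom_id)
    return report


sites = [CrystalSite("O1"), CrystalSite("C1", occupancy=0.5)]
atoms = [
    MoleculeAtom("a1", "O", "O1"),
    MoleculeAtom("a2", "C", "C1"),
    MoleculeAtom("a3", "O", "O1", symop="-x,-y,-z"),
    MoleculeAtom("a4", "H"),          # present in the molecule, absent from the crystal data
    MoleculeAtom("a5", "C", "C9"),    # points at a site that does not exist
]
report = check_mapping(sites, atoms)
```

A check of this kind only detects broken pointers; deciding which atoms belong to which molecule in the first place would still need the heuristic analysis mentioned in the reply.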
Around the time when we received this advice from Peter we decided to
defer any further discussion pending the adoption of DDLm as we thought
that dictionaries written in DDLm might simplify our task. That was in
January 2005. This project has since been dormant, but as an alpha
version of DDLm has now been approved by COMCIFS the time has come to
dust off this project. In the light of Peter's comments we might wish
to consider whether the definition of a molecule is even an idea worth
pursuing.
Development of DDLm (Dictionary Definition Language (methods))
-----------------------------------------------------------------------------------
DDLm is a language for writing CIF dictionaries. It introduces a number
of new features relevant to this discussion. The first and most
important is that DDLm is backwardly compatible with all existing CIFs.
Programs using CIF dictionaries written in DDLm will be able to read all
existing CIFs (small cell and macromolecular). Further, they will be
able to interpret them using the advanced DDLm features. These include
'methods', i.e., executable algorithms embedded in the CIF dictionary
that define how the value of an item can be calculated from other items
that may be present in a CIF. DDLm also makes it easy to generate a
virtual run-time dictionary by combining all or parts of a number of
existing dictionaries.
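To illustrate these two ideas (a method attached to a definition, and a run-time dictionary assembled from parts of several dictionaries), here is a minimal Python sketch. It is not DDLm syntax and does not reproduce any existing dictionary: the `Definition` record, the item names and the helper functions are all assumptions made for the example. The one method shown uses the standard relation B = 8 pi^2 U between the isotropic displacement parameters.

```python
import math
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class Definition:
    name: str
    description: str
    # None marks a basic item; otherwise a callable that computes the value
    # from other items present in the data.
    method: Optional[Callable[[Mapping[str, float]], float]] = None


def u_iso_from_b_iso(data):
    """U_iso = B_iso / (8 * pi^2)."""
    return data["atom_site.B_iso"] / (8 * math.pi ** 2)


# Two hypothetical source dictionaries.
core = {
    "atom_site.B_iso": Definition("atom_site.B_iso", "Isotropic B, in A^2"),
}
extension = {
    "atom_site.U_iso": Definition("atom_site.U_iso", "Isotropic U, in A^2", u_iso_from_b_iso),
    "atom_site.note": Definition("atom_site.note", "Free-text remark"),
}


def build_runtime_dictionary(parts):
    """Merge (dictionary, wanted_names) pairs; wanted_names=None takes everything.

    Later parts override earlier ones when names collide.
    """
    merged = {}
    for dictionary, wanted in parts:
        for name, definition in dictionary.items():
            if wanted is None or name in wanted:
                merged[name] = definition
    return merged


def evaluate(dictionary, name, data):
    """Return the stored value if present, else run the item's method."""
    if name in data:
        return data[name]
    method = dictionary[name].method
    if method is None:
        raise KeyError(f"{name} is a basic item and is not present in the data")
    return method(data)


runtime = build_runtime_dictionary([(core, None), (extension, {"atom_site.U_iso"})])
u_iso = evaluate(runtime, "atom_site.U_iso", {"atom_site.B_iso": 3.0})
```

Here the whole of the first dictionary and a single definition from the second are combined, and the derived value is produced only when it is requested.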
DDLm was prepared by Syd Hall and Nick Spadaccini, who demonstrated a
proof-of-concept version of a core dictionary at the Glasgow Congress in
1999. At the Florence Congress in 2005, COMCIFS agreed to evaluate
DDLm and see if it would be suitable for the crystallographic
community. James Hester has carried out this evaluation and at the
Osaka Congress in 2008 he recommended that with some minor modification
it should be adopted. COMCIFS duly approved this as the alpha release
of DDLm. Details are available on the IUCr web site.
As part of Nick and Syd's proof-of-concept demonstration in 1999, they
prepared a small CIF dictionary which is serving as the basis of the
DDLm coreCIF dictionary that will be submitted to COMCIFS for adoption.
In the course of this development a number of philosophical questions
have arisen. These were discussed by COMCIFS in Osaka and the following
principles were agreed on:
1. Items will be classified as basic or derived. Basic items are those
that are experimentally measured (e.g., measured density) or assigned
(e.g., space group). These are items that cannot be derived from other
CIF items. Derived items are those that can be calculated if the
appropriate basic items are present.
2. For derived items, the method (i.e., the equation used to calculate
the value of the item) given in the dictionary will be the primary
definition and will take precedence over the text definition. Therefore
only one method may be specified and this takes precedence over any
alternative, but equivalent, ways of calculating the value of the item.
This requires that there be a consensus on the best method of
calculation to use, for example: which coordinate system should be
used in the calculation?
3. A consequence of (2) is that the derived items must follow a
hierarchy, with the basic items forming the foundation and the derived
items defined from various basic items following a unique route.
Requesting an item may initiate a cascade of calculations, e.g.,
calculated density <--- cell volume <--- cell constants. Thus the
request for a particular item may populate the CIF with various
intermediate items. These intermediate items may, for computational
convenience, duplicate information already appearing in the CIF. In
previous CIF dictionaries this has been discouraged by providing only
one way of presenting a given piece of information.
What are the implications of this development?
The current versions of CIF are like a language that contains only nouns
and adjectives. It is great for describing a static object like a
crystal structure. It gives a description that can be published,
archived, retrieved and examined. Adding Methods is like adding verbs
to this language. With them, a CIF can grow, triggered into adding
derived items by a simple request. The relation between user and CIF
becomes interactive. In this scenario how does one ensure that the
derived items are all updated when a basic item is changed? Changing
the lattice parameters does not automatically update the cell volume,
and calculating the density will not necessarily force an update of the
cell volume if the CIF already contains an earlier (obsolete) value.
Murphy's Law (in its correct original version) states that 'if something
can happen, sooner or later it will happen'. We need to anticipate the
different ways in which a DDLm CIF might be used.
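One possible way of handling the update problem is sketched below in Python. It is an illustration only, not a description of how DDLm or any CIF software behaves: the class, the short item names and the invalidation strategy are all assumptions. The sketch implements the cascade calculated density <--- cell volume <--- cell constants and marks every stored derived item as stale whenever an item it depends on is changed. The formulas are the usual ones: the general triclinic cell volume, and density = Z M / (N_A V) with V in cubic angstroms.

```python
import math

# N_A * 1e-24: converts (g/mol) per cubic angstrom into g/cm^3.
AVOGADRO_PER_A3 = 0.602214076


def cell_volume(d):
    ca, cb, cg = (math.cos(math.radians(d[k])) for k in ("alpha", "beta", "gamma"))
    return d["a"] * d["b"] * d["c"] * math.sqrt(
        1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg
    )


def density(d):
    return d["Z"] * d["formula_weight"] / (AVOGADRO_PER_A3 * d["volume"])


# Derived item -> (items it is calculated from, method). Names are hypothetical.
METHODS = {
    "volume": (("a", "b", "c", "alpha", "beta", "gamma"), cell_volume),
    "density": (("Z", "formula_weight", "volume"), density),
}


class DataBlock:
    def __init__(self, items):
        self.values = dict(items)
        self.stale = set()

    def set(self, name, value):
        self.values[name] = value
        self._invalidate(name)

    def _invalidate(self, changed):
        # Walk down the hierarchy, flagging every stored item that depends on `changed`.
        for item, (deps, _method) in METHODS.items():
            if changed in deps:
                if item in self.values:
                    self.stale.add(item)
                self._invalidate(item)

    def get(self, name):
        # Recalculate a derived item only if it is missing or flagged as stale.
        if name in METHODS and (name not in self.values or name in self.stale):
            deps, method = METHODS[name]
            self.values[name] = method({dep: self.get(dep) for dep in deps})
            self.stale.discard(name)
        return self.values[name]


block = DataBlock({"a": 10.0, "b": 10.0, "c": 10.0,
                   "alpha": 90.0, "beta": 90.0, "gamma": 90.0,
                   "Z": 4, "formula_weight": 100.0})
rho_before = block.get("density")   # also stores the intermediate cell volume
block.set("a", 20.0)                # flags the stored volume and density as stale
rho_after = block.get("density")    # forces the volume to be recalculated first
```

Note that this only catches changes made through `set`. A derived value that was already obsolete when the CIF was read in carries no such flag, so it would still be trusted, which is the situation described above.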
This email is intended to alert you to some of the problems that lie ahead.
I intend to use these discussion lists as a sounding board for specific
problems as I work on the development of the DDLm coreCIF dictionary. I
look forward to receiving your comments on the points I raise,
particularly warnings if we seem to be straying into dangerous directions.
Watch this space.
David Brown