Approval of the CIF methods Dictionary Definition Language

David Brown idbrown at mcmaster.ca
Wed Nov 5 19:54:23 GMT 2008


I am sending this email to a couple of COMCIFS discussion lists and I 
apologize if you get it twice.  Its purpose is to update you on 
developments in two areas and to give notice that I will be looking for 
advice on matters mentioned below:

(1) the COMCIFS approval of the alpha version of the methods Dictionary 
Definition Language (DDLm)

(2) the request by COMCIFS to revive the discussion on providing a CIF 
description of the molecular, as opposed to the crystallographic, 
structure of a crystal, a topic that was discussed earlier but put on 
hold in 2005 pending the adoption of DDLm. 

In the first part of this email I describe the problems of defining a 
molecular structure in CIF.  In the second part I analyze some of the 
issues that DDLm raises for the structure of CIF dictionaries.  This 
email is for information only; you need take no action.

David Brown

++++++++++++++++++++++++++++++++++++++++++++++++++++++

A Chemical (Molecular) Description of a Crystal Structure
------------------------------------------------------------------------
A simple innocent question sometimes raises profound issues.  A number 
of years ago COMCIFS was requested by CCDC to provide a dataname and 
definition for Z', the number of crystallographically independent 
molecules in a crystal.  Before we could provide this we had to have a 
definition of a molecule.  Diffraction experiments tell us about 
electron density, but not about molecules.  We may see the shape of a 
molecule in the electron density but not everyone sees the same shape.  
Although we can identify atoms in the electron density and we can 
measure the distances between them, we don't all agree on which are 
bonds, and even less do we agree on which atoms constitute a molecule.  
As a result, a number of years ago COMCIFS established a special group 
(CoreCIFchem) to recommend how we should define a molecule in the CIF 
dictionary and how best to link the atoms in the molecule with the 
corresponding atoms in the asymmetric unit.

We quickly discovered that the problem was far from trivial, and in 2004 
we turned for help to Peter Murray-Rust who has spent much of his career 
thinking about this problem.  I quote here his response to our trial set 
of CIF definitions.

I think you are addressing important points. There are several of them 
and they are complex. They have not all been fully or even partially 
solved elsewhere. The following points are taken from my 15 years' 
writing software for CIF and related systems and I hope they will be 
taken positively.

1. IMO a system has to be implementable. I adhere to the IETF motto 
"rough consensus and running code". I tried hard in the mid-90's and 
later at Syd's invite to implement the DDL system. It is far more 
difficult to implement than it appears on reading. In any such system 
there are lots of nuances which can be surprisingly difficult. I have 
now formally implemented a complete CIF DOM using the SAX/DOM/Infoset 
approach, but without dictionary control. Similarly I spent some time 
looking at mmCIF and again found that the task was large. mmCIF has been 
implemented, but has a considerable amount of dedicated resource. So do 
you have similar resource?

So if you wish software to *read and understand* this design it will 
need a lot of effort. Without prototyping the system you don't know 
whether the design is complete or self-consistent. Note the questions I 
have asked recently on COMCIFs - there are still several areas that I 
have no definitive answers to and I am sure this is likely to happen 
here. So I'm simply asking you to be aware of the size of the problem. 
IMO this was the problem with MIF [the Molecular Information File 
described in International Tables G] - it was a reasonable spec but 
no-one really implemented it.

Also who or what will generate [the files giving the descriptions of 
molecules and their relationship to the crystal structure]? I think the 
analysis [of the crystal structure] will require quite a bit of 
heuristic software, so presumably an author has to edit [the molecular 
description]. At  the least they will have to have an editing tool to 
keep the referential integrity for the pointers [linking the molecular 
atoms to the crystal atoms].

2. There are several concepts here that we are tackling in CML 
[chemical mark-up language] and have not fully solved. I distinguish 
"design" where there is a formal spec and "implement" where it is 
actually shown to work or not. They include:
  - multiple conformations (designed but not implemented in software)
  - unique atom ids (designed and implemented, but possibly fragile)
  - levels of indirection (pointers) (designed but not widely implemented)
  - role of atomSets [collections of atoms forming all or a distinctive part
         of a molecule] (designed and partially implemented)
  - polymeric systems (not designed; far too difficult at present)

It is possible to come up with reasonable solutions but it is often 
unclear how well these will stand up to the variety of examples that 
the system will be exposed to when it is released. It almost certainly 
will require a redesign. That has happened for CML and I predict it will 
occur here.

As an example I am on an IUPAC group tackling stereochemical 
representation. It is a tough problem.  If/when some consensus is 
reached all of that would ideally have to be implemented in CIF solely 
to help describe what the substance actually is.

3. My suggestion would be to be somewhat less ambitious and to engage 
the community in a structured program
   - concentrate on molecular crystals (they are easier and the
          informatics/representation is better understood)
   - get the authors actually to submit the chemical structures (if this
          doesn't happen then the design won't be tested)
   - anticipate the problems of mapping crystallographic atoms
          onto chemical connection tables. These are primarily:
           - symmetry
           - disorder
           - unreported atoms
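
The three mapping problems in Peter's last point, and his earlier 
remark about keeping the referential integrity of the pointers, can be 
made concrete with a small sketch.  It is only an illustration: the 
atom labels, the dictionary layout and the function check_links are 
all hypothetical and are not taken from any CIF dictionary or from 
CML.  Each molecular atom carries zero or more pointers to a crystal 
site together with a symmetry code.

```python
# Hypothetical illustration only: labels and layout are assumptions,
# not data names from any CIF dictionary.

# Sites reported in the asymmetric unit. O2A and O2B stand for two
# disorder components of a single chemical atom.
crystal_sites = {"C1", "O1", "O2A", "O2B"}

# Each atom of the chemical connection table points at zero or more
# (site label, symmetry code) pairs.
molecule = {
    "c1": [("C1", "1_555")],
    "o1": [("O1", "1_555")],
    "o1'": [("O1", "2_655")],                    # symmetry: second copy of the same site
    "o2": [("O2A", "1_555"), ("O2B", "1_555")],  # disorder: one atom, two sites
    "h1": [],                                    # unreported: no site to point at
    "n1": [("N1", "1_555")],                     # broken pointer: N1 is not a reported site
}


def check_links(molecule, crystal_sites):
    """Return (dangling, unreported) lists of molecular atom labels.

    dangling:   atoms with at least one pointer to a site that does not exist
    unreported: atoms with no pointer at all
    """
    dangling = sorted(
        atom
        for atom, links in molecule.items()
        if any(site not in crystal_sites for site, _ in links)
    )
    unreported = sorted(atom for atom, links in molecule.items() if not links)
    return dangling, unreported


dangling, unreported = check_links(molecule, crystal_sites)
print("dangling pointers:", dangling)
print("unreported atoms:", unreported)
```

Even this toy version shows why an editing tool would be needed: 
renaming or deleting a crystal site silently breaks every pointer that 
refers to it unless something re-runs a check of this kind.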

Around the time when we received this advice from Peter we decided to 
defer any further discussion pending the adoption of DDLm as we thought 
that dictionaries written in DDLm might simplify our task.  That was in 
January 2005.  This project has since been dormant, but as an alpha 
version of DDLm has now been approved by COMCIFS the time has come to 
dust off this project.  In the light of Peter's comments we might wish 
to consider whether the definition of a molecule is even an idea worth 
pursuing.

Development of DDLm (Dictionary Definition Language (methods))
-----------------------------------------------------------------------------------
DDLm is a language for writing CIF dictionaries.  It introduces a number 
of new features relevant to this discussion.  The first and most 
important is that DDLm is backwardly compatible with all existing CIFs.  
Programs using CIF dictionaries written in DDLm will be able to read all 
existing CIFs (small cell and macromolecular).  Further, they will be 
able to interpret them using the advanced DDLm features.  These include 
'methods', i.e., executable algorithms embedded in the CIF dictionary 
that define how the value of an item can be calculated from other items 
that may be present in a CIF.  DDLm also makes it easy to generate a 
virtual run-time dictionary by combining all or parts of a number of 
existing dictionaries.

DDLm was prepared by Syd Hall and Nick Spadaccini, who demonstrated a 
proof-of-concept version of a core dictionary at the Glasgow Congress in 
1999.   At the Florence Congress in 2005, COMCIFS agreed to evaluate 
DDLm and see if it would be suitable for the crystallographic 
community.  James Hester has carried out this evaluation and at the 
Osaka Congress in 2008 he recommended that with some minor modification 
it should be adopted.  COMCIFS duly approved this as the alpha release 
of DDLm.  Details are available on the IUCr web site.

As part of Nick and Syd's proof-of-concept demonstration in 1999, they 
prepared a small CIF dictionary which is serving as the basis of the 
DDLm coreCIF dictionary that will be submitted to COMCIFS for adoption.  
In the course of this development a number of philosophical questions 
have arisen.  These were discussed by COMCIFS in Osaka and the following 
principles were agreed on:

1. Items will be classified as basic or derived.  Basic items are those 
that are experimentally measured (e.g., measured density) or assigned 
(e.g., space group).  These are items that cannot be derived from other 
CIF items.  Derived items are those that can be calculated if the 
appropriate basic items are present.

2. For derived items, the method (i.e., the equation used to calculate 
the value of the item) given in the dictionary will be the primary 
definition and will take precedence over the text definition.  Therefore 
only one method may be specified and this takes precedence over any 
alternative, but equivalent, ways of calculating the value of the item.  
This requires that there be a consensus on the best method of 
calculation to use; for example, which coordinate system should be 
used in the calculation?

3. A consequence of (2) is that the derived items must follow a 
hierarchy, with the basic items forming the foundation and the derived 
items defined from various basic items following a unique route.  
Requesting an item may initiate a cascade of calculations, e.g., 
calculated density <--- cell volume <---- cell constants.  Thus the 
request for a particular item may populate the CIF with various 
intermediate items.  These intermediate items may, for computational 
convenience, duplicate information already appearing in the CIF.  In 
previous CIF dictionaries this has been discouraged by providing only 
one way of presenting a given piece of information.  What are the 
implications of this development?
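
To make the cascade in point 3 concrete, here is a minimal sketch in 
Python.  It is not written in the DDLm method language and does not 
reproduce any actual dictionary: the item names (a, b, c, alpha, beta, 
gamma, z, formula_weight, cell_volume, density_calc), the METHODS 
table and the request function are all assumptions made for 
illustration.  The formulas themselves are the standard triclinic cell 
volume and the usual density expression Z*M/(N_A*V).

```python
import math

AVOGADRO = 6.02214076e23  # mol^-1


def cell_volume(a, b, c, alpha, beta, gamma):
    """Triclinic cell volume in A^3 (lengths in A, angles in degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def density(volume, z, formula_weight):
    """Density in g/cm^3 from volume (A^3), Z and formula weight (g/mol)."""
    return z * formula_weight / (AVOGADRO * volume * 1e-24)


# Hypothetical stand-in for dictionary methods:
# derived item -> (items it needs, function that computes it).
METHODS = {
    "cell_volume": (("a", "b", "c", "alpha", "beta", "gamma"), cell_volume),
    "density_calc": (("cell_volume", "z", "formula_weight"), density),
}


def request(cif, name):
    """Return the value of an item, computing and storing it if absent.

    Intermediate derived items are stored in the CIF as a side effect.
    """
    if name in cif:
        return cif[name]
    if name not in METHODS:
        raise KeyError(f"basic item {name!r} is missing and cannot be derived")
    needs, func = METHODS[name]
    cif[name] = func(*(request(cif, n) for n in needs))
    return cif[name]


# A CIF holding basic items only (made-up values).
cif = {"a": 5.0, "b": 6.0, "c": 7.0,
       "alpha": 90.0, "beta": 90.0, "gamma": 90.0,
       "z": 4, "formula_weight": 100.0}

rho = request(cif, "density_calc")
print(sorted(cif))  # cell_volume now appears alongside density_calc
```

The single request for the density leaves the cell volume in the CIF 
as well, which is exactly the kind of intermediate item described 
above.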

The current versions of CIF are like a language that contains only nouns 
and adjectives.  It is great for describing a static object like a 
crystal structure.  It gives a description that can be published, 
archived, retrieved and examined.  Adding Methods is like adding verbs 
to this language.  With them, a CIF can grow, triggered into adding 
derived items by a simple request.  The relation between user and CIF 
becomes interactive.  In this scenario how does one ensure that the 
derived items are all updated when a basic item is changed?  Changing 
the lattice parameters does not automatically update the cell volume, 
and calculating the density will not necessarily force an update of the 
cell volume if the CIF already contains an earlier (obsolete) value.  
Murphy's Law (in its correct original version) states that 'if something 
can happen, sooner or later it will happen'.  We need to anticipate the 
different ways in which a DDLm CIF might be used.
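
The stale-value problem is easy to reproduce in a few lines.  The 
sketch below is hypothetical throughout: the item names, the METHODS 
table and the functions request and set_basic are my own illustration, 
and an orthogonal cell is assumed so that the volume is simply a*b*c.  
It shows the failure and one possible remedy, namely discarding every 
derived item that depends on a basic item whenever that item is 
changed.  It is not a statement of how DDLm software does or should 
behave.

```python
# Hypothetical dependency table: derived item -> (inputs, function).
# An orthogonal cell is assumed, so the volume is simply a*b*c.
METHODS = {
    "cell_volume": (("a", "b", "c"), lambda a, b, c: a * b * c),
    # 1.66053907 = 1e24 / Avogadro's number, giving g/cm^3 for V in A^3
    "density_calc": (("cell_volume", "z", "formula_weight"),
                     lambda v, z, m: 1.66053907 * z * m / v),
}


def request(cif, name):
    """Return a stored value if present; otherwise compute and store it."""
    if name not in cif:
        needs, func = METHODS[name]
        cif[name] = func(*(request(cif, n) for n in needs))
    return cif[name]


def set_basic(cif, name, value):
    """Change a basic item and discard every derived item that depends on it."""
    cif[name] = value
    pending = [d for d, (needs, _) in METHODS.items() if name in needs]
    while pending:
        derived = pending.pop()
        cif.pop(derived, None)  # drop the obsolete value if one is stored
        pending += [d for d, (needs, _) in METHODS.items() if derived in needs]


cif = {"a": 5.0, "b": 6.0, "c": 7.0, "z": 4, "formula_weight": 100.0}
request(cif, "density_calc")                 # stores cell_volume and density_calc

cif["a"] = 10.0                              # naive edit of a basic item
stale_volume = request(cif, "cell_volume")   # the old value is returned unchanged

set_basic(cif, "a", 10.0)                    # edit that also invalidates dependents
fresh_volume = request(cif, "cell_volume")
fresh_density = request(cif, "density_calc")
print(stale_volume, fresh_volume)
```

Invalidation on write is only one option.  Others, such as always 
recomputing derived items or recording which basic values each derived 
item was calculated from, trade storage and speed against safety in 
different ways.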

This email is intended to alert you to some of the problems that lie ahead.  
I intend to use these discussion lists as a sounding board for specific 
problems as I work on the development of the DDLm coreCIF dictionary.  I 
look forward to receiving your comments on the points I raise, 
particularly warnings if we seem to be straying into dangerous directions.

Watch this space.

David Brown