A DDLm problem

Tue Feb 24 19:38:19 GMT 2009

Dear Colleagues,

I have now resumed work converting the coreCIF dictionary to the DDLm 
standard.  This email describes a problem I have encountered in the 
treatment of intermediates generated during the application of a 
method.  I need your feedback to ensure that the solution I am 
suggesting is generally acceptable.  Skip to the end of the email if you 
want to know my proposed solution (though I recommend reading the rest 
of the email to find out what problem the solution is designed to 
resolve).

Let me remind you that the main new feature in DDLm is the inclusion of 
'methods' in the CIF dictionaries.  Methods are machine-executable 
algebraic expressions that can be used by a program to calculate the 
value of a derived item from measured or assigned items in the CIF.

You are receiving this email because you are on one or more mailing 
lists of people whose advice and approval I need for dealing with a 
number of issues that methods raise.  It is important that these issues 
be discussed while there is still time to influence the decisions that 
will have to be made.  The first of these issues is described in this 
email. 

I am circulating this email to two lists, and if you are on both you 
will get it twice.  I apologize and recommend that you quickly delete 
the second copy (unless you wish to reply to both lists).

Read on.

METHODS ARE THE NEW DEFINITIONS
At the meeting of COMCIFS in Osaka it was decided that when a method is 
present in the dictionary it takes precedence over text in defining the 
item.  One immediate corollary to this is that only one method is 
allowed for each data item.  If a program call is made to an item that 
is not present in the CIF, the method will initiate a call to the other 
items needed to calculate its value.  These in turn may call further 
items.  The route between a derived item and the measured or assigned 
values in terms of which it is ultimately defined constitutes a tree, 
and because the method is the definition, the tree must be unique.
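
The call sequence described above can be sketched in a few lines of 
Python.  This is only an illustration under stated assumptions: the 
function make_resolver, the toy item names a, b and c, and the choice to 
store each derived value back with the data are all hypothetical and are 
not taken from DDLm or from any existing CIF software.

```python
# Hypothetical sketch: none of these names come from DDLm or dREL.
from typing import Callable, Dict

Getter = Callable[[str], float]
Method = Callable[[Getter], float]


def make_resolver(cif: Dict[str, float], methods: Dict[str, Method]) -> Getter:
    """Return get(name): read the item from the CIF, else run its one method."""

    def get(name: str) -> float:
        if name in cif:              # item is present: use the stored value
            return cif[name]
        if name not in methods:      # a missing leaf of the tree
            raise KeyError(f"{name} is absent and has no method")
        value = methods[name](get)   # the method calls the items it needs
        cif[name] = value            # assumption: derived values are stored
        return value

    return get


# Toy items: 'c' is derived from 'a' and 'b'; 'b' is derived from 'a'.
methods = {
    "b": lambda get: 2.0 * get("a"),
    "c": lambda get: get("a") + get("b"),
}
cif = {"a": 1.5}
get = make_resolver(cif, methods)
c_value = get("c")   # calls 'a' and 'b'; 'b' in turn calls 'a'
```

Because each item has at most one method, the chain of calls from c down 
to a is fixed by the dictionary alone, which is the sense in which the 
tree is unique.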

'If-then' constructions give rise to branching.   They must therefore be 
treated with care because they alter the definition of an item according 
to the value of the CIF item that is tested.  It is reasonable to 
include such a construction if the branching depends, e.g., on whether 
the structure was determined by x-ray or neutron diffraction, but is it 
wise to use it if it depends only on the way the CIF is structured, 
rather than the conditions of the experiment that is being reported?  In 
other words, should a definition in this case be made to depend on the 
value of a second item that might not actually be present in the CIF?

AN EXAMPLE OF A DEFINITION TREE
Atomic displacement parameters (ADPs) illustrate the kind of problems 
that can arise.  ADPs are expressed in a number of different forms such 
as B, U and beta (the latter with two different definitions).  
Furthermore, each form also has an isotropic and an anisotropic 
version.  When the original core dictionary was prepared we decided to 
standardize on U because it has a direct physical significance in real 
space.  Standardizing on a single form simplifies programming because 
anyone reading a CIF can rely on finding the ADPs in the U form.  
However, in response to an insistent request from the macromolecular 
community, we also allowed ADPs to be given in the B form since this 
form is universally used in that field.  Thus anyone reading a CIF must 
now be prepared to find ADPs in either the U or the B form.  There is 
currently no definition of the beta form.

The ADP is a measured quantity and therefore cannot strictly be 
calculated, but with DDLm we can make life easier by adding a method to 
the definition of U that will calculate U from B if B, but not U, is 
present.   This gives rise to the tree 1.

1    U -> B
(The arrow -> indicates that U calls B to convert the contents of B to U.)

U is now treated as a derivable item; but the value appearing in the CIF 
may be either directly measured or may be derived from the directly 
measured B if the ADP was originally stored in the CIF as B.  
Introducing this method means that any program looking for an ADP now 
only has to look for U.  If the ADPs are given in the B form, a call to 
U will automatically result in a conversion from B to U.  The reverse of 
course will not be true since the tree is unique and cannot be read 
backwards.
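
A minimal sketch of tree 1, assuming the standard relation B = 8*pi^2*U.  
The short keys "U" and "B" and the function get_u are stand-ins chosen 
for this example, not real datanames or an existing API.

```python
import math
from typing import Dict

EIGHT_PI_SQUARED = 8.0 * math.pi ** 2   # B = 8 * pi^2 * U


def get_u(cif: Dict[str, float]) -> float:
    """Return U, converting from B only when U itself is absent."""
    if "U" in cif:                       # U present: B is never consulted
        return cif["U"]
    if "B" in cif:                       # fall back to the measured B
        return cif["B"] / EIGHT_PI_SQUARED
    raise KeyError("neither U nor B is present")


u_direct = get_u({"U": 0.025, "B": 9.9})   # U wins; the B value is ignored
u_from_b = get_u({"B": 2.0})               # converted from B
```

There is deliberately no matching get_b that falls back on U: the sketch 
mirrors the one-way direction of the tree.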

However, life is more complicated than this because the ADPs may be 
isotropic or anisotropic, so the full sequence is shown as tree 2:

2    Uaniso  -> Baniso -> Uiso -> Biso 

which is beginning to look a little cumbersome but, hey, the computer 
doesn't care if it has to make three conversions instead of just one, 
and one can always express an isotropic ADP in the form of an 
anisotropic one.  Of course if an external program intercepts this 
sequence further down, say at Baniso, it will fail to find an ADP given 
as Uaniso.  There is no way around this difficulty, but if the hierarchy 
is defined and understood, it shouldn't be a problem.  Note that the 
iso- versions will only be called if no aniso versions are present so 
there is no problem if both are given.  If the Ueq (which is stored as 
Uiso) is required the sequence can be intercepted at Uiso.   If U and B 
are both present (by some accident) then U automatically takes 
precedence since if U is present there will be no call to B.
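
The sequence in tree 2 might be sketched as follows.  The keys "Uaniso", 
"Baniso", "Uiso" and "Biso" are shorthand for the real datanames, the 
function names are invented, and the derived values are assumed to be 
stored as they are generated.  For simplicity the sketch also assumes 
orthogonal cell axes, so that an isotropic ADP expands to equal diagonal 
terms and zero off-diagonal terms; a general cell would need the 
reciprocal-cell angles.

```python
import math
from typing import Dict, Tuple, Union

EIGHT_PI_SQUARED = 8.0 * math.pi ** 2
Aniso = Tuple[float, ...]                 # order: 11, 22, 33, 12, 13, 23
Cif = Dict[str, Union[float, Aniso]]


def u_iso(cif: Cif) -> float:
    """Uiso -> Biso: convert only if Uiso is absent."""
    if "Uiso" not in cif:
        cif["Uiso"] = cif["Biso"] / EIGHT_PI_SQUARED  # KeyError if no Biso
    return cif["Uiso"]


def b_aniso(cif: Cif) -> Aniso:
    """Baniso -> Uiso: expand the isotropic value (orthogonal axes assumed)."""
    if "Baniso" not in cif:
        b = EIGHT_PI_SQUARED * u_iso(cif)
        cif["Baniso"] = (b, b, b, 0.0, 0.0, 0.0)
    return cif["Baniso"]


def u_aniso(cif: Cif) -> Aniso:
    """Head of the tree: Uaniso -> Baniso -> Uiso -> Biso."""
    if "Uaniso" not in cif:
        cif["Uaniso"] = tuple(b / EIGHT_PI_SQUARED for b in b_aniso(cif))
    return cif["Uaniso"]


cif = {"Biso": 1.6}
result = u_aniso(cif)   # Uiso and Baniso are generated along the way
```

Each function tests only for the presence of its own item, so an aniso 
value, when present, is returned before any iso item is consulted.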

So far so good.  No real problems.

THE PROBLEM OF INTERMEDIATE ITEMS
Some method calls may involve the calculation of intermediate items not 
currently included in the dictionary.  An example, taken from the 
proof-of-concept DDLm CIF dictionary, is the beta form of the ADP, which 
is used in the calculation of structure factors because it makes the 
calculation simpler.  The method for generating beta uses an 'if-then' 
construction to test the value of _atom_site.adp_type to decide whether 
to calculate beta from Uiso, Biso, Uaniso or Baniso.  The method for 
_atom_site.beta thus contains several different algorithms for making 
the conversion, depending on the value of .adp_type.  This means that 
the definition is not unique but is determined by the value of an item 
that is not itself part of the tree.  Problems could arise if .adp_type 
is missing (a distinct possibility) or if it does not correctly describe 
the ADP format, either of which would  cause the calculation to fail.  
The sequential calls described in tree 2 are more robust because the 
calculation is based on the presence or absence of the items in the tree 
itself.

Since beta is an intermediate which must have its own method, it must 
have a dictionary definition and it must have a place in tree 2.  Only 
the beginning or the end of the tree makes any sense, and since it cannot 
go at the end for various obvious reasons, not the least being that in 
that position it is not allowed to have a method, the only logical place 
is at the beginning of the sequence:

3    beta -> Uaniso -> Baniso -> Uiso -> Biso 

Since external programs should access ADPs using the item at the head of 
the sequence, they should call beta rather than Uaniso, even though U 
has now superseded beta in normal use.  We cannot pretend that beta does 
not exist since it is defined in the dictionary, and sooner or later 
someone is going to archive their measured ADPs in this form.  This 
almost certainly will happen since many early papers report the ADPs in 
the beta form and they are stored this way in the ICSD (for example).  
ICSD has indicated that they would prefer to output ADPs from these 
early papers in the beta form if it were available.  However, the ADPs 
would not then be found by an external program that calls Uaniso, and 
most modern programs would consider it retrograde to have to call beta.

ANOTHER GOOD REASON FOR DISCOURAGING THE USE OF BETA
There is an excellent unrelated reason for not wanting to include beta 
in the dictionary.  There are two incompatible definitions of beta, 
depending on whether the '2' in the cross term is explicit or implicit.  
Unfortunately, most papers that report betas do not state which 
convention they use, which makes the information virtually useless.  The 
definition in the proof-of-concept dictionary arbitrarily chooses one 
of these conventions, but many people are not aware that there is an 
ambiguity, and allowing people to archive ADPs as betas would likely 
lead to many incorrect CIFs.  CIF should strive to avoid making it easy 
for such errors to occur.
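
The practical effect of the ambiguity can be shown numerically.  The 
sketch below assumes the usual exponent for the beta form, 
beta11*h^2 + beta22*k^2 + beta33*l^2 plus the cross terms, with the 
cross terms either multiplied by 2 (explicit '2') or taken as written 
(implicit '2').  The function name and the numerical values are invented 
for the illustration.

```python
import math
from typing import Sequence


def temperature_factor(beta: Sequence[float], hkl: Sequence[int],
                       explicit_two: bool) -> float:
    """exp(-exponent) for betas ordered 11, 22, 33, 12, 13, 23."""
    b11, b22, b33, b12, b13, b23 = beta
    h, k, l = hkl
    cross = b12 * h * k + b13 * h * l + b23 * k * l
    if explicit_two:                 # convention with the '2' written out
        cross *= 2.0
    return math.exp(-(b11 * h * h + b22 * k * k + b33 * l * l + cross))


# The same archived numbers, read under the two conventions.
beta = (0.002, 0.003, 0.004, 0.001, 0.0, 0.0)
t_explicit = temperature_factor(beta, (1, 2, 0), explicit_two=True)
t_implicit = temperature_factor(beta, (1, 2, 0), explicit_two=False)
```

The two results differ whenever a cross term is non-zero, and nothing in 
the stored values themselves reveals which reading was intended.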

POSSIBLE SOLUTIONS
Tree 3 above can be made to work if the beta form can be made 
invisible.  It cannot be completely invisible as it must appear in the 
appropriate CIF dictionary, and its text description will be displayed 
by any CIF editor such as publCIF or enCIFer.

One possible solution is to include a flag in the dictionary definition 
to indicate that the item should be hidden from the user or deleted 
after the calculation is complete.

A second possibility is to give the item a dataname that disguises its 
identity, e.g., a name such as _atom_site_aniso.intermediate1. The 
dictionary would contain the .description 'This item is an intermediate 
in an ADP calculation and is not to be used for archival or retrieval 
purposes'.

A third solution would be to rearrange the method for calculating the 
structure factors so that it works directly with Uaniso and does 
not generate beta as an intermediate.  In this case there is no need to 
define beta in the dictionary.

THE SOLUTION I PROPOSE TO ADOPT
I propose that we adopt the second solution, i.e., we agree to use 
datanames and descriptions that indicate that the item is not to be used 
for archival purposes.  There will of course be many intermediates that 
are perfectly acceptable for archiving.  For example, when Uaniso is 
calculated from Biso, Uiso and Baniso are generated along the way and 
there is no need to hide them.  Calculation of structure factors would 
also generate _atom_site_aniso.intermediate1 items containing the ADPs in the 
beta form, so the CIF may end up being a bit cluttered, but it should be 
possible to write a program that would clean up the CIF by removing any 
unwanted intermediate items.  In any case since external programs only 
need search for Uaniso, the presence of the other items will be of no 
concern.
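
A clean-up program of the kind mentioned above could be as simple as the 
following sketch.  It assumes that every unwanted intermediate has a 
dataname ending in '.intermediate' followed by a number, as in the 
example name given earlier; that naming rule, the function name and the 
sample values are assumptions made for this illustration.

```python
import re
from typing import Any, Dict

# Assumed naming rule: '<category>.intermediate<number>'.
INTERMEDIATE = re.compile(r"\.intermediate\d+$")


def strip_intermediates(cif: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the data without the disguised intermediate items."""
    return {name: value for name, value in cif.items()
            if not INTERMEDIATE.search(name)}


cif = {
    "_atom_site.U_iso_or_equiv": 0.021,
    "_atom_site_aniso.U_11": 0.020,
    "_atom_site_aniso.intermediate1": 0.0031,   # beta-form working value
}
cleaned = strip_intermediates(cif)
```

Intermediates that are acceptable for archiving, such as Uiso, keep 
their ordinary datanames and so pass through untouched.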

I am looking for feedback.  However, if I receive none, I will assume 
that you agree that unwanted intermediates should be hidden by giving 
them meaningless datanames and text definitions that conceal their content.

Please circulate your thoughts on this problem to the whole discussion list.

David Brown