A DDLm problem
David Brown
idbrown at mcmaster.ca
Tue Feb 24 19:38:19 GMT 2009
Dear Colleagues,
I have now resumed work converting the coreCIF dictionary to the DDLm
standard. This email describes a problem I have encountered in the
treatment of intermediates generated during the application of a
method. I need your feedback to ensure that the solution I am
suggesting is generally acceptable. Skip to the end of the email if you
want to know my proposed solution (though I recommend reading the rest
of the email to find out what problem the solution is designed to
resolve).
Let me remind you that the main new feature in DDLm is the inclusion of
'methods' in the CIF dictionaries. Methods are machine-executable
algebraic expressions that can be used by a program to calculate the
value of a derived item from measured or assigned items in the CIF.
You are receiving this email because you are on one or more mailing
lists of people whose advice and approval I need for dealing with a
number of issues that methods raise. It is important that these issues
be discussed while there is still time to influence the decisions that
will have to be made. The first of these issues is described in this
email.
I am circulating this email to two lists, and if you are on both you
will get it twice. I apologize and recommend that you quickly delete
the second copy (unless you wish to reply to both lists).
Read on.
METHODS ARE THE NEW DEFINITIONS
At the meeting of COMCIFS in Osaka it was decided that when a method is
present in the dictionary it takes precedence over text in defining the
item. One immediate corollary to this is that only one method is
allowed for each data item. If a program call is made to an item that
is not present in the CIF, the method will initiate a call to the other
items needed to calculate its value. These in turn may call further
items. The route between a derived item and the measured or assigned
values in terms of which it is ultimately defined constitutes a tree,
and because the method is the definition, the tree must be unique.
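As a rough illustration only, the calling sequence described above can be sketched in Python. The resolve function, the dictionary of methods and the choice of data items are assumptions made for this sketch; they are not taken from any DDLm dictionary, and the volume method assumes an orthogonal cell purely to keep the example short.

```python
# Hypothetical sketch: each "method" is a function that asks the resolver
# for the items it needs. None of this is actual dictionary content.

def resolve(name, cif, methods):
    """Return the value of `name`, calculating it if it is not in the CIF."""
    if name in cif:                       # measured or assigned value present
        return cif[name]
    if name not in methods:               # end of the tree, and nothing stored
        raise KeyError(f"{name} is absent and has no method")
    # The method calls back into the resolver for each item it depends on.
    value = methods[name](lambda dep: resolve(dep, cif, methods))
    cif[name] = value                     # the derived item is now available
    return value

# One method per item, so the route back to the stored values is unique.
methods = {
    # ASSUMPTION: orthogonal cell, so the volume is simply a * b * c.
    "_cell.volume": lambda get: (
        get("_cell.length_a") * get("_cell.length_b") * get("_cell.length_c")
    ),
    # Density in g/cm^3 from Z, formula weight and volume in cubic angstroms.
    "_exptl_crystal.density_diffrn": lambda get: (
        1.66054 * get("_cell.formula_units_Z") * get("_chemical_formula.weight")
        / get("_cell.volume")
    ),
}

cif = {
    "_cell.length_a": 5.0,
    "_cell.length_b": 6.0,
    "_cell.length_c": 7.0,
    "_cell.formula_units_Z": 4,
    "_chemical_formula.weight": 100.0,
}

density = resolve("_exptl_crystal.density_diffrn", cif, methods)
```

The call for the density triggers a call for the volume, which in turn calls the three cell lengths; the volume is left in the CIF as an intermediate.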
'If-then' constructions give rise to branching. They must therefore be
treated with care because they alter the definition of an item according
to the value of the CIF item that is tested. It is reasonable to
include such a construction if the branching depends, e.g., on whether
the structure was determined by x-ray or neutron diffraction, but is it
wise to use it if it depends only on the way the CIF is structured,
rather than the conditions of the experiment that is being reported? In
other words, should a definition in this case be made to depend on the
value of a second item that might not actually be present in the CIF?
AN EXAMPLE OF A DEFINITION TREE
Atomic displacement parameters (ADPs) illustrate the kind of problems
that can arise. ADPs are expressed in a number of different forms such
as B, U and beta (the latter with two different definitions).
Furthermore, each form also has an isotropic and an anisotropic
version. When the original core dictionary was prepared we decided to
standardize on U because it has a direct physical significance in real
space. Standardizing on a single form simplifies programming because
anyone reading a CIF can rely on finding the ADPs in the U form.
However, in response to an insistent request from the macromolecular
community, we also allowed ADPs to be given in the B form since this
form is universally used in that field. Thus anyone reading a CIF must
now be prepared to find ADPs in either the U or the B form. There is
currently no definition of the beta form.
The ADP is a measured quantity and therefore cannot strictly be
calculated, but with DDLm we can make life easier by adding a method to
the definition of U that will calculate U from B if B, but not U, is
present. This gives rise to tree 1.
1 U -> B
(The arrow -> indicates that U calls B to convert the contents of B to U.)
U is now treated as a derivable item; but the value appearing in the CIF
may be either directly measured or may be derived from the directly
measured B if the ADP was originally stored in the CIF as B.
Introducing this method means that any program looking for an ADP now
only has to look for U. If the ADPs are given in the B form, a call to
U will automatically result in a conversion from B to U. The reverse of
course will not be true since the tree is unique and cannot be read
backwards.
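A minimal sketch of tree 1 in Python, using the standard relation B = 8 pi^2 U. The function name get_u and the keys "U" and "B" are shorthand invented for this illustration; they are not real datanames.

```python
import math

EIGHT_PI_SQ = 8 * math.pi ** 2   # B = 8 * pi^2 * U


def get_u(cif):
    """Return U, converting from B only when U itself is absent."""
    if "U" in cif:               # U present: no call to B is ever made
        return cif["U"]
    if "B" in cif:               # U absent: the method calls B and converts
        return cif["B"] / EIGHT_PI_SQ
    raise KeyError("neither U nor B is present")
```

There is deliberately no corresponding get_b function: the tree is read in one direction only.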
However, life is more complicated than this because the ADPs may be
isotropic or anisotropic, so the full sequence is shown as tree 2:
2 Uaniso -> Baniso -> Uiso -> Biso
which is beginning to look a little cumbersome but, hey, the computer
doesn't care if it has to make three conversions instead of just one,
and one can always express an isotropic ADP in the form of an
anisotropic one. Of course if an external program intercepts this
sequence further down, say at Baniso, it will fail to find an ADP given
as Uaniso. There is no way around this difficulty, but if the hierarchy
is defined and understood, it shouldn't be a problem. Note that the
iso- versions will only be called if no aniso versions are present so
there is no problem if both are given. If the Ueq (which is stored as
Uiso) is required the sequence can be intercepted at Uiso. If U and B
are both present (by some accident) then U automatically takes
precedence since if U is present there will be no call to B.
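The chain in tree 2 might be sketched in Python as follows. The keys are shorthand for the real datanames, the six anisotropic components are assumed to be stored in the order 11, 22, 33, 12, 13, 23, and the isotropic-to-anisotropic step assumes orthogonal axes so that the off-diagonal terms vanish. All of these are simplifying assumptions for the illustration.

```python
import math

EIGHT_PI_SQ = 8 * math.pi ** 2   # B = 8 * pi^2 * U


def biso(cif):
    return cif["Biso"]           # end of the tree: must be stored, no method


def uiso(cif):
    if "Uiso" in cif:
        return cif["Uiso"]
    return biso(cif) / EIGHT_PI_SQ


def baniso(cif):
    if "Baniso" in cif:
        return cif["Baniso"]
    b = uiso(cif) * EIGHT_PI_SQ
    # ASSUMPTION: orthogonal axes, so the cross terms are zero.
    return (b, b, b, 0.0, 0.0, 0.0)


def uaniso(cif):
    if "Uaniso" in cif:
        return cif["Uaniso"]
    return tuple(b / EIGHT_PI_SQ for b in baniso(cif))
```

A call to uaniso on a CIF that holds only Biso makes the three conversions in turn, while a CIF that holds both an aniso and an iso form never reaches the iso functions.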
So far so good. No real problems.
THE PROBLEM OF INTERMEDIATE ITEMS
Some method calls may involve the calculation of intermediate items not
currently included in the dictionary. An example, taken from the
proof-of-concept DDLm CIF dictionary, is the beta form of the ADP, which
is used in the calculation of structure factors because it makes the
calculation simpler. The method for generating beta uses an 'if-then'
construction to test the value of _atom_site.adp_type to decide whether
to calculate beta from Uiso, Biso, Uaniso or Baniso. The method for
_atom_site.beta thus contains several different algorithms for making
the conversion, depending on the value of .adp_type. This means that
the definition is not unique but is determined by the value of an item
that is not itself part of the tree. Problems could arise if .adp_type
is missing (a distinct possibility) or if it does not correctly describe
the ADP format, either of which would cause the calculation to fail.
The sequential calls described in tree 2 are more robust because the
calculation is based on the presence or absence of the items in the tree
itself.
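The difference in robustness can be illustrated with a short Python sketch. Both functions and the shorthand keys are hypothetical, and the flag values shown (Uani, Bani, Uiso, Biso) are assumed to be the relevant subset of the values allowed for adp_type.

```python
ORDER = ("Uaniso", "Baniso", "Uiso", "Biso")


def source_by_flag(cif):
    """Choose the source item from the value of adp_type."""
    table = {"Uani": "Uaniso", "Bani": "Baniso", "Uiso": "Uiso", "Biso": "Biso"}
    name = table[cif["adp_type"]]    # fails if the flag is missing
    return name, cif[name]           # fails if the flag misdescribes the CIF


def source_by_presence(cif):
    """Choose the source item by testing the items in the tree itself."""
    for name in ORDER:
        if name in cif:
            return name, cif[name]
    raise KeyError("no ADP is present in any form")
```

For a CIF containing only Uiso and no adp_type, source_by_presence returns the Uiso value, whereas source_by_flag raises a KeyError even though a usable ADP is present.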
Since beta is an intermediate which must have its own method, it must
have a dictionary definition and it must have a place in tree 2. Only
the beginning or the end of the tree makes any sense, and since it cannot go
at the end for various obvious reasons, not the least being that in that
position it is not allowed to have a method, the only logical place is
at the beginning of the sequence,
3 beta -> Uaniso -> Baniso -> Uiso -> Biso
Since external programs should access ADPs using the item at the head of
the sequence, they should call beta rather than Uaniso, even though U
has now superseded beta in normal use. We cannot pretend that beta does
not exist since it is defined in the dictionary, and sooner or later
someone is going to archive their measured ADPs in this form. This
almost certainly will happen since many early papers report the ADPs in
the beta form and they are stored this way in the ICSD (for example).
ICSD has indicated that they would prefer to output ADPs from these
early papers in the beta form if it were available. However the ADPs
would not then be found by an external program that calls Uaniso and
most modern programs would consider it retrograde to have to call beta.
ANOTHER GOOD REASON FOR DISCOURAGING THE USE OF BETA
There is an excellent unrelated reason for not wanting to include beta
in the dictionary. There are two incompatible definitions of beta,
depending on whether the '2' in the cross term is explicit or implicit.
Unfortunately most papers that report betas do not state which
convention they use, which makes the information virtually useless. The
definition in the proof-of-principle dictionary arbitrarily chooses one
of these conventions, but many people are not aware that there is an
ambiguity, and allowing people to archive ADPs as betas would likely
lead to many incorrect CIFs. CIF should strive to avoid making it easy
for such errors to occur.
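To make the ambiguity concrete, here is a small Python sketch of the exponent of the temperature factor exp(-exponent) under the two conventions. The function name, the argument order and the numerical values are invented for the illustration.

```python
def beta_exponent(beta, hkl, explicit_two):
    """Exponent of the temperature factor for betas in the order
    11, 22, 33, 12, 13, 23.

    explicit_two=True  : the formula itself contains 2*beta12*h*k, etc.
    explicit_two=False : the factor 2 is already absorbed into beta12, etc.
    """
    b11, b22, b33, b12, b13, b23 = beta
    h, k, l = hkl
    cross = b12 * h * k + b13 * h * l + b23 * k * l
    if explicit_two:
        cross *= 2
    return b11 * h * h + b22 * k * k + b33 * l * l + cross


beta = (0.01, 0.01, 0.01, 0.002, 0.0, 0.0)   # made-up values
with_two = beta_exponent(beta, (1, 1, 0), explicit_two=True)
without_two = beta_exponent(beta, (1, 1, 0), explicit_two=False)
```

The same six numbers give different exponents for any reflection with a non-zero cross term, so a reader who guesses the wrong convention obtains cross terms that are wrong by a factor of two.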
POSSIBLE SOLUTIONS
Tree 3 above can be made to work if the beta form can be made
invisible. It cannot be completely invisible as it must appear in the
appropriate CIF dictionary, and its text description will be displayed
by any CIF editor such as publCIF or enCIFer.
One possible solution is to include a flag in the dictionary definition
to indicate that the item should be hidden from the user or deleted
after the calculation is complete.
A second possibility is to give the item a dataname that disguises its
identity, e.g., a name such as _atom_site_aniso.intermediate1. The
dictionary would contain the .description 'This item is an intermediate
in an ADP calculation and is not to be used for archival or retrieval
purposes'.
A third solution would be to rearrange the method for calculating the
structure factors so that it works directly with Uaniso and does
not generate beta as an intermediate. In this case there is no need to
define beta in the dictionary.
THE SOLUTION I PROPOSE TO ADOPT
I propose that we adopt the second solution, i.e., we agree to use
datanames and descriptions that indicate that the item is not to be used
for archival purposes. There will of course be many intermediates that
are perfectly acceptable for archiving. For example, when Uaniso is
calculated from Biso, Uiso and Baniso are generated along the way and
there is no need to hide them. Calculation of structure factors would
also generate _atom_site_aniso.intermediate1 items containing the ADPs in the
beta form, so the CIF may end up being a bit cluttered, but it should be
possible to write a program that would clean up the CIF by removing any
unwanted intermediate items. In any case since external programs only
need search for Uaniso, the presence of the other items will be of no
concern.
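Such a clean-up program could be very short. The Python sketch below assumes that the CIF has already been read into a dictionary keyed by dataname and that every unwanted intermediate carries the string ".intermediate" in its name; both the function and the naming convention are hypothetical.

```python
def strip_intermediates(cif, marker=".intermediate"):
    """Return a copy of the CIF without items whose dataname contains marker."""
    return {name: value for name, value in cif.items() if marker not in name}


cif = {
    "_atom_site_aniso.U_11": 0.021,                  # acceptable for archiving
    "_atom_site_aniso.intermediate1": 0.0004,        # hidden beta-form value
}
cleaned = strip_intermediates(cif)
```

Intermediates that are acceptable for archiving, such as Uiso or Baniso, keep their ordinary datanames and are therefore left in place.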
I am looking for feedback. However, if I receive none, I will assume
that you agree that unwanted intermediates should be hidden by giving
them meaningless datanames and text definitions that conceal their content.
Please circulate your thoughts on this problem to the whole discussion list.
David Brown