DDLm implementation discussion

Tue Mar 10 19:45:39 GMT 2009

Dear Colleagues,

My discussion paper on the use of intermediate computation items in DDLm 
has produced many interesting suggestions but not a lot of consensus.  I 
have included an edited version of the discussion bringing related 
comments together under three headings.  The first deals with the 
desirability or otherwise of defining beta in the dictionary, the second 
addresses the implications of methods serving as definitions, and the 
third the problem of hiding or removing intermediate items in order not 
to clutter or otherwise compromise the CIF.  The solution I am proposing 
as a result of the discussion is that all the derived items (i.e. those 
with methods) should be defined in a 'derived-item' dictionary, while 
the experimental and assigned items will appear in an archive 
dictionary.  Anyone looking for derived items will obtain them if they 
read an archive CIF under control of both the archive and derived-item 
dictionaries.  This is described more fully at the end of this document.

AMBIGUITY IN THE USE OF THE BETA FORM OF THE ADP

Carol Brock (Senior editor of Acta Cryst. C and a user of CIF) in a 
private email provides a strong argument for why betas should be invisible.

What confusion ADPs cause.  I came into crystallography at about the 
time that calls were being made for a switch from betas (used in ORFLS 
and its variants) to Us.  While reading your email a picture of the page 
in ORTEP manual where all the ADP types are listed flashed through my 
mind.  You are so right about betas being nearly useless because of the 
confusion about the factor of two.  I remember drawing ellipsoids with 
and without the factor of two for structures in the literature when 
attempting (usually in vain) to figure out which form had been used.  
Please be sure no calculated beta is ever available where somebody might 
use it in the archival literature.

I would be very happy to help argue with the IT people.  Nobody should 
ever see a beta again.  A complication is that it is only the fossils 
who remember how much confusion betas caused.
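
The factor-of-two ambiguity described above can be illustrated with a
short Python sketch.  It assumes the six betas are supplied as bare
numbers with no record of the convention, and that one program applies
the off-diagonal terms as 2*beta_ij*h_i*h_j while another applies them
as beta_ij*h_i*h_j.  The function name and the flag are hypothetical and
are not taken from any existing program.

```python
import math

def temperature_factor(hkl, betas, doubled_off_diagonal):
    """Anisotropic temperature factor from six betas.

    betas is (b11, b22, b33, b12, b13, b23).  The flag selects which of
    the two historical conventions is assumed for the cross terms.
    """
    h, k, l = hkl
    b11, b22, b33, b12, b13, b23 = betas
    diagonal = b11 * h * h + b22 * k * k + b33 * l * l
    cross = b12 * h * k + b13 * h * l + b23 * k * l
    factor = 2.0 if doubled_off_diagonal else 1.0
    return math.exp(-(diagonal + factor * cross))

# The same six numbers give two different answers for reflection (1 1 0).
betas = (0.01, 0.01, 0.01, 0.005, 0.0, 0.0)
with_two = temperature_factor((1, 1, 0), betas, doubled_off_diagonal=True)
without_two = temperature_factor((1, 1, 0), betas, doubled_off_diagonal=False)
```

Unless the convention travels with the numbers, a reader has no way of
telling which of the two results was intended.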

Doug Duboulay (DD) on the other hand provides equally persuasive 
arguments why beta should be accessible.

If I understand correctly, beta is the *only* true tensor form of the
ADPs. If you want to convert between different unit cells, transforming
the Uijs is only possible by converting to tensor form before the
symmetry transform and back to Uijs afterwards.
To obfuscate its role in a definitive treatise seems lacking.
Of course I may be wrong :))

Nick Spadaccini (NS)
The _matrix_beta item is an item that does warrant definition and as
Doug states is a true tensorial form of ADPs. David alludes to another
problem that does not exist - there can be NO confusion in the definition of
the individual off-diagonal beta terms (historically there is a factor of 2
confusion). Why? Because the individual off diagonal terms can never be
accessed because there is no definition for them. There is never an error in
building the _matrix_beta because it is constructed directly from the U or B
matrices (where there is no confusion).

IDB Comment:
Nick is referring here to the way matrices are constructed from the 
individual matrix elements stored in the CIF.  The assumption is that 
experimental information will be given as matrix elements and the 
matrices themselves would only be created under dictionary control.  So 
Nick is right as long as no-one enters the beta matrix as a matrix 
because individual matrix elements do not appear in the dictionary.  
However, there is nothing I am aware of in DDLm that prevents the matrix 
from being generated by external software or a word processor, and the 
convenience of doing so may encourage users to take this shortcut.  The 
question is how to prevent this.

METHODS AS DEFINITIONS

 >> METHODS ARE THE NEW DEFINITIONS
 >> At the meeting of COMCIFS in Osaka it was decided that when a method is
 >> present in the dictionary it takes precedence over text in defining the
 >> item.  One immediate corollary to this is that only one method is
 >> allowed for each data item.

DD: I am not sure why that is a corollary.
Why is it not possible to have fallback methods when a particular
evaluation strategy fails?

My understanding of the derivation pathway was more like:

              U11,U22,...   B11,B22...
                /           /
 3    beta -> Uaniso -> Baniso
                 \         \
                  Uiso       Biso

i.e. there are discrete component forms  as well as matrix forms
as well as tensor forms.

Also, for calculating H atom ADPs isn't there an algorithm based on 
Uiso of their coordinating C,N,O atoms?
How are you going to get Uiso from Uaniso?

IDB Comment:
Uiso is not the same as Uequiv.  If Uequiv is required we should define 
Uequiv -> Uaniso -> Uiso.  Presumably Uequiv will be the same as Uiso if 
no value of Uaniso is provided, providing we supply the correct methods.

NS: David suggests there is a problem with the fact that
different evaluation pathways exist for obtaining a value for an item, 
i.e., that having multiple paths is problematic, whereas it is the 
express design in dREL.

The beta->Uani->Bani->Uiso->Biso pathway David describes is not correct, and
Doug makes an attempt to clarify (and gets closer). The actual calculation
pathway is

    |->Uani
    |->Bani
Beta|
    |->Uiso
    |->Biso

Which seems perfectly logical to me. dREL is a Turing complete language and
while there is one evaluation method in an item definition you can create a
multitude of paths to the answer - and that is a GOOD THING.

IDB Comment:
In the proof-of-principle dictionary this branching depends on the value 
of .adp_type which may not be present (it has no method in the 
proof-of-principle dictionary and an enumeration default of Uiso which 
could be incorrect).  It would be better if branching were based on an 
ordered test for the existence of an item, thus if Uaniso is not 
present, look for Baniso.  However, as Doug points out, a search for the 
existence of Uaniso would have to allow Uaniso the opportunity to 
generate itself from its individual elements.  I am not sure how this is 
done in practice but that is a problem for the dREL implementation.

There seems to be a consensus here that branching is built into dREL and 
is desirable in a definition.  It is not clear if this has to be 
achieved using if-then constructs or an ordered loop defining the 
different branch methods.  If one goes for the if-then construction, how 
does one ensure the tested item is present, or can one make provision 
for a default procedure if it is not?  If the definitions form a tree 
(rather than a network with closed loops) it should not be possible to 
get stuck in a loop as mentioned below by DD.
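
The ordered test for existence discussed above might look something like
the following Python sketch.  It is not dREL and does not describe any
existing implementation: the resolve() function, the layout of the
methods table and the three-element stand-in for the ADP matrices are
all assumptions made for illustration.  Each item has an ordered list of
branches, a branch is skipped if any of its inputs cannot be obtained,
and a set of items currently being evaluated guards against going round
a closed loop.

```python
import math

def resolve(name, data, methods, _active=None):
    """Return the value of name, from data if present, else by the
    first branch in methods[name] whose inputs can all be resolved."""
    if name in data:
        return data[name]
    active = _active if _active is not None else set()
    if name in active:
        raise LookupError(f"circular derivation of {name}")
    active.add(name)
    try:
        for inputs, func in methods.get(name, []):
            try:
                args = [resolve(i, data, methods, active) for i in inputs]
            except LookupError:
                continue  # an input is missing: try the next branch
            return func(*args)
    finally:
        active.discard(name)
    raise LookupError(f"no way to obtain {name}")

EIGHT_PI_SQ = 8 * math.pi ** 2

# Uaniso first tries to build itself from its elements, then from Baniso.
methods = {
    "Uaniso": [
        (["U11", "U22", "U33"], lambda a, b, c: [a, b, c]),
        (["Baniso"], lambda b: [x / EIGHT_PI_SQ for x in b]),
    ],
}
data = {"Baniso": [EIGHT_PI_SQ * 0.01, EIGHT_PI_SQ * 0.02, EIGHT_PI_SQ * 0.03]}
u_values = resolve("Uaniso", data, methods)
```

In this arrangement the test for a missing item is made once, in the
resolver, rather than in every method.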

HIDING THE INTERMEDIATE ITEMS

 >POSSIBLE SOLUTIONS from IDB's original discussion paper
 >> Tree 3

    beta -> Uaniso -> Baniso -> Uiso-> Biso

 >> can be made to work if the beta form can be made
 >> invisible.  It cannot be completely invisible as it must appear in the
 >> appropriate CIF dictionary, and its text description will be displayed
 >> by any CIF editor such as publCIF or enCIFer.
 >>
 >> One possible solution is to include a flag in the dictionary definition
 >> to indicate that the item should be hidden from the user or deleted
 >> after the calculation is complete.
 >>
 >> A second possibility is to give the item a dataname that disguises its
 >> identity, e.g., a name such as _atom_site_aniso.intermediate1. The
 >> dictionary would contain the .description 'This item is an
 >> intermediate in an ADP calculation and is not to be used for archival
 >> or retrieval purposes'.
 >>
 >> A third solution would be to rearrange the method for calculating the
 >> structure factors so that it works directly with Uaniso and does not
 >> generate beta as an intermediate.  In this case there is no need to
 >> define beta in the dictionary.

DD: I would tend towards the first solution, but with multiple evaluation
strategies (i.e. loop_ed), combined with dREL software that checks
for multiple iterations around an evaluation path and which falls back
to an alternative if it exists, as well as the flag to say don't print
this in a result CIF (and possibly hide this in an editor!).

James Hester (JH)
I agree with IDB's diagnosis of the problem, and, rather than clutter
the dictionary with unnecessary baggage as solutions 1 and 2 do, I
would suggest a variant of your 3rd solution:

Solution 4: As beta is used primarily for calculational convenience, a
dREL function 'beta()' is defined which calculates the beta value.
The structure factor calculation is rewritten to call this function.
A given dREL implementation can choose to cache values returned by
this function to improve efficiency.
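
A rough Python illustration of Solution 4 follows.  It is not dREL, and
the signatures are assumptions: beta() is taken to receive the U matrix
and the reciprocal cell lengths as tuples, and to use the relation
beta_ij = 2 pi^2 a*_i a*_j U_ij, with the temperature factor summed over
all nine i,j terms.  Caching is done here with functools.lru_cache, one
of several ways an implementation might choose.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def beta(u, recip):
    """beta_ij = 2 pi^2 a*_i a*_j U_ij; u is a 3x3 tuple of tuples and
    recip is (a*, b*, c*), both hashable so the result can be cached."""
    two_pi_sq = 2 * math.pi ** 2
    return tuple(
        tuple(two_pi_sq * recip[i] * recip[j] * u[i][j] for j in range(3))
        for i in range(3)
    )

def temperature_factor(hkl, u, recip):
    """Calls beta() instead of reading a stored beta item."""
    b = beta(u, recip)
    exponent = sum(b[i][j] * hkl[i] * hkl[j]
                   for i in range(3) for j in range(3))
    return math.exp(-exponent)

# Illustrative values: cubic cell with a = 10 A, isotropic U = 0.05 A^2.
u = ((0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (0.0, 0.0, 0.05))
recip = (0.1, 0.1, 0.1)
t100 = temperature_factor((1, 0, 0), u, recip)
t200 = temperature_factor((2, 0, 0), u, recip)  # beta() served from cache
```

Beta never acquires a dataname in this arrangement, so there is nothing
to export or to hide.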

In [David's] first solution invisibility can only be maintained if 
everyone respects the new DDLm attribute that will be created to flag 
it.  The value [of the item] will leak out.

The second solution is better than the first solution, but I believe it 
is an unnecessary cluttering of the dictionary.

Herbert Bernstein (HB)
I like the idea of defining useful functions in the dictionary that
will not in and of themselves generate tags, but we need to
provide some control over scope and namespaces to make it easy to
combine useful functions from multiple dictionaries -- perhaps
adopting python's module based dotted notation to resolve
conflicts.

JH: Yes.  Currently all dREL functions (and one hopes builtin functions in
the final standard) belong to DDLm category 'Function'.  We could
usefully sketch out a hierarchy of subcategories here when putting
together the final DDLm standard, or alternatively/additionally the
standard DDLm importation mechanism when importing other dictionaries
could resolve name conflicts.

NS: Correct, any function definition can be handled by the "functions" 
category.

A quick response to David's original post is "there is no identifiable
problem" in what David has written.

The interim data items don't have to be exported. Many of them are actual
legitimate crystallographic objects, like the cell vectors rather than
their scalar dimensions, and hence I personally believe they should be
part of the dictionary. They appear in the output of our prototype parser
only because I am too lazy to clean it up; I simply dump the in-memory
Python dictionary.

This aspect of what David sees as a problem can be made to go away by using
DDLm's import facility. That is, the parser reads in the core dictionary
(with only the data items David/community would like to see in a submission
file) and imports a "fuller" dictionary to handle everything. On output the
parser can be restricted to only exporting the data items in the core
dictionary. Problem solved. The user would never see or know about the extra
data items.
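
The restriction on output could be as simple as the following Python
sketch, which assumes the parser holds all values, stored or derived, in
one in-memory dictionary and knows which datanames the core dictionary
defines.  The function name, the datanames and the output layout are
illustrative only and are not taken from the prototype parser.

```python
def export_cif(values, core_names, block="example"):
    """Return CIF-like text holding only the items in core_names."""
    lines = [f"data_{block}"]
    for name, value in values.items():
        if name in core_names:  # items from the fuller dictionary stay behind
            lines.append(f"{name}  {value}")
    return "\n".join(lines)

# In-memory values after evaluation: one stored item, one derived item.
values = {
    "_cell.length_a": 5.0,
    "_cell.vector_a": [5.0, 0.0, 0.0],
}
core_names = {"_cell.length_a"}
text = export_cif(values, core_names)
```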

James' idea of creating functions would work also, but there are two quite
different classes of items here. The first are truly library/utility
functions, like those that split the ORTEP-like object 2_567 into a symmetry
pointer 2 and a cell displacement vector [0, 1, 2]. That is a function.
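
A utility function of this kind might be sketched in Python as follows.
The name split_symop is hypothetical, and the sketch assumes the usual
convention that each of the three digits is the cell translation plus 5.

```python
def split_symop(code):
    """Split a code such as '2_567' into (symmetry pointer, displacement)."""
    symop, _, shifts = code.partition("_")
    if not shifts:
        return int(symop), [0, 0, 0]  # no translation part given
    if len(shifts) != 3 or not shifts.isdigit():
        raise ValueError(f"malformed symmetry code: {code!r}")
    return int(symop), [int(digit) - 5 for digit in shifts]

pointer, displacement = split_symop("2_567")  # -> 2 and [0, 1, 2]
```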

The other type of items are legitimate crystallographic items that merit
definition and should not be obfuscated in code. For instance the cell
displacement vector is a legitimate item and merits definition.
Crystallographically [0, 1, 2] is meaningful whereas _567 is actually
syntactic rubbish - albeit popular syntactic rubbish.

You may insist on only seeing _567 but to deny the ability to define a truly
crystallographic object like [0, 1, 2] is not sensible. Especially when it
can be hidden using an importing functionality.

The solution I describe by using importation will solve any perceived
problem and is the very basis on which DDLm had an importing functionality
created.

IDB's PROPOSED SOLUTION
Branching will be allowed in definitions, though a tree structure would 
be necessary with items arranged in a hierarchy to ensure that loops 
are not created and that all paths end in experimental or assigned items 
such as cell dimensions or symmetry operations.  The application of this 
will require care to respect the crystallographic integrity of the 
definitions, e.g., if U is calculated from B there should be no route by 
which B can be calculated from U.  The question remains how this is 
implemented: by if-then or loop construction.  I welcome advice on the 
best way to include multiple methods without having to test a 
potentially missing item.
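
The tree requirement could be checked when a dictionary is loaded.  The
Python sketch below assumes that the items each method reads can be
listed as a simple mapping (how that list would be extracted from dREL
text is not addressed here) and uses a depth-first search to report the
first closed loop it finds.  All names are illustrative.

```python
def find_cycle(depends_on):
    """Return a list of items forming a closed loop, or None if the
    dependencies form a tree (or any other loop-free graph)."""
    IN_PROGRESS, DONE = 1, 2
    state = {}
    path = []

    def visit(item):
        state[item] = IN_PROGRESS
        path.append(item)
        for dep in depends_on.get(item, ()):
            if state.get(dep) == IN_PROGRESS:
                return path[path.index(dep):] + [dep]  # loop closed
            if dep not in state:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[item] = DONE
        return None

    for item in depends_on:
        if item not in state:
            found = visit(item)
            if found:
                return found
    return None

# U from B and B from U together form exactly the loop to be avoided.
bad = {"Uaniso": ["Baniso"], "Baniso": ["Uaniso"]}
good = {"Baniso": ["Uaniso"], "Uaniso": ["U11", "U22", "U33"]}
loop = find_cycle(bad)
no_loop = find_cycle(good)
```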

Various solutions for the treatment of intermediate items have their 
supporters, including flagging them, eliminating them in favour of 
functions, or segregating them into a 'derived-items' dictionary.  It is 
clear that most intermediates are good crystallographic items that would 
not be out of place in the CIF output, but leaving aside the special 
problem of beta, there would be a danger of cluttering up the CIF with 
large numbers of derived items, many of them duplicating in a different 
format information already present.  A further danger is one that 
Herbert pointed out in Osaka.  He suggested that DDLm CIF dictionaries 
would make CIF more dynamic compared with the static character of CIF1 
and CIF2, and if some experimental or assigned value (e.g., the cell 
constants) were changed there would be no way we could ensure that all 
the derived items would be automatically updated. 

The relevant items in DDLm dictionaries can be classified as either 
derived or experimental (or assigned) (ignoring those used for data 
management and description which do not concern us).  The derived items 
will all have methods; in principle the experimental values will not 
have methods.  Following Nick's suggestions, we could arrange that the 
derived items are placed in a 'derived-items' dictionary, while those 
giving the experimental and assigned values, such as cell constants and 
space group symmetry, would appear in an archive dictionary. 

This represents an important change in the way we handle and think about 
CIF: the basic experimental information would be found in an archival 
CIF, but reading this CIF under DDLm dictionary control would allow any 
desired derived items to be retrieved as if they had been originally in 
the CIF.  These requested items could be passed to a user program or 
exported as a CIF, though the default exported CIF would contain only 
items from the archival CIF.  Provision would be needed to allow derived 
items already in the archival CIF to be optionally retained rather than 
recalculated.  Derived items currently present in CIF such as cell 
volume and calculated density would appear in the derived-items 
dictionary.  The derived-item dictionary would contain a rich supply of 
derived items, e.g., a person requiring a set of bond vectors could 
retrieve these by supplying the labels and symops of the terminal atoms, 
which in turn might be generated by an external program so as to include 
all the bond distances of interest to the user.  The derived-items 
dictionary could include a definition of the beta matrix since this 
might be required by a user program designed to transform cell settings, 
but being in the derived-items dictionary there would be little 
temptation to use it archivally.  Editors such as publCIF which are 
designed to help in producing archival CIFs would not need to import the 
derived-items dictionary.
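
As a rough illustration of retrieval under dictionary control, the
Python sketch below treats the archival CIF as a mapping of datanames to
values and the derived-items dictionary as a mapping of datanames to
functions.  The get_item() function and its keep_archived flag are
hypothetical, and the cell volume stands in for any derived item.

```python
import math

def cell_volume(cif):
    """V = abc * sqrt(1 - cos^2(al) - cos^2(be) - cos^2(ga)
                        + 2 cos(al) cos(be) cos(ga))."""
    a, b, c = (cif[f"_cell.length_{x}"] for x in "abc")
    ca, cb, cg = (math.cos(math.radians(cif[f"_cell.angle_{x}"]))
                  for x in ("alpha", "beta", "gamma"))
    return a * b * c * math.sqrt(
        1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)

DERIVED = {"_cell.volume": cell_volume}

def get_item(archive, name, derived=DERIVED, keep_archived=True):
    """Return an item as if it had been in the archival CIF.  A derived
    item already present is retained or recalculated as requested."""
    if name in archive and (keep_archived or name not in derived):
        return archive[name]
    if name in derived:
        return derived[name](archive)
    raise KeyError(name)

archive = {
    "_cell.length_a": 5.0, "_cell.length_b": 6.0, "_cell.length_c": 7.0,
    "_cell.angle_alpha": 90.0, "_cell.angle_beta": 90.0,
    "_cell.angle_gamma": 90.0,
    "_cell.volume": 200.0,  # deliberately stale value left in the archive
}
retained = get_item(archive, "_cell.volume")
recalculated = get_item(archive, "_cell.volume", keep_archived=False)
```

Recalculating on request also addresses the point Herbert raised in
Osaka, since a derived value produced this way cannot be out of date
with respect to the cell constants it was read alongside.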

Of course we could also make more use of functions as suggested by James 
and Herbert if there are no issues with importing functions, but only 
when the intermediate item was not a potentially useful derived item.

As is well known, the devil is in the details, and in adapting the CIF 
dictionaries I will have to make many decisions in matters of detail.  
But if there is agreement on splitting the dictionaries into archive and 
derived-item dictionaries as described above, I will have a guideline to 
work with.  I will undoubtedly come back with other problems in the 
future but this split seems to be in the spirit of DDLm and appears to 
solve a number of important problems.

Is this plan acceptable to everyone?  If so, I will start to apply it 
and move this discussion on to the next problem.

David Brown
