CIF Infoset

Mon Aug 30 16:15:17 BST 2004

Here are a few more comments from IDB:

>So how do you intend to get around this namespace issue? No CIFs that I 
>have encountered have ever declared their conformance to any dictionary.
>Even if they did, there is something called the dictionary stacking 
>protocol 
>which allows those definitions to be overridden without declaring a 
>namespace.
>On top of that there is the boundless capacity for making up your own
>data names on the fly for which there may never be any dictionary 
>definition
>at all. How can you reliably assign anything but a generic namespace to an 
>infoset? It's all just ad hoc guesswork.
>
The core dictionary defines three items which can be looped:
    _audit_conform_dict_name
    _audit_conform_dict_version
    _audit_conform_dict_location        # Contains the URL where the dictionary can be found
As far as I know these have not been widely used - Acta Cryst. should 
start insisting that these be included in submitted papers.  There is no 
need to give the dictionary version in anything as ephemeral as a comment.
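
As a rough illustration of how software might use these three items, here 
is a minimal Python sketch. It assumes the loop has already been parsed 
into a list of dicts keyed by data name; the function names and the 
"generic" fallback label are hypothetical, not part of any CIF library.

```python
def declared_dictionaries(loop_rows):
    """Collect (name, version, location) from parsed _audit_conform_dict_* rows.

    loop_rows is assumed to be a list of dicts, one per loop row.
    Rows whose dictionary name is missing, '?' or '.' are skipped.
    """
    found = []
    for row in loop_rows:
        name = row.get("_audit_conform_dict_name")
        if name in (None, "?", "."):
            continue
        found.append((
            name,
            row.get("_audit_conform_dict_version"),
            row.get("_audit_conform_dict_location"),
        ))
    return found


def namespaces_for(loop_rows, generic="generic"):
    """Return the declared dictionary names, or a generic fallback if none."""
    names = [name for name, _version, _location in declared_dictionaries(loop_rows)]
    return names or [generic]
```

A CIF that declares nothing simply falls back to the generic namespace, 
which is the situation described in the question above.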

># start Validation Reply Form
>_vrf_DIFF020_114
>;PROBLEM: _diffrn_standards_interval_count and
>RESPONSE: ... We have used an image-plate system
>;
>
>If intelligent software was ever intended to deal with such _vrf_s, why 
>embed the only pointer to their purpose in supposedly non parsable data 
>names rather than  in looped, discrete sets of tags such as 
>
>loop_
>    _vrf_suite _vrf_subroutine _vrf_error_code _vrf_authors_response
>
This would tidy things up, but the parser must be able to handle ad hoc 
data names without choking.
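
A parser that meets a data name like the one quoted above can at least 
split it into parts without knowing what they mean. The sketch below is 
only a guess at a workable pattern: it assumes the name has the shape 
_vrf_<code>_<suffix>, and the keys "code" and "suffix" are my own labels.

```python
import re

# Assumed shape: _vrf_<alphanumeric code>_<anything else>
VRF_NAME = re.compile(r"^_vrf_([A-Za-z0-9]+)_(.+)$")


def split_vrf_name(data_name):
    """Split an ad hoc _vrf_ data name; return None for any other name."""
    match = VRF_NAME.match(data_name)
    if match is None:
        return None
    return {"code": match.group(1), "suffix": match.group(2)}
```

Names that do not match are returned as None, so the caller can fall back 
to treating them as ordinary, undefined data names.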

>>>>Q Is the order of "rows" in a loop_ unimportant? 
>>>>        
>>>>
>>>Yes (in CIF).
>>>      
>>>
>>That is very useful (and non-obvious from the spec). It then makes it
>>possible to confirm the identity of two sets of coordinates, symmetry
>>operations, etc.
>>
>>It is also debatable. 
>>The very recent introduction of _symmetry_equiv_pos_site_id means that
>>the data integrity of the majority of prior archived CIFs containing tag 
>>values like:    _geom_bond_site_symmetry_1  "4_564"
>>would be seriously impaired by a change of order in the 
>>loop_  _symmetry_equiv_pos_as_xyz
>>
This was a serious omission in the first version of CIF (you have to 
remember that this was produced before we even considered writing 
dictionaries in STAR format).  As you point out, we have introduced the 
list reference _symmetry_equiv_pos_site_id (which incidentally has now 
been superseded by _space_group_symop_id taken from the symmetry_cif 
dictionary - a dictionary which takes a more systematic and 
forward-looking approach to symmetry).  Again, Acta Cryst. should insist 
on the inclusion of these ids.
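
To show why an explicit id makes the row order irrelevant, here is a 
small Python sketch. It assumes the usual n_klm form of a site symmetry 
code, where n is the operation id and 555 means no cell translation, and 
it assumes the symmetry loop has been parsed into (id, xyz) pairs. The 
function names and the sample operations are illustrative only.

```python
def parse_site_symmetry(code):
    """Split 'n_klm' into (symop_id, (tx, ty, tz)); a bare 'n' means 555."""
    symop_id, _, cell = code.partition("_")
    if not cell:
        cell = "555"
    if len(cell) != 3 or not cell.isdigit():
        raise ValueError(f"unrecognised symmetry code: {code!r}")
    # Each digit is a cell translation offset by 5, so '564' is (0, +1, -1).
    return symop_id, tuple(int(digit) - 5 for digit in cell)


def resolve(code, symop_rows):
    """Look the operation up by its id, never by its position in the loop."""
    by_id = {str(op_id): xyz for op_id, xyz in symop_rows}
    symop_id, shift = parse_site_symmetry(code)
    return by_id[symop_id], shift
```

With the ids present, reversing or shuffling the rows of the symmetry 
loop gives the same answer for "4_564"; without them, the code can only 
be resolved by counting rows.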

>I had a hazy recollection that  "this is a string" and   this_is_a_string   
>were equally valid CIF constructs containing identical information 
>content, 
>used for example in space group names. Would they be formally identical in 
>an infoset? Does the white space in all strings have to be normalised (is 
>that the right word?)?
>
We had a discussion of this point while preparing the symmetry_CIF 
dictionary and came to the decision that these two strings were not 
equivalent, i.e., underscore is not white space.  For that reason  
P_21/c is no longer regarded as a valid space group symbol although 
there is a warning that some heritage CIFs may use that convention.  
There is an enumeration list for _space_group_name_H-M_ref which 
explicitly allows only 'P 21/c'.  Other space group symbols are 
similarly defined.
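
A check along these lines could be sketched as follows. The set below 
holds only the one symbol mentioned here, not the real enumeration list, 
and the three result labels are hypothetical.

```python
# Illustrative one-entry subset of the enumeration, not the full list.
ALLOWED_HM_REF = {"P 21/c"}


def classify_hm_ref(symbol):
    """Classify a symbol as 'valid', 'heritage' (underscore form) or 'invalid'."""
    if symbol in ALLOWED_HM_REF:
        return "valid"
    # Underscore is not white space, so this form is not equivalent;
    # it is only flagged so that older CIFs can be warned about.
    if symbol.replace("_", " ") in ALLOWED_HM_REF:
        return "heritage"
    return "invalid"
```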

>Would 1.2(2) and 1.3(2) be equivalent in an infoset? Lexically they are 
>different, but semantically they are the same value, within error.
>
They are not semantically the same, though they are not (scientifically) 
significantly different.  The distinction is important.

>>>>The difficulty is not preserving the data type, but the semantics of
>>>>downstream decisions. If one author writes _my_phone "123-45678"
>>>>they are announcing this is not a number while if another writes
>>>>_my_phone 123-45678 they are announcing it is a number.
>>>>
>>>> The
>>>>discussion so far seems to suggest that these statements overrule
>>>>the datatypes specified in the dictionary entries. There is a
>>>>particular problem in loop_s, where it is then possible to have
>>>>different data types within a column:
>>>>
>>>>loop_ _atom_site_occupancy
>>>>1.0
>>>>0.3
>>>>"not refined"
>>>>"0.3"
>>>>"."
>>>>
>>>>which makes the implementation very difficult. I believe that a
>>>>programmer should be able to look up the data type in the dictionary
>>>>entry and write a routine that relies on a value being of the
>>>>correct data type and throws an exception if not.
>>>>        
>>>>
>
>  
>
It is much worse than this.  There is a definition of what constitutes a 
number in DDL1 but it is given only as text and that only by way of 
examples (which incidentally do not include 123-45678).  The examples 
may not be intended as an exhaustive list, but no other guidance is 
given.  DDL2 is both better and worse since, although numbers are 
defined in terms of regular expressions, each dictionary defines its own 
set of data types and there appears to be no limit on how many data 
types are defined.  It sounds to me as if all values should be treated as 
data strings unless a dictionary is used and the appropriate data types 
defined in the infoset.  Then some means is needed to preserve these 
types (if possible) in any realization of the infoset, e.g., by writing 
them in XML or a different version of CIF.  In any case DDL1 certainly 
needs to tighten up its definition of a number if typing is going to be 
important.
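
A sketch of this policy in Python follows. Everything stays a string 
unless a dictionary type is supplied; with the DDL1 type "numb" the value 
must match a number pattern or an exception is raised. The regular 
expression is my own guess at a reasonable pattern, not the DDL1 
definition, and ignoring the quoting of otherwise valid numbers is an 
assumption of the sketch.

```python
import re

# Assumed number pattern: optional sign, decimal, exponent and (su).
NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


def typed_value(raw, was_quoted, dict_type=None):
    """Interpret one raw CIF value according to an optional dictionary type."""
    if dict_type is None:
        return raw  # no dictionary in use: every value is a data string
    if raw in ("?", ".") and not was_quoted:
        return None  # unquoted placeholders carry no value
    if dict_type == "numb":
        if NUMBER.match(raw) is None:
            raise ValueError(f"{raw!r} is not a number")
        return float(raw.split("(")[0])  # drop any (su) before converting
    return raw
```

Applied to the occupancy column quoted above with type "numb", the 
entries 1.0, 0.3 and "0.3" convert, while "not refined" and the quoted 
"." raise an exception, as does 123-45678.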

Good luck!

David

-- 
Dr. I.D.Brown, Professor Emeritus,
Department of Physics and Astronomy
McMaster University, Hamilton
Ontario, Canada
