CIF Infoset

Herbert J. Bernstein yaya at bernstein-plus-sons.com
Sat Sep 4 10:29:10 BST 2004


I suspect this discussion is starting to sound like "how many angels
can dance on the head of a pin" to some people.  My apologies.  For
most people this discussion _is_ irrelevant.  If you are simply
preparing a small molecule CIF for submission to Acta using the
tags needed for journal publication, you don't really need to
do anything different than what you are doing.

However, there are some people trying to understand if they should
be using CIF or XML or something else as a general data framework
for the creation of other sorts of documents.  For them it is very
important to understand the interaction between "normal" CIF files
that use tags from some standard CIF dictionary and documents using
"made-up" tags that are not (yet) in any official, agreed dictionary.
It is also important for them to understand if they can fiddle
a bit with the standard CIF rules (such as order-independence) and
do something different, and still have a CIF that other people might
be able to handle.


Let us begin with some basics:

1.  Implicit sub-text question:  Shouldn't we really be using XML?  After
all everyone else is doing it.  It has well-defined "name spaces",
does not impose order independence, etc.

Everyone is entitled to their opinion, but as a practical matter, if what
you are doing is populating databases, it is fairly easy to move back
and forth among CIF, XML and various internal database representations.
If what you are doing is marking up documents for publication, XML
is easier to use, but, if you don't have the equivalent of the discipline
imposed by CIF (order independence, well-defined tags) you are going
to have to re-invent it for documents that need to populate standard
templates, such as structure reports.  Any CIF document has a
simple, direct translation into an XML document.  Going the other way
is harder because arbitrary XML documents do not necessarily have
clearly defined tags, and may need to have order preserved, and
may need to have complex trees translated into tables.  Doing these
things is not difficult.  However, we need to have discussions like
this one to help the community come to agreement on how best to do
that.
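
As a rough illustration of the "simple, direct translation" from CIF
to XML, here is a minimal Python sketch.  It assumes the CIF has already
been parsed into plain Python structures; the element names "datablock",
"item", "loop" and "row" are made up for this example and are not taken
from any agreed schema.

```python
import xml.etree.ElementTree as ET

def block_to_xml(block_name, items, loops=()):
    """Translate one parsed data block into an XML element.

    items: dict mapping tag -> value (single tag-value pairs)
    loops: iterable of (tags, rows), one entry per loop_
    Element names are hypothetical, chosen only for illustration.
    """
    root = ET.Element("datablock", {"name": block_name})
    for tag, value in items.items():
        item = ET.SubElement(root, "item", {"tag": tag})
        item.text = value
    for tags, rows in loops:
        loop = ET.SubElement(root, "loop")
        for row in rows:
            row_el = ET.SubElement(loop, "row")
            for tag, value in zip(tags, row):
                cell = ET.SubElement(row_el, "item", {"tag": tag})
                cell.text = value
    return root

example = block_to_xml(
    "example",
    {"_cell_length_a": "5.959"},
    loops=[(("_atom_site_label", "_atom_site_fract_x"),
            [("C1", "0.1"), ("O1", "0.2")])],
)
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

Each tag becomes an attribute of an element, so nothing in the XML
depends on element order.  Going from arbitrary XML back to CIF has no
such one-to-one recipe, which is the harder direction described above.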

2.  Is order-independence in CIF really necessary?

Yes and no.  Clearly one could define a CIF-like language that differed
from CIF in allowing multiple uses of the same tag and in requiring
the ordering of tag-value pairs to be preserved.  However, that would
then make the publication of structure reports more difficult.  So,
in creating CIFs for use in the publication process, order-independence
is a good idea.  In addition, when you do need to provide order-dependent
data, such as atoms in a particular sequence, all you need to do
is to add a column to your table that holds the ordinal position
of each row.  That may seem like a nuisance,
but it is easy to hand off that nuisance to a bit of software, and
most crystallographers are fairly adept at dealing with numbers
anyway.  That being said, I for one think we should have a flavor
of CIF that allows XML-like order dependence and repeated tags,
and an agreed protocol to translate between order-independent
CIFs and order-dependent CIFs.
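
The ordinal-column idea can be sketched in a few lines of Python.  This
is only an illustration under stated assumptions: a loop is represented
as a tuple of tags plus a list of row tuples, and the tag name
"_example_ordinal" is invented here and comes from no dictionary.

```python
import random

def add_ordinal(tags, rows, ordinal_tag="_example_ordinal"):
    """Prepend a column holding each row's 1-based position."""
    new_tags = (ordinal_tag,) + tuple(tags)
    new_rows = [(i,) + tuple(row) for i, row in enumerate(rows, start=1)]
    return new_tags, new_rows

def restore_order(tags, rows, ordinal_tag="_example_ordinal"):
    """Recover the intended sequence from rows stored in any order."""
    col = tags.index(ordinal_tag)
    return sorted(rows, key=lambda row: row[col])

tags = ("_atom_site_label", "_atom_site_type_symbol")
rows = [("N1", "N"), ("C1", "C"), ("C2", "C"), ("O1", "O")]

ord_tags, ord_rows = add_ordinal(tags, rows)

# An order-independent reader or writer may shuffle the rows freely.
shuffled = ord_rows[:]
random.shuffle(shuffled)

recovered = restore_order(ord_tags, shuffled)
```

Because the sequence is now carried as data, the loop can be stored or
transmitted in any row order without losing the intended ordering.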

3.  What do multiple data blocks really mean?

Within a given data block a given tag may only be used once.  It is
legal to use that same tag again in another data block.  This is
a convenient way around the order independence in CIF, but in
practice, if a CIF represents a single structure report, you
are not going to want to do that, since it would produce
confusion as to which version of the data for which tag you
wish to include in your paper.  For example, you might have
prepared your biblio in one data block and your coordinates
in another.  You may have only one common tag between the
two data blocks -- something to help you keep the two data blocks
associated with the same study, but except for removing the
duplicate tag, you would have the same information if you
merged the two data blocks in any order, including shuffling
them together.  Alternatively, you might take some huge data block
and break it up into several smaller ones to help make a neater,
more readable and organized file.
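
The merging of data blocks described in point 3 can be sketched as
follows.  This is a minimal Python illustration, assuming each data
block has been reduced to a dict of tag-value pairs; the linking tag
"_example_study_id" is hypothetical.

```python
def merge_blocks(*blocks):
    """Merge data blocks (dicts of tag -> value) into one block.

    A tag shared by several blocks is kept once if the values agree.
    Conflicting values raise an error, since it would be unclear
    which version of the data was meant.
    """
    merged = {}
    for block in blocks:
        for tag, value in block.items():
            if tag in merged and merged[tag] != value:
                raise ValueError(f"conflicting values for {tag}")
            merged[tag] = value
    return merged

biblio = {"_example_study_id": "S1",
          "_publ_contact_author_name": "Fred"}
coords = {"_example_study_id": "S1",
          "_cell_length_a": "5.959"}

# The merged content is the same whichever block comes first.
assert merge_blocks(biblio, coords) == merge_blocks(coords, biblio)
```

The error branch corresponds to the confusing case mentioned above,
where the same tag carries different values in different data blocks.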


Now to the namespace/dictionary question


I. David Brown has pointed out that CIF has long had a set of tags
for specifying dictionary conformance, and has suggested that
we should require more formal use of those tags to help
readers understand what namespace is being used.  DDB then asks
about the precise mechanisms for using these tags in multiple
data block CIFs and what to do when multiple dictionaries are
involved, especially in deciding the order in which to apply
the dictionaries to avoid conflicts.

I must emphasize that for most people this is not an issue.
Even if you are drawing tags from multiple CIF standard dictionaries,
your document is remarkably unambiguous because, for the official
dictionaries, COMCIFS works to avoid overlap and duplicate use
of the same tags, or the use of two different tags for the
same concept.  The major exception is the replication of
the Core dictionary in the mmCIF dictionary using slightly
different tags.  The two dictionaries are kept aligned with
an "alias" mechanism, and most users do not have to worry about
a conflict.

For people working with their own locally-defined tags, however,
this is an interesting question.  There is a detailed protocol
for "layering" of dictionaries (including ones created locally)
( http://www.iucr.org/iucr-top/cif/spec/dictionaries/maintenance.html ).
Clearly it is time to instantiate this protocol in software, and,
as part of a software upgrade project for the IUCr we will be
doing that.  It would be very helpful if those who have an interest
in this subject would read the protocol and provide their
comments and suggestions for improvement, so that the software
we are writing will be useful to as many people as possible.

Regards,
   Herbert



At 6:43 PM +0900 9/4/04, ddb at owari.msl.titech.ac.jp wrote:
>Hi
>
>>  Here are a few more comments from IDB:
>>  >So how do you intend to get around this namespace issue? No CIFs that I
>>  >have encountered have ever declared their conformance to any
>dictionary.
>>  >Even if they did, there is something called the dictionary stacking
>>  >protocol
>>  >which allows those definitions to be overridden without declaring a
>>  >namespace.
>>  >On top of that there is the boundless capacity for making up your own
>>  >data names on the fly for which there may never be any dictionary
>>  >definition
>>  >at all. How can you reliably assign anything but a generic namespace to
>an
>>  >infoset? Its all just adhoc guesswork.
>>
>>  The core dictionary defines three items which can be looped:
>>      _audit_conform_dict_name
>>      _audit_conform_dict_version
>>      _audit_conform_dict_location        # Contains the URL where the
>>  dictionary can be found
>>  As far as I know these have not been widely used - Acta Cryst. should
>>  start insisting that these be included in submitted papers.  There is no
>>  need to give the dictionary version in anything as ephemeral a comment.
>
>
>That sounds like a positive step, but would that go in every data_block or
>is it a global_ thing?
>
>You may need to add something like _audit_conform_dict_stacking_order
>to ensure looped dictionaries of  symmetry overriding core don't get
>confused with core overriding symmetry, for example, (assuming loop order
>is not significant?) if that is possible?
>
>The problem I see is that the effort invested in implementing it for all
>newly created and submitted CIFs is wasted because it is an
>incomplete solution and no current software uses it or needs it.
>
>You still have to deal with existing archives of CIF which don't state
>their conformance, and even for CIFs that  do, users are free to
>conjure up any ad hoc data names they like and use them in any context.
>
>So, to try and resolve the namespace of each name, you would need to
>(1) check the _audit_conform list of dictionaries in reverse order
>(2) check against the list of registered prefixes for accidental matches
>(3) check all versions of all publically accessible dictionaries
>(4) then give up.
>
>Not an efficient process if there was a match and  no guarantee that
>it was a correct match if names were reused in different
>contexts in different dictionaries. Two simple things would fix that.
>Associating a distinguishable prefix on each name with the _audit_conform
>stuff and banning ad hoc data names.
>
>Anything else and you will always be just guessing.
>I don't really know what you are hoping to achieve.
>
>>
>>  ># start Validation Reply Form
>>  >_vrf_DIFF020_114
>>  >;PROBLEM: _diffrn_standards_interval_count and
>>  >RESPONSE: ... We have used an image-plate system
>>  >;
>>  >
>>  >If intelligent software was ever intended to deal with such _vrf_s, why
>>  >embed the only pointer to their purpose in supposedly non parsable data
>>  >names rather than  in looped, discrete sets of tags such as
>>  >
>>  >loop_
>>  >    _vrf_suite _vrf_subroutine _vrf_error_code _vrf_authors_response
>>
>>  This would tidy things up, but the parser must be able to handle ad hoc
>>  data names without choking.
>
>
>If its important enough to create a name for it then isn't it important
>enough
>define its purpose somewhere? Ad hoc data names seem to provide
>nothing useful besides a legitimate excuse for laziness in the
>specification. Theres no incentive to organize things tidily.
>Maybe they were important originally when COMCIFS were exploring
>the field, before dictionaries were introduced, but is it still important
>to be able to make up arbitrary stuff and stick it in a CIF without
>definition?
>Who is doing this and how are they using it?
>Do they really intend to save it for posterity?
>
>
>
>>  >>>>Q Is the order of "rows" in a loop_ unimportant?
>>  >>>
>>  >>>Yes (in CIF).
>>  >>
>>  >>That is very useful (and non-obvious from the spec. It then makes it
>>  >>possible to confirm the identity of two sets of coordinates, symmetry
>>  >>operations, etc.
>>  >>
>>  >>It is also debatable.
>>  >>The very recent introduction of _symmetry_equiv_pos_site_id means that
>>  >>the data integrity of the majority of prior archived CIFs containing
>tag
>>  >>values like:    _geom_bond_site_symmetry_1  "4_564"
>>  >>would be seriously impaired by a change of order in the
>>  >>loop_  _symmetry_equiv_pos_as_xyz
>>
>>  This was a serious omission in the first version of CIF (you have to
>>  remember that this was produced before we even considered writing
>>  dictionaries in STAR format).  As you point out we have introduced the
>>  list reference _symmetry_equiv_pos_site_id (which incidentally has now
>>  been superceded by  _space_group_symop_id taken from the symmetry_cif
>>  dictionary - a dictionary which takes a more systematic and
>>  forward-looking approach to symmetry).  Again Acta Cryst. should insist
>>  on the inclusion of these id's.
>
>Would a statement of conformance to an older dictionary version be
>sufficient grounds to escape these CIF changes (just checking :-)?
>
>But I guess my original concern here was that order independence of loop_
>structures based on earlier, and possibly alternative dictionaries, as
>well as
>ad hoc looped data (maybe thats not important, but you never know...),
>is not assured in general, particularly for raw data in whatever form it
>takes
>(nmr? image CIF?).
>
>
>>  >I had a hazy recollection that  "this is a string" and  
>this_is_a_string
>>  >were equally valid CIF constructs containing identical information
>>  >content,
>>  >used for example in space group names. Would they be formally identical
>in
>>  >an infoset? Does the white space in all strings have to be normalised
>(is
>>  >that the right word?)?
>>
>>  We had a discussion of this point while preparing the symmetry_CIF
>>  dictionary and came to the decision that these two strings were not
>>  equivalent, i.e., underscore is not white space..
>
>Bummer. I know one program that needs changes made :-(
>
>But perhaps I could also draw your attention to this:
>       http://journals.iucr.org/services/cif/stdcodes.html#Appdx4.3
>as evidence that underscores do seem to be an
>officially sanctioned form of white space in uchar data types.
>
>
>And maybe I can raise another issue,  in the context of PMR's interest in
>data_global, would the following construct be legitimate:
>
>data_global
>    _publ_contact_author_name  "Fred"
>
>data_a
>    _import_data_from_block      global
>
># defined in an associated dictionary  as:
>data_import_data_from_block
>     _name                      '_import_data_from_block'
>     _category                  obscure_semantics
>     _type                        uchar
>     _definition
>;
>  Import all data from the named data_block into the current data_block
>Watch out for duplicate _data_element_names though!
>Also watch out for circular imports!
>;
>
>As far as I am aware there is nothing that restricts such semantics.
>Everything seems to be above board in terms of the CIF content.
>its just that a request for _publ_contact_author_name  from
>within data block data_a  seems destined to fail at the software
>access stage. Does that mean CIF conformant software can never be
>totally CIF conformant?
>
>
>Thanks for the response.
>Doug
>
>_______________________________________________
>comcifs mailing list
>comcifs at iucr.org
>http://scripts.iucr.org/mailman/listinfo/comcifs

-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya at dowling.edu
=====================================================


