CIF Infoset

Tue Sep 7 15:58:35 BST 2004

I am sorry to come into this discussion so late in the day: I'm
tied up in a number of local projects that are occupying most of my
attention. However, it raises a lot of interesting points.

First, it seems to me that machine validation of "semantic" content
is still at a very elementary stage. (Well, zerothly, I'm not even
sure that the meaning of "semantic" can be determined
unequivocally in this context.) I've spent some time thinking about
Peter's request that the use of "data_global" to carry common
information between blocks be either formalised or deprecated. All I
can come up with at present is to say that it's not part of the
published standard. Therefore, a software designer trying to write
a general application that will handle any CIF in a standard-compliant
way need treat "data_global" no differently from any other data
block. On the other hand, it's a useful convention that is shared
between Acta, CCDC and some others, and I would not wish to see it
expressly forbidden, unless and until a better and more reliable way
to achieve its purpose can be found.

My suggestion for the "better way" is to look into expanding the
AUDIT_LINK mechanism: this allows for statements of relationships
between data blocks. In the present core dictionary the category is
present, but the potentially useful field _audit_link_block_description
is a free-text one; the dictionary example is:
    loop_
    _audit_link_block_code
    _audit_link_block_description
       .             'discursive text of paper with two structures'
       morA_(1)      'structure 1 of 2'
       morA_(2)      'structure 2 of 2'

To be of real use in establishing relationships, there needs to be
an enumerated field that expresses all the relationships that one
can handle mechanically. I'd be interested in people's thoughts on
how feasible such a project would be.
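
To make the idea concrete, here is a minimal sketch in Python of what
an enumerated relationship field might look like to a program. The
relationship names, the BlockRelation class and the parse_links
function are all hypothetical: nothing of the kind exists in the core
dictionary, and the values shown are placeholders rather than a
proposal.

```python
from enum import Enum


class BlockRelation(Enum):
    """Hypothetical enumerated relationships between data blocks."""
    PUBLICATION_TEXT = "publication_text"
    STRUCTURE = "structure"
    SHARED_METADATA = "shared_metadata"


def parse_links(rows):
    """Map each block code to a BlockRelation.

    rows is an iterable of (block_code, relation_string) pairs, as they
    might be read from an AUDIT_LINK loop. An unrecognised relation
    string raises ValueError, which is the point of enumerating them.
    """
    return {code: BlockRelation(rel) for code, rel in rows}
```

The benefit over free text is simply that a program can reject or act
on a value without having to guess what the author meant.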

Peter asks whether the semantics implied by the conventional use of
the data_global block are the same as would be achieved by use of
the global_ mechanism. It is an interesting question. In practice
the data items used in the data_global block are a different set
from those in the structural data blocks: they describe the article
title, authors and comment, while the separate structure data blocks
contain the atom coordinates etc. There is no provision for trying
to interpret a case where say _publ_section_title (the title of a
paper) appears in data_global while a different value of
_publ_section_title appears in data_foo. This differs from global_,
where there are clear rules of inheritance and precedence.

On the other hand, global_ also enforces an order on the subsequent
data blocks, while data_global can be placed anywhere one
chooses. It is therefore much easier for relatively undisciplined
software to make use of the data_global mechanism in the current ad
hoc way.
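
As a rough illustration of the ad hoc convention, the sketch below
copies the items of a "global" block into every other block, wherever
that block happens to sit in the file. It assumes the CIF has already
been parsed into a dictionary of block name (without the data_ prefix)
to tag/value pairs; the function name merge_data_global is my own
invention. Because there is no rule of precedence, a clash is only
reported and the local value is left untouched.

```python
def merge_data_global(blocks):
    """Spread the items of the 'global' block over all other blocks.

    blocks maps block name -> {tag: value}. Returns (merged, conflicts),
    where conflicts lists (block_name, tag) pairs whose local value
    differs from the one in the global block.
    """
    # Locate the shared block; its position in the file is irrelevant.
    shared = {}
    for name, items in blocks.items():
        if name.lower() == "global":
            shared = items

    merged, conflicts = {}, []
    for name, items in blocks.items():
        if name.lower() == "global":
            continue
        combined = dict(items)
        for tag, value in shared.items():
            if tag in items and items[tag] != value:
                # No published rule says which value wins: just report it.
                conflicts.append((name, tag))
            else:
                combined[tag] = value
        merged[name] = combined
    return merged, conflicts
```

An implementation of global_ would differ in exactly the two respects
discussed above: it would have to respect the order of the blocks, and
it would resolve clashes by rule instead of reporting them.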

(It is worth bearing in mind that much of the implementation of CIF
arises from modifications to existing programs that were never
originally designed with an eye to the CIF data model. In that
sense the imposition of a very strict 'discipline' in writing and
even reading CIFs places a heavy burden on the maintainers of those
programs. This is not to criticise them for what - in scientific
terms - are immensely valuable programs. But it's quite likely that
many older programs, especially if written in Fortran, will struggle
ever to become fully CIF compliant. It's also worth observing that they
would likely struggle similarly to handle semantic-rich i/o in XML
or any other such representation.)

A couple of other topics: comments really should be considered as void
of semantic content in terms of the information designed to be
exchanged and archived in CIFs. If an application wishes to preserve
comments (out of politeness or conservatism) that's fine, but the
application writer must exercise his or her own judgement as to how
to handle this, especially if the order of items is changed.
The fact that statements of copyright can be found within comments
indicates only that the community has not yet considered it
important to carry statements of intellectual property ownership
along as tagged content. I am sure that many complex questions about
copyright arise when one considers the proper treatment of IPR in
extracts or reorderings of data files.

I consider the 'magic number' comment recommended to start a CIF as
special, but only in the sense that it's provided as a courtesy to 
graphical file managers etc - it identifies a file as of type CIF
(in the same way that Windows tries to identify the type of a file
by its name suffix). CIF application software should look instead
for _audit_conform* tags if the intention is to test conformance
against a particular dictionary or set of dictionaries.
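
By way of example, a very small Python sketch of that approach
follows. It is not a CIF parser: it assumes the _audit_conform items
appear as simple "tag value" pairs on a single line, and it will miss
them if they are given in a loop_ or as text fields. The function name
audit_conform_items is hypothetical.

```python
def audit_conform_items(cif_text):
    """Collect _audit_conform_* tags given as single-line tag/value pairs.

    Returns a dict of lower-cased tag -> raw value string (or None if no
    value follows on the same line). Comment lines, including the
    leading 'magic number' comment, are simply ignored.
    """
    found = {}
    for line in cif_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("_audit_conform_"):
            parts = stripped.split(None, 1)
            tag = parts[0].lower()
            value = parts[1].strip() if len(parts) > 1 else None
            found[tag] = value
    return found
```

An application would then compare the reported dictionary name and
version with the dictionaries it actually has to hand.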

The namespace and conformance questions arise from the original
design goal to permit software authors to include their own tags
alongside "standard" tags in CIFs. There have been some developments
to guard against naming clashes, e.g. the registry of reserved prefixes,
but the implementation is awkward, and a case might well be made for
simplifying things with a syntactic device (e.g. a colon) to make
the namespace identifier easier to parse. As Herbert points out, the
problem becomes real (and progressively more acute) as CIF becomes
more widely used in different subjects.
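
The awkwardness can be seen in a small Python sketch. Without a
delimiter, the leading component of a tag can only be recognised as a
namespace by looking it up in a registry; with a colon (purely a
hypothetical syntax, not part of CIF) the prefix can be read off the
tag itself. The function names and the one-entry registry used in the
comments are illustrative assumptions.

```python
def prefix_via_registry(tag, registry):
    """Return the reserved prefix of a tag, or None.

    The first underscore-separated component is only a prefix if it
    appears in the registry (e.g. {"ccdc"}); otherwise it is just part
    of a standard name such as _cell_length_a.
    """
    head = tag.lstrip("_").split("_", 1)[0].lower()
    return head if head in registry else None


def prefix_via_colon(tag):
    """Return the prefix of a tag written with a hypothetical colon
    delimiter, e.g. _ccdc:geom_bond_type, or None if there is no colon."""
    name = tag.lstrip("_")
    if ":" not in name:
        return None
    return name.split(":", 1)[0].lower()
```

The second function needs no external knowledge at all, which is the
whole attraction of a syntactic device.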

Many perceived problems arise because there are no reference
implementations (for example of the dictionary stacking protocol).
Perhaps this is because the community has not in fact seen a real
need for such features, but I think we can only gauge the real
utility (and limitations) of the different suggested approaches by
putting them to the test. As Herbert has mentioned, a project to
create a reference implementation for dictionary stacking is now
getting under way.

The impression I retain is that the CIF approach still offers
many novel features, and I'm not sure how many of those features are
being tested in other environments - Peter may have some helpful
examples for us from the XML world. It is in any case beneficial to
test these features to see which of them are really useful in
practice, and which in the long run are simply impractical.

Regards
Brian
