CIF Infoset

Thu Aug 19 14:26:39 BST 2004

On Aug 19 2004, Herbert J. Bernstein wrote:

> There are two questions that Peter raises relative to comments and one
> relative to data types that call for a very clear response
> 
> >
> > Q. If my CIF parser automatically strips all comments from the 
> > document and, say, deposits them in a public repository, does anyone 
> > feel this is a problem?
> 
> 
> This is not only a problem, but depending on who owns the intellectual
> property rights in the document involved, it may well be a violation of
> copyright law. It is common practice to put copyright statements and
> references to licenses in the comments of documents, whether they be in
> CIF, XML or some other language.  If you have created the document in
> question, what you extract from it and deposit in a public repository is
> your business. If the document was created by someone else, or you
> surrendered your intellectual property rights to someone else, they get
> to decide how derived works are handled.  So, if you are designing a CIF
> parser to extract information from a CIF for some application to process
> internally, stripping all comments may well be a good idea, but if you are
> designing a CIF (or XML, or PostScript, or ASN.1) or other parser to
> reformat documents, then you need to be much more careful and inclusive
> of comments.

My own view is that comments should be preserved. Taking Herbert's view it 
then follows that comments are order-dependent.

# Here is a list of authors
# The first one is the lead author
# A.B.Foo
# D.E.Bar

It also suggests that we should have a "comment block" (since comments 
cannot span more than one line).
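
As a rough illustration of what a "comment block" could mean to a parser, 
here is a minimal Python sketch that groups consecutive comment lines into 
ordered blocks. The function name comment_blocks is hypothetical, and the 
sketch assumes comments occupy whole lines; it ignores end-of-line comments 
and '#' characters inside quoted values or text fields.

```python
def comment_blocks(lines):
    """Group runs of consecutive '#' lines into ordered comment blocks.

    Returns a list of (first_line_number, [comment_text, ...]) tuples,
    in document order. Assumption: only whole-line comments are handled.
    """
    blocks = []
    current = None
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            text = stripped[1:].strip()
            if current is None:
                # First comment line of a new block
                current = (number, [text])
                blocks.append(current)
            else:
                # Continuation of the block that is already open
                current[1].append(text)
        else:
            # Any non-comment line (including a blank one) ends the block
            current = None
    return blocks
```

Applied to the four author lines above, this yields a single block whose 
lines keep their original order, so "the first one is the lead author" 
still makes sense after the parse.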

However, I think it would also be valuable to stress that any IPR, metadata, 
or other semantics should be put in CIF items or loop_s and not in comments. I 
would prefer that authors are dissuaded from using comments for important 
information

- that the 
> 
> >
> > Q. Is the CIF version "comment" a special case and should it be 
> > preserved (I believe yes)
> 
> The handling of the CIF magic number comments depends on what you are
> doing with the document.  If you are reading the document, it is a good
> idea to read and parse the magic number to provide your parser with a hint
> as to the intended syntax (e.g. 80 character vs. 2048 character line
> length limit).  If you are writing a document, then rather than preserving
> the magic number comment from some starting document, you want to generate
> your own magic number comment that correctly specifies the syntax
> specification being followed by your CIF writer.  The sensible practice
> has been well established in the HTML/SGML/XML community, and proves very
> helpful in dealing with the dizzying variety of HTML/SGML/XML syntax
> versions.  Hopefully we will never have as many co-existing syntax
> versions in the CIF community, but the practice is still a sound one to
> follow.
> 
There is currently only one syntax for XML (V1.0), though XML 1.1 is under 
development. The XML declaration: <?xml version="1.0"?> is not mandatory but 
encouraged. I assume that the CIF magic comment is of that form and 
therefore not fundamentally a comment (the XML declaration is not a 
processing instruction).
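
The read-versus-write handling Herbert describes might be sketched in 
Python as follows. This assumes the magic number is a first line of the 
form #\#CIF_1.1; the function names and the mapping from version string to 
line-length limit are illustrative assumptions, not part of any existing 
CIF library.

```python
import re

# Assumption: the magic number is a first line such as "#\#CIF_1.1".
MAGIC = re.compile(r"^#\\#CIF_(\d+\.\d+)\s*$")

# The two limits mentioned in the discussion; pairing them with these
# version strings is an assumption made for this sketch.
LINE_LIMITS = {"1.0": 80, "1.1": 2048}


def read_magic(first_line):
    """Return the version string from a magic-number line, or None."""
    match = MAGIC.match(first_line)
    return match.group(1) if match else None


def line_limit(version, default=80):
    """Use the version as a hint for the line-length limit to enforce."""
    return LINE_LIMITS.get(version, default)


def write_magic(version="1.1"):
    """Emit the writer's own magic number instead of copying the input's."""
    return "#\\#CIF_" + version
```

A reader would call read_magic on the first line and treat the result as a 
hint only; a writer would always call write_magic with the version it 
actually implements.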

> > I am now unclear about the role of char and numb. I assumed they were 
> > for data validation and application programmers. The first would ensure 
> > that a data value was always a number - thus I would have believed that
> >
> > _cell_length_a 'too large to measure'
> >
> > was a validation error. The second aspect is now a nightmare for 
> > application programmers. Firstly the infoset (the result of the parse) 
> > has to retain knowledge of whether the value is quoted. Then the 
> > application has to take different action on whether the value is 
> > quoted. The author submits that _cell_length_a '12.1' _cell_length_a 
> > 12.1 have different meanings. (I cannot see what - as a programmer - I 
> > can or have to do). Formally if I get _cell_length_a '12.1' I would 
> > have to throw an exception "Cell_length_a is not a number, cannot 
> > continue".
> 
> As do many languages, CIF has data types.  The number of types depends
> on the DDL, but in all cases, there is a distinction between numeric data
> and other, more string-oriented data types (e.g. char and text).

Agreed.

> Just as
> with most programming languages, a quoted "12" is not a number.  The
> application does not need to preserve the quotes, but it does need to
> recognize that the data type of the data that it just read is not a
> numeric data type, and if the context within which it is being used
> calls for a numeric data type (e.g. as a value to _cell_length_a) then
> a good parser really should inform the user of the conflict.  This
> does mean that the parser "has to take different action on whether the
> value is quoted", but that is one of the services the parser is there
> to perform for the user, if it can.  Yes, there may be justification
> for writing a light-weight parser that does not catch such errors, but
> that hardly makes it a "nightmare for application programmers" to
> write parsers that do catch such errors.  Even when a dictionary is not
> being used, you really do want to recognize the distinction between
> number and non-numeric data.  For example, 1234-308 might well be
> intended as the number 1234*10**(-308) while '1234-308' is clearly
> intended to be the string of characters stated.
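
To make the quoted/unquoted distinction concrete, here is a minimal Python 
sketch of a dictionary-free classifier. The name classify and the numeric 
pattern are assumptions for illustration: the pattern is deliberately 
simplified and does not cover every form the CIF grammar allows (in 
particular it does not treat an unquoted 1234-308 as a number).

```python
import re

# Simplified numeric pattern (an assumption, not the full CIF grammar):
# optional sign, digits with optional decimal point, optional e/E exponent,
# optional standard uncertainty in parentheses, e.g. 12.1(3).
NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


def classify(token):
    """Return (kind, text), where kind is 'numb' or 'char'."""
    if len(token) >= 2 and token[0] in "'\"" and token[-1] == token[0]:
        # Quoted values are always strings; the quotes themselves are dropped
        return ("char", token[1:-1])
    if NUMBER.match(token):
        return ("numb", token)
    # Anything else unquoted is treated as a string
    return ("char", token)
```

Note that the infoset produced this way keeps the data type but not the 
quotes, which matches the statement that the application does not need to 
preserve the quotes.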

The difficulty is not preserving the data type, but the semantics of 
downstream decisions. If one author writes _my_phone "123-45678" they are 
announcing this is not a number while if another writes _my_phone 123-45678 
they are announcing it is a number. The discussion so far seems to suggest 
that these statements overrule the datatypes specified in the dictionary 
entries. There is a particular problem in loop_s, where it is then possible 
to have different data types within a column:

loop_ _atom_site_occupancy
1.0
0.3
"not refined"
"0.3"
"."

which makes the implementation very difficult. I believe that a programmer 
should be able to look up the data type in the dictionary entry and write a 
routine that relies on a value being of the correct data type and throws an 
exception if not.
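
A minimal Python sketch of the routine I have in mind follows. The 
DICTIONARY_TYPES table is a hypothetical stand-in for a real dictionary 
lookup, and CifTypeError and checked_value are illustrative names. The 
sketch takes the strict reading in which a quoted value is never numeric, 
and it relies on Python's float() for brevity, so it rejects values with a 
standard uncertainty such as 0.3(1).

```python
class CifTypeError(ValueError):
    """Raised when a value does not match its dictionary data type."""


# Hypothetical stand-in for looking up the type in a dictionary entry
DICTIONARY_TYPES = {"_atom_site_occupancy": "numb", "_cell_length_a": "numb"}


def checked_value(name, token):
    """Return the value of token for item name, or raise CifTypeError."""
    expected = DICTIONARY_TYPES.get(name, "char")
    quoted = len(token) >= 2 and token[0] in "'\"" and token[-1] == token[0]
    if expected != "numb":
        # String-like items: drop the quotes, keep the text
        return token[1:-1] if quoted else token
    if quoted:
        # Strict reading: a quoted value is not a number
        raise CifTypeError(f"{name}: quoted value {token} is not a number")
    if token in (".", "?"):
        # Unquoted '.' (inapplicable) and '?' (unknown) carry no value
        return None
    try:
        return float(token)
    except ValueError:
        raise CifTypeError(f"{name}: {token!r} is not a number") from None
```

Run over the _atom_site_occupancy column above, the first two rows return 
1.0 and 0.3, and each of the three quoted rows raises CifTypeError, so the 
downstream code can rely on every value it receives being numeric.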

P.
