CIF Infoset

Dr P. Murray-Rust pm286 at cam.ac.uk
Wed Aug 18 12:24:51 BST 2004


On Aug 18 2004, Nick Spadaccini wrote:

Greetings Nick,

> On Tue, 17 Aug 2004, Peter Murray-Rust wrote:
> 
> > Q. Are comments part of the infoset? My current belief is no, but
> > certain comments (e.g. #\#CIF_1.1) convey important information. Also
> > some comments such as
> > 
> > # Supplementary Material (ESI) for Organic & Biomolecular Chemistry
> > # This journal is © The Royal Society of Chemistry 2003
> > 
> > may suffer by being lost
> 
> This question is more interesting than any answer. If infosets define
> lexically equivalent files why ask this question? 

If the CIF community feels that the lexical integrity of CIFs is important, 
then the infoset is irrelevant - the CIF is the lexical content and every 
lexical variant is a different CIF. In that case:

data_a _foo f

and 
data_a 
_foo 'f'

are different and that's an end to my questions! It means that the 
commonality in CIF software is limited to checking the lexical and some 
semantic validity but not its interpretation or re-use. In that case every 
CIF implementer is free to take their own view on how a CIF should be 
*interpreted*. Personally I would feel this was unfortunate as it detracts 
from interoperability.
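
To make the two readings concrete, here is a minimal Python sketch. The tokenizer is a hypothetical, heavily simplified one (no loop_, no comments, no text fields), and the names toy_infoset and keep_quoting are my own illustrations, not part of any CIF library. It shows that the two fragments above collapse to one infoset if quoting is discarded, and stay distinct if the infoset is asked to retain it.

```python
import re

# Hypothetical, much-simplified tokenizer: whitespace-separated tokens,
# with '...' or "..." treated as one quoted value. Real CIF has more
# rules (loop_, comments, text fields) that are ignored here.
TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(\S+)""")

def toy_infoset(text, keep_quoting=False):
    """Return {block: {tag: value}} for a loop-free CIF fragment."""
    blocks, current, tag = {}, None, None
    for match in TOKEN.finditer(text):
        single, double, bare = match.groups()
        quoted = bare is None
        if bare is not None:
            value = bare
        else:
            value = single if single is not None else double
        if not quoted and value.lower().startswith("data_"):
            current = blocks.setdefault(value[5:], {})
        elif not quoted and value.startswith("_") and tag is None:
            tag = value
        else:
            # Either drop the quoting or carry it along as a flag.
            current[tag] = (value, quoted) if keep_quoting else value
            tag = None
    return blocks

one_line = "data_a _foo f"
two_lines = "data_a\n_foo 'f'"

print(toy_infoset(one_line) == toy_infoset(two_lines))
print(toy_infoset(one_line, True) == toy_infoset(two_lines, True))
```

Which of the two comparisons is the "right" one is exactly the question being put to the community; the sketch only shows that the choice has to be made somewhere.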



> If there is a comment in
> the file, then there should exist an infoset that can handle it - isn't
> that the idea? Whether at an application level one chooses to use the
> comments is a different question.

In the XML community the design is now:

datainstance (*.xml) -parser-> infoset -SAX/DOM API -> application

There is communal benefit in having all parsers produce the same infoset 
from a given xmlInstance and exposing it through a common API such as SAX 
and DOM. (see http://sax.sf.net for a discussion of the history and benefits 
of this approach).

In writing a CML application I can use any XML parser in the knowledge that 
it will produce the same SAX callback content. I do not have to write a 
parser myself. (I also do not have to write a validator, a transformer, 
etc.) If, in CIF, everyone has to write all components in an application it 
reduces the communal benefit of having a shared lexical structure.
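
As an illustration of that pipeline, the following sketch uses the SAX binding in the Python standard library (xml.sax). The handler class EventLog, the helper sax_events and the CML-like sample document are my own hypothetical examples. The application only sees callbacks; it never touches the lexical form of the file.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Record the SAX callbacks an application would receive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # A parser may deliver text in several chunks.
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

def sax_events(xml_text):
    handler = EventLog()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events

# Hypothetical CML-like instance, for illustration only.
for event in sax_events('<molecule id="m1"><atom>C</atom></molecule>'):
    print(event)
```

A CIF equivalent would need an agreed set of callbacks (block start, data item, loop start, and so on), which is what an agreed infoset would make possible.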

FWIW I have written my own CIFinfoset. I'm looking for communal feedback 
before publishing it...

> 
> StarBase (an application) *chooses* to interpret comments as lexical
> whitespace and removes them in the tokenising phase.
> 
> Does an infoset for HTML that says
> 
> <b><!--interpret hello as goodbye-->hello</b> is equivalent to
> <b>hello</b>? If so, wouldn't that be somewhat dangerous?

HTML chooses to make comments part of the infoset so those two documents 
have different infosets. However the following *are* equivalent at infoset 
level:

<b>hello</b>
<b    >hello</b>
<b>&#104;<![CDATA[ell]]>o</b>
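
This can be checked with any conforming parser. A small sketch using xml.etree.ElementTree from the Python standard library (the variable names are mine):

```python
import xml.etree.ElementTree as ET

variants = [
    "<b>hello</b>",
    "<b    >hello</b>",
    "<b>&#104;<![CDATA[ell]]>o</b>",
]

# Reduce each document to what the application sees: element name and text.
seen = {(ET.fromstring(v).tag, ET.fromstring(v).text) for v in variants}
print(seen)
```

The whitespace inside the start tag, the character reference and the CDATA section are all lexical detail that the parser resolves before the application is called.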

> > Q. Does the presence or absence of a dictionary affect the infoset? (it
> > is formally impossible to deconvolute namespaces or categories without a
> > dictionary) Moreover defaults, etc (see below) depend on a dictionary.
> 
> Why is the deconvolution of namespaces and categories (in the Star 
> syntax) a lexical issue? That is a higher order issue. The datanames 
> would have to be identical (up to case) in either file, though their 
> placement could be very different.

In CIF DDL1 namespaces are indicated by prefixes affixed through 
underscores. A tag such as

_my_local_namespace_atom_xyz_angstrom_B12

is unparsable unless there is a lookup table of what the allowable 
namespaces are, the dictionaries that belong to them, the allowed suffixes, 
etc.
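
A short sketch of why the lookup table is needed. The functions candidate_splits and resolve, and the table of known prefixes, are hypothetical illustrations and not taken from any dictionary.

```python
def candidate_splits(tag):
    """Every way of cutting a tag into (prefix, remainder) at an underscore."""
    parts = tag.lstrip("_").split("_")
    return [("_".join(parts[:i]), "_".join(parts[i:]))
            for i in range(1, len(parts))]

def resolve(tag, known_prefixes):
    """Keep only the splits whose prefix appears in the lookup table."""
    return [s for s in candidate_splits(tag) if s[0] in known_prefixes]

tag = "_my_local_namespace_atom_xyz_angstrom_B12"

# Without a table, every underscore is a possible namespace boundary.
print(len(candidate_splits(tag)))

# With an (assumed) table of registered prefixes the split becomes unique.
print(resolve(tag, {"my_local_namespace"}))
```

The tag name alone admits six readings; only the external table selects one of them.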

> 
> > the presence of a dictionary is important, is it an error to have a CIF
> > without a dictionary?
> 
> At the lexical level I am trying to see how you need a dictionary.

DDL1 semantics require that only tags of the same category are found within 
a given loop_. It is impossible to determine the category from the tag name 
without a dictionary.
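
A sketch of the check this implies. The tag-to-category table below is an illustrative extract that I have typed in by hand, standing in for a parsed dictionary; the function names are hypothetical.

```python
def loop_categories(loop_tags, dictionary):
    """Set of categories that the tags of one loop_ header belong to."""
    return {dictionary[tag] for tag in loop_tags}

def loop_is_single_category(loop_tags, dictionary):
    return len(loop_categories(loop_tags, dictionary)) == 1

# Assumed extract: tag -> category. A real validator would read this
# from the dictionary file rather than hard-code it.
toy_dictionary = {
    "_atom_site_label": "atom_site",
    "_atom_site_fract_x": "atom_site",
    "_geom_bond_distance": "geom_bond",
}

good_loop = ["_atom_site_label", "_atom_site_fract_x"]
bad_loop = ["_atom_site_label", "_geom_bond_distance"]

print(loop_is_single_category(good_loop, toy_dictionary))
print(loop_is_single_category(bad_loop, toy_dictionary))
```

Without toy_dictionary (or its real counterpart) neither call can be made, which is the sense in which the dictionary affects what can be said about the file.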

> If it is a
> question of a value like "?" versus another file with the default value
> substituted then these are very different things, and the infoset should
> highlight them as such.
> 
> > Q Should the (a) fact (b) manner of quoting be preserved in the 
> > infoset? The specification suggests that '12' and 12 should be 
> > interpreted differently in certain circumstances, but I cannot work out 
> > which and how. (The type of a data item is defined by the dictionary 
> > entry char/numb - does the quoting overrule this? If not, what is its 
> > role?)
> 
> This is a throwback from the very first versions of STAR. It was a weak
> attempt at some type information (only char and num - woefully
> inadequate). However it seems to me the declaration as char or numb had to
> do with its lexical appearance - not its actual type. So if something is
> numb, you expect it to be a number, irrespective of the lexical eye candy
> provided by a variety of delimited string forms. If _cell_length is
> declared numb, then '12.1' and 12.1 are equivalent in interpretation (at
> the application level).

The CIF specification indicates that they have different semantics. If 
this is now obsolete or deprecated it would make implementations simpler.

> 
> Mmmmmm. Now I can see why you think you need dictionaries. However if the
> above is what you are supposed to do with infosets then I have
> misunderstood what its intent is. I guess that an infoset states the
> following two XML entities are lexically equivalent, <blah></blah> and
> <blah />, but this is a well defined operation

It is in XML. I am asking whether it is in CIF :-)

> - like order independence
> in STAR. I wouldn't *expect* an infoset to deal with the semantic
> equivalences of delimited versus non delimited strings.

Agreed. If they are equivalent then they don't concern the infoset. If, 
however, the CIF community sees quoting as important to pass to the 
application, then the infoset has to retain it.
> 
> > Q Is the order of data items and loops in a data_ block unimportant?
> 
> By definition.

Agreed, but I couldn't see it in the syntax/semantic specifications, so I 
suggest it is formally added to them.

> 
> > Q is the order of names in a loop_ header important? Do
> 
> At any single level yes, but not through a full nesting (STAR not a CIF
> issue).

Good. So again this would be useful to include explicitly.

> 
> > Q Is the order of "rows" in a loop_ unimportant? Do
> 
> Yes (in CIF).

That is very useful (and non-obvious from the spec). It then makes it 
possible to confirm the identity of two sets of coordinates, symmetry 
operations, etc.
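
A minimal sketch of such a comparison, assuming only that row order is insignificant. The function loop_infoset is hypothetical; values are compared as the strings found in the file, so "0.5" and "0.50" would still count as different here. A Counter is used rather than a set so that a duplicated row is not silently merged.

```python
from collections import Counter

def loop_infoset(names, rows):
    """Header in file order plus an unordered multiset of rows."""
    return tuple(names), Counter(tuple(row) for row in rows)

names = ["_atom_site_label", "_atom_site_fract_x"]
rows_1 = [("C1", "0.10"), ("O1", "0.25"), ("N1", "0.40")]
rows_2 = [("N1", "0.40"), ("C1", "0.10"), ("O1", "0.25")]  # same rows, permuted
rows_3 = [("N1", "0.41"), ("C1", "0.10"), ("O1", "0.25")]  # one value changed

print(loop_infoset(names, rows_1) == loop_infoset(names, rows_2))
print(loop_infoset(names, rows_1) == loop_infoset(names, rows_3))
```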

> 
> > have identical infosets? (In a relational model they would).
> > 
> > Q Does data_global have any semantics? I suspect that formally it does
> > not, but it seems in widespread use:
> 
> 
> data_global doesn't exist. 

It does (frequently). (I appreciate that global_ is different and 
irrelevant to CIF/DDL1). data_global is very frequently used as the first 
block in a multiblock CIF to indicate information that (I assume) the 
author wishes to apply to all blocks. I think it either needs deprecating 
or accepting and formalising.



> global_ does (in STAR and CIF?). Its semantics
> are well defined.
> 
> > 
> > global_
> > _foo foo
> > 
> > data_a
> > _bar a
> > 
> > data_b
> > _bar b
> > 
> > seems to have the semantics equivalent to:
> > 
> > data_a
> > _foo foo
> > _bar a
> > 
> > data_b
> > _foo foo
> > _bar b
> 
> Yes, furthermore 
> 
>  global_
>  _foo foo
> 
>  data_a
>  _bar a
> 
>  global_
>  _foo foo2
> 
>  data_b
>  _bar b
> 
>  seems to have the semantics equivalent to:
> 
>  data_a
>  _foo foo
>  _bar a
> 
>  data_b
>  _foo foo2
>  _bar b
> 
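
The scoping in the two examples above can be sketched as follows. The representation of a file as a list of (kind, name, items) sections and the function resolve_globals are hypothetical; I also assume that an item given in a data block takes precedence over a global one of the same name, which the examples do not actually test.

```python
def resolve_globals(sections):
    """Expand global_ sections into the data blocks that follow them.

    sections: list of (kind, name, items) in file order, where kind is
    "global" or "data" and items is a dict of tag -> value.
    """
    inherited, resolved = {}, {}
    for kind, name, items in sections:
        if kind == "global":
            # A later global_ overrides earlier values of the same tag.
            inherited.update(items)
        else:
            # Assumption: the block's own items win over inherited ones.
            resolved[name] = {**inherited, **items}
    return resolved

sections = [
    ("global", None, {"_foo": "foo"}),
    ("data", "a", {"_bar": "a"}),
    ("global", None, {"_foo": "foo2"}),
    ("data", "b", {"_bar": "b"}),
]

for block, items in resolve_globals(sections).items():
    print(block, items)
```
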
> > Q how should ? be treated in the infoset?
> 
> Strictly it should be treated as ? at the lexical level ie TOKEN(UNKNOWN).
> What you do with that at the higher level may require the dictionary.
> Similarly (at a lexical level) "." should be left as it is. It is up to
> the application to deal with it.
> 
> > Q how is '.' to be interpreted?
> 
> Again (I believe) an application level problem, not to be handled at a
> lexical level.
> 
> > This is extremely difficult to interpret in the infoset. The first part
> > suggests that the limitations come from a non-rectangular loop_ - it is
> > simply there so the syntax is not violated. The default value cannot be
> > applied without a program that understands and implements dictionary
> > entries. How common is this? (I suspect fairly rare.) If so, I would
> > argue that the default approach is dangerous and should be phased out.

The question is whether the author can assert any behaviour or 
interpretation to these tokens. Does "?" impart information? There is a 
difference between:

"I have omitted this info without giving a reason"
"I have tried to find this information but cannot do so"

I would suggest that "?" be accompanied by a statement like "this symbol 
conveys no information about the data item and no inference can be made 
about its presence or absence"

"." is worse because the spec can be interpreted as requiring the 
implementer to insert the default value from the dictionary. At one stage 
this would be interpreted to mean that unless specified all extinction 
corrections were, by default, Zachariasen. Defaults, and their insertion, 
have to be explicitly specified.
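
One way to keep the two concerns apart is sketched below. The names Special, lexical_value and apply_defaults are hypothetical, the tokens are assumed to be unquoted, and the defaults table is a hand-typed stand-in for a dictionary lookup. The infoset carries "?" and "." as distinct markers; a default is substituted only when the application explicitly asks for it.

```python
from enum import Enum

class Special(Enum):
    UNKNOWN = "?"
    DOT = "."

def lexical_value(token):
    """Map an unquoted CIF value token to a string or a Special marker."""
    try:
        return Special(token)
    except ValueError:
        return token

def apply_defaults(items, defaults):
    """Application-level step: replace DOT only where a default is supplied."""
    return {tag: defaults.get(tag, value) if value is Special.DOT else value
            for tag, value in items.items()}

items = {
    "_refine_ls_extinction_method": lexical_value("."),
    "_cell_length_a": lexical_value("?"),
}

# Assumed default, following the extinction example in the text.
toy_defaults = {"_refine_ls_extinction_method": "Zachariasen"}

print(items)
print(apply_defaults(items, toy_defaults))
```

With this split, two applications that disagree about defaults still agree on the infoset, and "?" is never turned into anything else.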

> 
> I suspect apart from Syd and I, almost no one sucks in dictionaries to
> validate STAR/CIF file contents. Most just assume they know what they need
> to and hope the definition of the data item has never changed.

That may be true, but I look to the future where documents will be curated 
and retrieved by machines and the machine-implementable semantics are 
critical. Since CIF is an outstanding example of a successful scientific 
ontology it is worth seeing how far it can be pursued.
> 
> Good luck, Peter.

Thanks - more later.

P.

> 
> cheers
> 
> Nick
> 
> --------------------------------
> Dr N. Spadaccini                      Head of School
>
> School of Computer Science &          voice: +(61 8) 6488 3452
> Software Engineering                  fax:   +(61 8) 6488 1089
> The University of Western Australia   email: nick at csse.uwa.edu.au
> 35 Stirling Highway                   w3:    www.csse.uwa.edu.au/~nick
> CRAWLEY, Perth, WA 6009 AUSTRALIA     CRICOS Provider Code: 00126G
> 
> 
> 
> _______________________________________________
> comcifs mailing list
> comcifs at iucr.org
> http://scripts.iucr.org/mailman/listinfo/comcifs
> 


