Problems with CIF BNF

Joe Krahn krahn at niehs.nih.gov
Mon Mar 12 20:30:37 GMT 2007


I realize that the BNF contains a few hacks to deal with
context dependence, such as productions defined on multiple symbols, which
make it impossible to use as a working BNF. But there are other
problems with the grammar as well. In the end-of-line case, the lexer can do
something 'sensible', but it is still important to define specifically
whether a missing terminal <eol> makes the CIF invalid.
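
To make the two possible policies concrete, here is a minimal sketch of a
line splitter that either tolerates or rejects a final line ending at EOF.
The names split_cif_lines and MissingFinalEol are hypothetical, and treating
"\r\n", "\r" and "\n" all as <eol> is an assumption of this sketch, not
something taken from the CIF specification or from CBFlib.

```python
class MissingFinalEol(ValueError):
    """Raised in strict mode when the last line is not terminated by an <eol>."""


def split_cif_lines(text, strict=False):
    """Split text into lines, making the missing-final-<eol> policy explicit.

    Returns (lines, had_final_eol). Hypothetical helper, not a CBFlib API.
    """
    # Assumption: any of "\r\n", "\r" or "\n" counts as a single <eol>.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    had_final_eol = normalized == "" or normalized.endswith("\n")
    if not had_final_eol:
        if strict:
            # Strict policy: a final line ending at EOF makes the file invalid.
            raise MissingFinalEol("last line ends at EOF without <eol>")
        # Lenient policy: behave as if EOF had supplied the missing <eol>.
        normalized += "\n"
    # Drop the empty piece that split() yields after the final "\n".
    lines = normalized.split("\n")[:-1]
    return lines, had_final_eol
```

Either policy is easy to implement; the point is that the specification
should say which one a conforming parser must follow.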

I can look at CBFlib to see an interpretation of the CIF grammar, but
someone else's parser may have a different interpretation. In fact, it
would be good to have a collection of unusual CIF files for parser
testing, with a consensus as to which ones are valid and which are invalid.
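
As a rough sketch of how such a collection could be used, the harness below
runs several parsers over the same inputs and reports the cases on which
they disagree. The two "parsers" are toy stand-ins that differ only on the
final-<eol> question, and all names here are hypothetical; a real harness
would wrap actual parsers such as CBFlib.

```python
def find_disagreements(cases, parsers):
    """Return {case_name: {parser_name: verdict}} for cases with mixed verdicts."""
    disagreements = {}
    for name, text in cases.items():
        verdicts = {pname: accepts(text) for pname, accepts in parsers.items()}
        if len(set(verdicts.values())) > 1:
            disagreements[name] = verdicts
    return disagreements


def strict_accepts(text):
    # Toy stand-in: rejects any input whose last line lacks an <eol>.
    return text.endswith("\n")


def lenient_accepts(text):
    # Toy stand-in: accepts every input.
    return True


# Hypothetical edge cases; a real collection would be files on disk.
CASES = {
    "final_eol": "data_x\n_a 1\n",
    "no_final_eol": "data_x\n_a 1",
}
PARSERS = {"strict": strict_accepts, "lenient": lenient_accepts}

if __name__ == "__main__":
    for case, verdicts in find_disagreements(CASES, PARSERS).items():
        print(case, verdicts)
```

The cases that show up in the disagreement list are exactly the ones that
need a consensus ruling.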

Joe

Herbert J. Bernstein wrote:
> Without a defined lexer, you cannot do CIF as a BNF; it is context
> sensitive in its use of whitespace.  The question you are raising
> about EOF should be handled by the lexer, which should deal sensibly
> with the usual unix problem of disambiguating the case of a final
> line that ends with eof rather than eol-eof.  There is a rather
> complete bison grammar in CBFlib working on the level of tokens
> after lexing the input.  -- HJB
> 
> 
> At 1:44 PM -0400 3/12/07, Joe Krahn wrote:
>> Some parts of CIF are vague. I hoped that the BNF syntax would be a
>> precise syntax specification, but it has problems. It is central to
>> properly defining the CIF format, and should therefore be very accurate.
>>
>> First, there are some plain syntax errors, like unbalanced braces in the
>> production of <Float>, and an empty token in the TokenizedComments
>> production.
>>
>> There are also a few hacks like <noteol>, and the lack of rules for the
>> content of quoted strings. I think it is also a hack for a production
>> unit to be defined for two elements, like "<eol><UnquotedString>".
>>
>> Does EOF count as whitespace? Normally, a text file ends with an <eol>
>> on the last line, so it is not a problem. With Fortran, you may not be
>> able to distinguish between them, so it seems that EOF probably should
>> count as a whitespace token.
>>
>> There are also places where the grammar could be simplified, such as:
>>
>>   { {'e' | 'E' } | {'e' | 'E' } { '+' | '- ' } } <UnsignedInteger>
>>
>> written as:
>>   {'e' | 'E' } { '+' | '-' }?  <UnsignedInteger>
>>
>> Also note the error in the first form copied from the web page: the
>> minus sign has a space included.
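
The claimed equivalence of the two exponent forms can be spot-checked with
regular expressions. This is only a sketch: it assumes <UnsignedInteger> is
one or more decimal digits, and the pattern names are made up for
illustration.

```python
import re

# Two-alternative form, with the stray space after '-' corrected.
ORIGINAL = re.compile(r"(?:[eE]|[eE][+-])[0-9]+")
# Simplified form with an optional sign.
SIMPLIFIED = re.compile(r"[eE][+-]?[0-9]+")
# The first form taken literally, with '- ' (minus followed by a space).
AS_PUBLISHED = re.compile(r"(?:[eE]|[eE](?:\+|- ))[0-9]+")

SAMPLES = ["e5", "E5", "e+5", "E-12", "e", "e+", "5", "e--5", "e- 5", "x5"]


def same_language(samples):
    """True if ORIGINAL and SIMPLIFIED agree on every sample string."""
    return all(
        bool(ORIGINAL.fullmatch(s)) == bool(SIMPLIFIED.fullmatch(s))
        for s in samples
    )
```

On these samples the two corrected forms agree, while the literal '- '
version accepts "e- 5" and rejects "e-5". A finite sample list is a sanity
check, not a proof of equivalence.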
>>
>> Should the logical-OR symbol always be contained within braces? This
>> appears to be inconsistent, but maybe the rule is to require braces when
>> the members include a quoted character element.
>>
>> I will try to edit my own version of the BNF to produce what I think it
>> is supposed to mean. Answers to some of the above questions will be
>> helpful in getting it right.
>>
>> Thanks,
>> Joe Krahn
>> _______________________________________________
>> comcifs mailing list
>> comcifs at iucr.org
>> http://scripts.iucr.org/mailman/listinfo/comcifs
> 

