Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

Bollinger, John C John.Bollinger at STJUDE.ORG
Tue Mar 15 16:32:33 GMT 2011

On Tuesday, March 15, 2011 6:39 AM, Herbert J. Bernstein wrote:

>   My apologies.  This may take a while.  To avoid critical points getting
>lost, I'd like to focus on one sub-issue at a time, starting with the
>divide between syntax and semantics.  If we isolate all syntax development
>from semantics and relegate all semantics to the dictionaries, CIF becomes
>something very different from what it has been in the past, through
>CIF1.1.  CIF would be confined purely to considerations of which strings
>of characters are valid.

I agree that syntax cannot be wholly divorced from semantics, but I don't think anyone suggested such a split.  James's revised principles merely express a bias towards standardizing "behavior" via dictionaries rather than via the base syntax.  This is consistent with usage of CIF 1.1.  In any case, it is the *syntax* specification that is currently the focus of attention, and it is feasible for that document to be limited to exactly the scope Herbert describes, provided that it is accompanied by a companion document specifying the needed base semantics.

>  The dictionaries would deal with such issues
>as whether the numeric strings 13.45 and 1.345E1 are equivalent.  All
>the "common semantic features" of CIF 1.1 would have to be replicated
>dictionary by dictionary and no longer would have to be common.  The
>relationships between CIFS and their dictionaries now specified by
>the DDLs as part of CIF would have to be moved down purely to dictionary
>development, and instead of just having DDL1, DDL2 and DDLm, we could have
>one or more flavors of DDL for each subdomain using CIF, or even one per
>data file.

I confess I don't follow the latter part of those comments, but I am confident at least that DDLs will not proliferate.  DDLm may well not be the last DDL, but a new DDL requires too much investment and infrastructure to be created casually.  Moreover, I don't see why new DDLs would be needed to support the kinds of semantics that might be considered for inclusion in the base CIF specifications.

>   I, for one, think that the divide used in the past, in which as much
>as possible of the common semantics was treated along with the raw
>syntax, was a very useful approach and helped to reduce the drift
>of CIF into multiple dialects, and that we will consider all proposed
>features in terms of their total impact on the use of CIF, not just
>in terms of the validity or invalidity of particular strings.

The CIF syntax specification must document how to express logical, base CIF structure and content in concrete electronic form.  By "logical, base CIF structure and content" I mean:

1) The logical structure of CIF, consisting of data blocks, save frames, loops, data names, and data values.

2) Logical data block names, save frame names, and data names, consisting of limited-length sequences of abstract characters (in the Unicode sense of the term), excluding certain characters.

3) A small set of base data types that values may have.  At minimum, these would be just character and null, but I think it useful to include a base numeric type as well, as Herbert's comments suggest he also does.

4) The properties of values of each base type.  For example, at the logical level, values of character type consist of a sequence of any number of arbitrary abstract characters (Unicode sense).  (Or are some characters excluded at this level?)  Following CIF 1.x, values of numeric type might be arbitrary-precision, arbitrary-scale floating point numbers with an optional associated standard uncertainty.
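
As an illustration only, the logical model in points 1-4 could be sketched roughly as follows in Python.  All class and function names here (Numeric, Loop, Container, DataBlock, parse_numeric) are hypothetical and come from no CIF specification or library.  The sketch assumes the CIF 1.x convention of writing a standard uncertainty in parentheses after the number, applied to the last quoted digits, and it assumes that any exponent precedes the parenthesized uncertainty.

```python
import re
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Numeric:
    """Hypothetical base numeric type: arbitrary precision, optional s.u."""
    value: Decimal
    su: Optional[Decimal] = None  # standard uncertainty, if one was given


# A logical value is character (str), numeric, or null (None here).
Value = Union[str, Numeric, None]


@dataclass
class Loop:
    names: List[str]          # data names of the looped columns
    rows: List[List[Value]]   # one list of values per loop packet


@dataclass
class Container:
    """Shared shape of data blocks and save frames (hypothetical)."""
    name: str
    items: Dict[str, Value] = field(default_factory=dict)
    loops: List[Loop] = field(default_factory=list)


@dataclass
class DataBlock(Container):
    save_frames: List[Container] = field(default_factory=list)


# Assumed textual form: mantissa, optional exponent, optional "(su)".
_NUMBER = re.compile(
    r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:[eE]([+-]?\d+))?(?:\((\d+)\))?$"
)


def parse_numeric(text: str) -> Numeric:
    """Turn a numeric string into a logical Numeric value."""
    match = _NUMBER.match(text)
    if match is None:
        raise ValueError("not a numeric string: %r" % text)
    mantissa_text, exponent_text, su_text = match.groups()
    mantissa = Decimal(mantissa_text)
    exponent = int(exponent_text) if exponent_text else 0
    value = mantissa.scaleb(exponent)
    su = None
    if su_text is not None:
        # The s.u. digits line up with the last digits of the mantissa.
        last_digit_place = mantissa.as_tuple().exponent
        su = Decimal(su_text).scaleb(last_digit_place + exponent)
    return Numeric(value, su)


# Different spellings of the same number give equal logical values.
same = parse_numeric("13.45") == parse_numeric("1.345E1")
```

Under this sketch, whether 13.45 and 1.345E1 are equivalent is settled once, at the level of the base numeric type, rather than separately in each dictionary.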

Supporting that model covers most of the ground that the CIF 1.1 "Common Semantic Features" document does, thus cleaving closely to "the divide used in the past".  For example, it follows from a mandate to support that model that the base CIF 2.0 specifications should provide, among other things,

a) either unlimited line length or some means of line-folding for data values

b) a means to express all characters allowed in logical data values, data block codes, save frame codes, and data names

c) a means to express logical data values that contain all data value delimiters employed by the syntax

It does not follow, however, that every reasonable means of addressing each of those needs, or even several of them, should be included in the base specifications.  It also does not follow that alternative means of addressing them cannot or should not be defined in dictionaries.
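
To make need (a) concrete, here is a minimal Python sketch of one possible line-folding scheme.  It is loosely modeled on the backslash-continuation idea of CIF 1.1 line folding but is not the normative protocol.  The functions fold and unfold are hypothetical, and the sketch assumes the value contains no backslashes and no newlines of its own.

```python
def fold(value: str, width: int = 80) -> str:
    """Split a one-line value into lines of at most `width` characters.

    Every line except the last ends with a backslash, which marks it
    as continuing on the next line.
    """
    if width < 2:
        raise ValueError("width must leave room for the backslash")
    lines = []
    while len(value) > width:
        # Reserve one character of the line for the trailing backslash.
        lines.append(value[: width - 1] + "\\")
        value = value[width - 1:]
    lines.append(value)
    return "\n".join(lines)


def unfold(text: str) -> str:
    """Recover the logical value by removing backslash-newline pairs."""
    return text.replace("\\\n", "")
```

The logical value is the same whether or not it was folded, so a scheme of this kind affects only the concrete form and leaves the logical model untouched.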



John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
