Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains. .. .
Bollinger, John C
John.Bollinger at STJUDE.ORG
Tue Mar 22 14:46:19 GMT 2011
At James's request, I post the following slightly-edited version of some comments that yesterday I sent to the DDLm group. I hope that this will help move the discussion of CIF design principles toward a consensus. My apologies for the length.
On Friday, March 18, 2011 5:20 AM, Herbert J. Bernstein wrote:
HJB>No, I do not see a problem with separate syntax and semantics documents any more than I see a problem with separate productions in a grammar.
HJB>I do see a problem with _considering_ the design or impact of either syntax or semantics in isolation from each other. I firmly believe that the result of a purely "bottom-up" syntax-first design in isolation from a "top-down" semantics design or of a top-down semantics-first design in isolation from a syntax design is inefficient and likely to walk us into dead-ends. I derive this view from decades of literature on software engineering on failed approaches to software designs, and the continuing success of "the Scandinavian method" or "participatory design" in which work on internal design is intertwined with design of externals.
HJB>As for the requested example -- I already gave one -- the design of the numeric types in CIF 1.0 and CIF 1.1, in which the equivalence classes of numbers (i.e. that 13.45 and 1.345E1 are the "same" number) is simply an assumed semantic feature intimately coupled to the syntax. To give another, much more subtle equivalence class issue, the equivalence of "abc" and abc and 'abc' but the inequivalences of "123" and 123 and of "." and "?" from . and ? are semantic issues intimately coupled to the syntax. The original design document for CIF was a semantics document with a bit of syntax intertwined. DDL came later, and the pure syntax and semantics documents came long after the intertwined approach, _after_ everybody had a clear view of the interaction of CIF1 syntax and semantics. I, for one, do not yet have a clear understanding of those interactions for CIF2 and DDLm.
Herbert's points are well made. I agree that the design of numeric types is a soft spot in the CIF 1.x specifications, and that it relies on a close relationship between syntax and semantics. The special data values . and ? depend even more on such a relationship. Herbert is right that CIF 2.0 syntax cannot be designed in isolation from CIF 2.0 semantics, and that these issues in particular should be addressed.
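To make those equivalence classes concrete, here is a minimal Python sketch of how a CIF 1.x reader might map raw tokens to values. It is only an illustration under stated assumptions, not any existing parser's behaviour: the function name `interpret`, the kind labels, and the simplified numeric pattern (which ignores standard uncertainties such as 1.345(2)) are all hypothetical.

```python
import re

# Simplified numeric pattern (an assumption): optional sign, digits with an
# optional decimal point, optional exponent. Standard uncertainties in
# parentheses, which CIF 1.1 numbers may carry, are deliberately ignored.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def interpret(token):
    """Map a raw token to a (kind, value) pair.

    Two tokens are treated as the "same" value exactly when their
    (kind, value) pairs compare equal.
    """
    # A quoted token is always a character string, whatever it contains.
    if len(token) >= 2 and token[0] in "'\"" and token[-1] == token[0]:
        return ("string", token[1:-1])
    # Unquoted . and ? are the special "inapplicable" and "unknown" values.
    if token == ".":
        return ("inapplicable", None)
    if token == "?":
        return ("unknown", None)
    # An unquoted token that looks like a number is compared by numeric value.
    if _NUMBER.fullmatch(token):
        return ("number", float(token))
    # Any other unquoted token is a character string.
    return ("string", token)
```

Under these assumptions, interpret("13.45") and interpret("1.345E1") compare equal, as do the results for "abc", abc and 'abc', whereas "123" and 123 differ, and the quoted "." and "?" differ from the unquoted special values. None of this follows from the grammar alone, which is the coupling Herbert describes.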
The discussion has drifted rather far afield from the original and pressing question, however, which was "[With respect to triple quote syntax,] should we seek maximum consistency with other usage of identical syntactical constructs, despite the imposition of unnecessary technical baggage? Or should we produce a standard as simple and streamlined as possible, despite the potential for confusion and unorthodox behaviour?" I would be happy for COMCIFS to issue broader guidance, as has been suggested, but I hope that decision will not be unduly delayed by a detour into minutiae such as the division and interplay between CIF 2.0 syntax and semantics.
In pursuit of broad rather than narrow guidance, therefore, I suggest a change in the terms of discussion. Rather than syntax vs. semantics, it may be more useful to partition CIF into 'base' CIF 2.0, which all CIF 2.0 processors are expected to accept and interpret equivalently, and 'domain-level' CIF encompassing those aspects of CIF semantics and convention that are defined via the dictionary system. The base contains CIF syntax and the common semantics, whereas domain-level CIF adds ontology, constraints, controlled vocabulary, etc. The key distinction between these layers is, of course, which features "all CIF 2.0 processors are expected to accept and handle equivalently." It is fitting that that dovetails with some of the technical arguments about the triple quote syntax. Base CIF is, I think, equivalent to "the common syntax and semantics of the CIF language" in Herbert's latest proposed principles.
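As a rough illustration of this partition, the following Python sketch treats the output of base-level parsing as a plain mapping of data names to values, and applies domain-level checks only for dictionaries that the file declares or that the processor requires (the activation rule stated in the principles below). Everything named here is hypothetical: the function `domain_problems`, the `_declared_dictionaries` key standing in for however a file declares conformance, and the toy dictionary with its single check.

```python
def domain_problems(data, dictionaries, required=()):
    """Return (dictionary, item) pairs that fail a domain-level check.

    `data` is assumed to be the result of base-level parsing, which is the
    same for every processor. Domain-level checks are applied only for
    dictionaries declared by the file or required by the caller.
    """
    declared = data.get("_declared_dictionaries", [])  # hypothetical key
    # Merge declared and required names, keeping order and dropping duplicates.
    active = list(dict.fromkeys([*declared, *required]))
    problems = []
    for name in active:
        # Dictionaries this processor does not know about are skipped.
        for item, check in dictionaries.get(name, {}).items():
            if item in data and not check(data[item]):
                problems.append((name, item))
    return problems


# A toy domain dictionary (hypothetical): one item with one constraint.
TOY_DICTIONARIES = {
    "toy_cell_dict": {
        "_cell_length_a": lambda v: isinstance(v, float) and v > 0,
    },
}
```

With this sketch, a file containing a negative _cell_length_a passes unchallenged at base level, and is flagged only once "toy_cell_dict" is declared by the file or passed in `required` by the processor.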
On that basis, I offer this re-couching of the proposed design principles:
Principles guiding development of Base CIF 2.0
CIF is a framework for exchanging and archiving scientific data, featuring a human-readable, machine-parseable, file format designed to serve as an exchange and archive medium. 'Base' CIF comprises the definitions and constraints that underlie CIF and apply to all CIF files; those aspects defining the CIF file format are documented in the CIF Syntax specification and the CIF Common Semantic Features specification.
Base CIF aims to remain as simple as possible by delegating considerations such as ontology, vocabulary, data relationships, and complex and rich data types to domain dictionaries and the DDL formalisms by which those dictionaries are defined. In the following, the phrase 'domain level' refers to such documents (though it is anticipated that only dictionaries, not DDLs, will be domain-specific). Definitions and constraints at domain level apply to a particular CIF file only as declared by that file or as required by a particular CIF processor in a particular context.
The design of base CIF 2.0 is guided by these principles:
1. A feature should be added to or changed in base CIF only if all of the following are satisfied:
(i) Implementation of the desired behavior by changes at the domain level is not feasible, or else such changes, while feasible, would significantly reduce human readability;
(ii) the change provides significant new functionality that is widely applicable to most scientific domains;
(iii) reliable transfer and archiving of data is not compromised;
(iv) there is no simpler way of achieving the desired behaviour;
(v) it has been shown possible to implement the change at a cost commensurate with its benefits, as demonstrated in part by a rough consensus and running code.
2. As long as the requirements in (1) are satisfied, base CIF should:
(i) behave in a way that is consistent with common usage
(ii) align with pre-existing standards where those standards provide the required behaviour. CIF 1.1 can be considered a pre-existing standard for CIF 2.0 in this context.
3. Non-technical issues should be dealt with in non-technical arenas.
4. Draft changes to base CIF will be made available on the IUCr website for public comment for a period of at least 6 weeks, following which COMCIFS voting members, after consideration of any objections raised, can vote to accept the change. A change will be accepted if 3/4 of COMCIFS voting members approve it.
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
Email Disclaimer: www.stjude.org/emaildisclaimer