[Cif2-encoding] How we wrap this up

Bollinger, John C John.Bollinger at STJUDE.ORG
Mon Sep 27 15:45:51 BST 2010


Dear Colleagues,

On Monday, September 27, 2010 5:49 AM, Herbert Bernstein wrote:
>   Under the CIF2 specification with UTF8 in place of ASCII there is
>_no_ change in the use of elided ASCII sequences to represent non-ASCII
>characters until and unless the IUCr publications office decides that,
>for that particular application, they are ready to accept something
>new.

Absolutely correct.  The character elides of CIF1 are among its "common semantic features", which are expressly *not* part of the CIF1 format standard.  CIF2 explicitly omits them as well, leaving them in exactly the same place they are now.  None of this is at all affected by which encoding option we choose.

>   It is _only_ if you go forward with options 3, 4 or 5 that you
>are giving the green light to users to do precisely what you are
>concerned about -- using the unicode characters instead
>in possibly strange admixtures that nobody is ready to process.

The only way I can see that being true is if "text file" or "text" is intended to be interpreted, at least in part, as "containing only ASCII characters."  Is that your intended meaning, Herb?  Otherwise, CIF2's expansion to the full (more or less) Unicode character set opens the door for users to insert literal characters into their (conformant) CIFs in place of or in addition to elides, and none of the alternatives on the table change that.  What the various alternatives *do* affect is which byte-sequence representations of those characters will conform to CIF2, under which circumstances.
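To make the distinction between characters and their byte-sequence representations concrete, here is a small Python sketch. The choice of character and the list of encodings are mine, picked purely for illustration; nothing here is taken from the CIF2 draft or from any of the options under discussion.

```python
# Illustrative only: one Unicode character, several byte-level representations.
# The character and the encodings are arbitrary choices for this example.
ch = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

encodings = ["utf-8", "utf-16-le", "utf-16-be", "latin-1"]
encoded = {name: ch.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Show each encoding's bytes as space-separated hex pairs.
    print(f"{name:10s} {raw.hex(' ')}")

# Decoding with the wrong assumption silently yields different characters.
misread = encoded["utf-8"].decode("latin-1")
```

The same character comes out as one or two bytes depending on the encoding, and a reader that assumes the wrong one gets different text without any error being raised. That is the part the encoding options actually govern.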

Independent of this particular issue, my greatest problem with options (1) and (2) is the imprecision of describing CIF2 simply as "text".  That this served well enough for CIF1 is irrelevant; CIF2's character set lends much more importance and impact to the interpretation of this aspect of the spec.  I see two, maybe even three, viable and functionally distinct possible definitions.  Would any of the proponents of that wording care to advance a definition of that term as it is intended to be interpreted in a CIF2 context?  This is substantially equivalent to James's open question, so no need to answer both.

[...]

>   My apologies to James, who I know is trying to do what he believes
>to be right, but I believe James has things backwards -- the "deep
>breath" is provided by my proposal -- taking the time to properly engineer
>the use of the extra characters UTF8 allows us to discuss clearly,
>while James' push for an immediate prescriptive use of UTF8 with
>prescriptions that differ drastically from what has been adopted
>by all other frameworks (HTML, XML, python, etc.) in ways that
>are untested and unsupported by most existing software is
>the untimely rush to judgement.

[...]

I doubt any of us could disagree that there is an engineering challenge here, but I have to agree with James that the only viable way to leave this area open for further development is to be explicitly restrictive now, à la option (3) or (4).  Not even my most preferred option (5) allows sufficient latitude for future extension without potentially invalidating some CIF2 CIFs and programs.

Furthermore, I don't think that "all other frameworks" adopt an entirely uniform approach, nor one that is necessarily equivalent to option (1) or (2).  For example, Sun's various implementations of the Java compiler seem to use "local" (in my sense of the term) unless the user passes an option to tell it otherwise.  XML and XHTML use UTF-8 unless a different encoding is explicitly named in the file, identified via a byte-order mark, or otherwise communicated at a higher level.  HTML tends to rely on a higher-level protocol to communicate encoding, but provides a mechanism for communicating it in-line.  ALL of the CIF2 options currently on the table share some characteristics with one or more of those.
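As a rough illustration of the XML-style approach described above, the following Python sketch picks an encoding from higher-level information, then a byte-order mark, then an in-file declaration, and otherwise falls back to UTF-8. The function name, the precedence order, and the simplified declaration pattern are my own assumptions for the example; this is not the XML specification's detection procedure, and it handles only the UTF-8 and UTF-16 byte-order marks.

```python
import codecs
import re
from typing import Optional

# Simplified pattern for an encoding name in an XML-style declaration.
# Only meaningful for ASCII-compatible byte streams.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def guess_xml_style_encoding(data: bytes, external: Optional[str] = None) -> str:
    """Hypothetical helper: choose an encoding name for a byte stream."""
    # 1. Encoding communicated at a higher level (e.g. by a transport protocol).
    if external:
        return external
    # 2. Byte-order mark at the start of the data.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 3. Encoding named explicitly inside the file.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # 4. Nothing else to go on: default to UTF-8.
    return "utf-8"
```

A caller would then decode with the returned name, for example `data.decode(guess_xml_style_encoding(data))`. A "local" scheme of the kind the Java compiler appears to use would replace step 4 with the platform default and skip steps 2 and 3.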


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


Email Disclaimer:  www.stjude.org/emaildisclaimer


