[Cif2-encoding] The discussion so far

Thu Aug 5 02:44:13 BST 2010

Thanks, John, for this summary.  I think it is a fair description of the
current state of our deliberations.  I am currently drafting a response to
your previous email which I hope to be able to send fairly soon.

On Wed, Aug 4, 2010 at 12:58 AM, Bollinger, John C <
John.Bollinger at stjude.org> wrote:

>  Thanks, Brian, for creating this list.
>
>
>
> Since no one else has had the combination of time, energy, and inclination
> to do so, I’ll open with a summary of the state of the CIF 2.0 character
> encoding discussion so far, as it currently stands at its original location
> on the DDLm group <http://www.iucr.org/__data/iucr/lists/ddlm-group/>.
> Specifically, the previous summary and the discussion that proceeded from it
> can be found at
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00744.html and
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00690.html, though
> some of the most recent messages seem not to be available at present via the
> web interface.
>
>
>
> The controversy derives from CIF 2.0's expansion of the character set to all
> of Unicode.  It is magnified by CIF 1.0's explicit self-description as an
> encoding-independent text format, and by the accumulated body of CIF
> software and author practices that rely on that text orientation.  There has
> been considerable debate about what it would mean for CIF 2.0 to be a text
> format *vs*. a binary format, and the relative advantages and
> disadvantages of each.  Among the points covered were:
>
>
>
> 1) A 'text' format implies that CIF content may comply with local,
> locale-specific conventions for electronic text representation, including
> details such as line termination conventions and, especially, *character
> encoding*.  Such files are suitable input for general-purpose text tools
> such as text editors, text extraction utilities, and text indexers.
> Alternatively, a conformant text CIF might be expressed according to some
> other convention suitable for a particular application or foreign
> environment.  Because conventions differ, correctly archiving text or moving
> it between environments requires accounting for the text conventions in use,
> and may involve conversions such as line terminator changes and
> transcoding.  This is the CIF 1.0 position, though CIF1's restricted
> character set significantly reduces the impact of character encoding
> considerations relative to CIF2.
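
To make the "conversions such as line terminator changes and transcoding" in point 1 concrete, here is a minimal Python sketch. The function name, the sample content, and the choice of Latin-1 as the source encoding are illustrative assumptions only; nothing here is taken from the CIF specifications or from any proposal in this thread.

```python
def convert_text_conventions(data: bytes, src_encoding: str,
                             dst_encoding: str = "utf-8",
                             newline: str = "\n") -> bytes:
    """Re-express the same text under different text conventions.

    Hypothetical helper: the caller must already know src_encoding,
    which is exactly the information a pure 'text' format does not carry.
    """
    # Decode with the source convention; raises UnicodeDecodeError if the
    # bytes are not valid in src_encoding.
    text = data.decode(src_encoding)
    # Normalize CRLF and bare CR line terminators to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Then apply the requested target line terminator.
    if newline != "\n":
        text = text.replace("\n", newline)
    return text.encode(dst_encoding)


# A made-up fragment written on a system using Latin-1 and CRLF.
original = "_cell.length_a 5.43\r\n# \u00c5ngstr\u00f6m\r\n".encode("latin-1")
converted = convert_text_conventions(original, "latin-1")
```

The text is unchanged by the conversion, but the bytes are not, which is why archiving or moving such a file requires knowing which conventions were in use.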
>
>
>
> 2) A 'binary' format is anything else, but in this context, the key
> characteristic of binary-CIF2 proposals is that they add to the text
> specification a specification for text serialization to byte-oriented media,
> such as disks and networks.  In particular, one strongly advocated position
> in the CIF 2.0 standardization discussion is that CIF 2.0 should require
> serialization of the underlying CIF text according to the UTF-8 character
> encoding scheme.  This would be a text-like binary format, in that some text
> tools can handle UTF-8 encoded text (sometimes requiring a little
> persuasion), and therefore could be used to read, modify, and write
> binary-CIF2 files.
>
>
>
> 3) The many specific issues and arguments that have been raised mostly fall
> into one or both of two general areas:
>
>
>
> 3a) *reliability*, by which we mean that a CIF consumer should have
> justifiably high confidence that he is interpreting CIF data in the way the
> CIF producer intended, and
>
>
>
> 3b) *usability*, by which we mean that human authors, and to a lesser
> extent, software, should be able to manipulate CIF2 files as they are
> accustomed to manipulating CIF1 files, e.g. using the default configurations
> of their systems’ text editors.  In addition to general usability, arguments
> in this category include some appealing to respect for scientists of many
> nationalities, and similar ones appealing to freedom / liberty.
>
>
>
> Furthermore, a few of the points have appealed to
>
>
>
> 3c) *practicality*, by which we mean that CIF stakeholders should be able
> to use the specification effectively.  This aspect is subordinate to
> reliability and usability, but it cuts across both.  For the most part, it
> relates to the ability to develop software and practices that address the
> likely real-world usage (and misusage) of the standard.
>
>
>
> 4) The group as a whole appears to have agreed that UTF-8 is a highly
> suitable encoding for CIF2.  It can encode the entire Unicode code space, it
> is a superset of ASCII, it can be recognized heuristically with low
> probability of error, and it is widely implemented.  These characteristics
> yield high reliability at the cost of some usability.  The debate is not
> about whether UTF-8 should be used, but rather about whether the standard
> should forbid use of other encodings.  Essentially, this is a recasting of
> the text *vs*. binary debate.
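
Two of the properties listed in point 4 (that UTF-8 is a superset of ASCII, and that it can be recognized heuristically) can be illustrated with a short Python sketch. The function name and sample strings are my own; the check is simply a strict decode, which is one common heuristic and not something the group has specified.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: do these bytes decode as well-formed UTF-8?

    A True result does not prove the producer used UTF-8, but text in
    most other 8-bit encodings that contains non-ASCII characters rarely
    happens to form valid UTF-8 multi-byte sequences.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


ascii_bytes = "data_example\n".encode("ascii")
utf8_bytes = "caf\u00e9\n".encode("utf-8")      # e-acute as two bytes
latin1_bytes = "caf\u00e9\n".encode("latin-1")  # e-acute as one byte, 0xE9

results = [looks_like_utf8(b) for b in (ascii_bytes, utf8_bytes, latin1_bytes)]
```

Pure ASCII input always passes, because every ASCII byte sequence is also valid UTF-8. The Latin-1 sample fails because 0xE9 announces a multi-byte sequence that the following byte does not continue.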
>
>
>
> 5) Inasmuch as consensus on the issues described above has not yet been
> reached and does not appear likely, the group has issued a call for comments
> from a wider group of stakeholders.  No results of that call have yet been
> reported back to the group.
>
>
>
> 6) In the interim, the discussion has moved toward finding middle ground.
> In particular, James Hester asked:
>
>
>
> >If we consider CIF as text as the overriding priority:
> >
> >1. How do we then make exchanging and storing files according to text
> >conventions sufficiently reliable for the purposes of CIF?  How far are we
> >prepared to compromise?
> >
> >If we consider reliable exchange of information as the top priority:
> >
> >2. How do we then make CIFs sufficiently accessible to text-based tools?
> >How far are we prepared to compromise?
>
> A short series of proposed schemes for CIF exchange and storage proceeded
> from that call:
>
>
>
> 7) In response to question (2), James offered a scheme (A) that primarily
> would relax the explicit specification of UTF-8 into a set of
> characteristics that a CIF encoding would need to satisfy.  His
> characteristics would need to be further pared down or relaxed in order to
> permit encodings other than UTF-8 in practice.  This scheme will be
> reproduced in a separate e-mail, as it is not currently available from the
> DDLm-list web archive.
>
>
>
> 8) In response to question (1), John Bollinger offered a scheme (B) that
> retains the text character of CIF2, and relies on labeling when wanted or
> needed to convey text conventions, and on hashing to provide verification
> and reliability.  This scheme will be reproduced in a separate e-mail, as it
> is not currently available from the DDLm-list web archive.
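
Scheme B itself is not reproduced in this message, so the following is not that proposal. It is only a generic Python sketch of how a label (a declared encoding) and a hash might work together to verify text independently of its byte serialization. The function name, the choice of SHA-256, and the canonical form (LF line terminators, UTF-8) are all assumptions made for illustration.

```python
import hashlib


def text_digest(data: bytes, declared_encoding: str) -> str:
    """Hash the text content rather than the raw bytes.

    Hypothetical helper: decode using the declared (labelled) encoding,
    reduce to one canonical form, and hash that form, so that the same
    text gives the same digest however it was serialized.
    """
    text = data.decode(declared_encoding)
    # Canonical form assumed here: LF line terminators, UTF-8 bytes.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


text = "data_example\n_note 'caf\u00e9'\n"
as_utf8 = text.encode("utf-8")
as_utf16_crlf = text.replace("\n", "\r\n").encode("utf-16")

same_text = text_digest(as_utf8, "utf-8") == text_digest(as_utf16_crlf, "utf-16")
mislabelled = text_digest(as_utf8, "latin-1") == text_digest(as_utf8, "utf-8")
```

Under these assumptions, two serializations of the same text agree on the digest, while decoding with the wrong label produces different text and therefore a different digest, which is how a hash could expose a misinterpreted encoding.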
>
>
>
> A limited amount of additional discussion proceeded from these proposed
> exchange and archiving schemes, and that’s where we currently stand.
>
>
>
>
>
> Regards,
>
>
>
> John
>
>
>
> --
>
> John C. Bollinger, Ph.D.
>
> Department of Structural Biology
>
> St. Jude Children's Research Hospital
>
> ------------------------------
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>
>

-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148