Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains.

Thu Mar 10 14:58:16 GMT 2011

On Thursday, March 10, 2011 3:22 AM, Matthew Towler wrote:

>I will agree with many of the points made by Peter.  I believe the decision on the byte order markings (BOM) should be made having considered what type of format CIF should be.  As I see it there are two options.
>
>1) An easily human editable, text based format, as CIF 1.1 is presently.  [...]
>
>2) A machine editable or non-text format, such as XML or PDF or a text file with non-standard encoding.  [...]

[...]

>In summary, I feel that creating a non-standard-standard will impede the usage of the new files, and therefore the best choice is to use standard Unicode files.

I would like to point out that the DDLm technical subcommittee devoted considerable time and energy to character encoding and related topics, to the extent that we prevailed upon IUCr to provide a discussion list specifically for that contentious debate.  You will find the early part of the discussion among the archives of the main DDLm list (http://www.iucr.org/__data/iucr/lists/ddlm-group/), and you will find the later, larger part of the discussion, including the genesis of our ultimate compromise, in the archives of the cif2-encoding list (http://www.iucr.org/__data/iucr/lists/cif2-encoding/).

A specification documenting the differences between CIF 1.1 and CIF 2.0 (http://www.iucr.org/__data/assets/pdf_file/0004/47434/cif2_syntax_changes_jrh20101115.pdf) was previously approved by COMCIFS.  Inasmuch as the CIF 2.0 syntax discussion continues, however, the changes already approved could yet be modified.  I encourage those interested in the topic of character encoding to read the "Change 2" section of the changes document to find how CIF 2.0, as currently constituted, will address those issues.  To summarize, however, the approved CIF 2.0 changes attempt to address the text-based historical legacy of CIF -- recognizing that "text" is a poorly-defined and system-dependent term -- while simultaneously looking forward to Unicode.  The only CIF 2.0 mechanisms currently supported for including literal characters that have no ASCII mapping are (1) to encode the whole document in UTF-8 with or without a UTF-8 BOM, or (2) to encode the whole document in UTF-16 with a UTF-16 BOM.
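As a rough illustration of the two mechanisms just summarized, the following Python sketch shows how a reader might choose a decoder from the leading bytes of a file. It is only a sketch under the assumptions stated in this message (UTF-8 with or without a BOM, or UTF-16 with a BOM); the function name decode_cif2_bytes is hypothetical and is not taken from the changes document or from any existing CIF software. It does not validate CIF syntax, and the "Change 2" section remains the authority on the actual rules.

```python
import codecs


def decode_cif2_bytes(data: bytes) -> str:
    """Decode raw file bytes using the BOM rules summarized above.

    Hypothetical helper for illustration only; it covers just the two
    encodings named in this message and performs no CIF validation.
    """
    # Mechanism (1a): UTF-8 with a BOM. Strip the BOM, then decode.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")

    # Mechanism (2): UTF-16 with a BOM. Python's "utf-16" codec reads
    # the BOM to determine byte order and removes it from the result.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")

    # Mechanism (1b): no BOM, so assume UTF-8 (of which ASCII is a subset).
    # Bytes that are not valid UTF-8 raise UnicodeDecodeError.
    return data.decode("utf-8")
```

A caller would read the file in binary mode and pass the bytes in, for example decode_cif2_bytes(open(path, "rb").read()), where path is whatever file is being examined.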

I am certain that COMCIFS would be interested in hearing from anyone who believes the compromise to be flawed or unreasonable, or that it would hinder adoption of CIF 2.0.  I do hope to avoid repeating the debate that the DDLm group already conducted on the topic, however.

Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
