Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains...
Herbert J. Bernstein
yaya at bernstein-plus-sons.com
Thu Mar 10 15:49:42 GMT 2011
Dear Colleagues,
Unfortunately, John Bollinger, in his desire to help clarify the
current CIF2 proposal with respect to encoding, has overstated
the rules in his summary. What the change document currently says
is:
"CIF2 files are standard variable length plain text files, which for
compatibility with older processing systems will have a maximum line
length of 2048 characters. As discussed above and below, however, there
are some restrictions on the character set for token delimiters,
separators and data names. For compatibility with CIF1 behaviour, there is
no formal restriction on the encoding of CIF2 files, providing they
contain only code points from the ASCII range. If a CIF2 file contains
characters equivalent to Unicode code points greater than U+007F (127
decimal), then the particular encoding used must either be UTF8 or
algorithmically identifiable from the CIF2 file itself. Acceptable
identification algorithms will be published as necessary as annexes to
this standard (see description of magic code and encoding disambiguation
in Change 1). Annexes notwithstanding, (i) a CIF2 file containing
characters outside the ASCII range with no BOM and no disambiguation
signature will be a UTF8 file, and (ii) a CIF2 file containing characters
outside the ASCII range with a valid UTF8 or UTF16 BOM and no
disambiguation signature, will be a Unicode file written in the indicated
encoding.
In keeping with XML restrictions we allow the characters
U+0009 U+000A U+000D
U+0020 -- U+007E
U+00A0 -- U+D7FF
U+E000 -- U+FDCF
U+FDF0 -- U+FFFD
U+10000 -- U+10FFFD
In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is
any hexadecimal digit are disallowed. Unicode reserves the code points
E000 -- F8FF for private use. The IUCr and only the IUCr may specify what
characters are assigned to these code points in the context of CIF2.
Reasoning: There is growing demand for the wider character set afforded by
Unicode to be made available in applications, especially those where
internationalisation is an issue."
=====================================================
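As an illustration only, the character restrictions quoted above can be
sketched as a small Python check. The names ALLOWED_RANGES, is_allowed and
first_disallowed are hypothetical and appear nowhere in the change
document. The sketch also assumes that "U+xFFFE or U+xFFFF" means the last
two code points of every plane, which is one reading of the quoted text.

```python
# Allowed code point ranges, copied from the quoted change-document text.
ALLOWED_RANGES = [
    (0x0009, 0x0009), (0x000A, 0x000A), (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]


def is_allowed(cp: int) -> bool:
    """Return True if code point cp is permitted by the quoted rules."""
    # U+FEFF is explicitly disallowed even though it lies inside
    # the U+FDF0 -- U+FFFD range.
    if cp == 0xFEFF:
        return False
    # Assumption: U+xFFFE / U+xFFFF means the last two code points
    # of each plane, so test the low 16 bits.
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def first_disallowed(text: str):
    """Return (index, character) of the first disallowed character, or None."""
    for i, ch in enumerate(text):
        if not is_allowed(ord(ch)):
            return i, ch
    return None
```

For example, first_disallowed("data_x\x0b") reports the vertical tab at
index 6, since U+000B is not among the listed control characters.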
In particular, the statement
> The only CIF 2.0 mechanisms currently supported for
> including literal characters that have no ASCII mapping are (1) to
> encode the whole document in UTF-8 with or without a UTF-8 BOM, or (2)
> to encode the whole document in UTF-16 with a UTF-16 BOM.
implies the encoding issue is settled. That is not what the draft change
document says, and what is in the change document is certainly not the
last word on encodings.
I would urge those who have ideas on the subject to feel free to express
them, especially because the specification of "disambiguation signatures"
is an open, unresolved issue in the change document, and the concept of a
Unicode BOM admits a much wider range of encodings than just UTF-8 and
UTF-16.
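To make the point about BOMs concrete, here is a rough Python sketch of
BOM sniffing combined with the two fallback rules in the quoted draft
text. The function names sniff_bom and guess_encoding are hypothetical.
The draft names only UTF-8 and UTF-16 BOMs; the UTF-32 entries are
included purely to show that the BOM concept extends further.
Disambiguation signatures are not handled at all, since their form is
still unspecified.

```python
import codecs

# UTF-32 BOMs are checked first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def guess_encoding(data: bytes) -> str:
    """Guess an encoding from the BOM and the draft's two fallback rules."""
    from_bom = sniff_bom(data)
    if from_bom is not None:
        return from_bom
    # Simplification: all bytes below 0x80 is treated as "ASCII range only".
    if all(b < 0x80 for b in data):
        return "ascii"
    # Non-ASCII content, no BOM, no signature: the draft says UTF-8.
    return "utf-8"
```

A UTF-16 LE file whose first character after the BOM is U+0000 would be
misread as UTF-32 LE by this sketch; U+0000 is not among the characters
the draft allows, so the case should not arise in a conforming file.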
I have found what has been said thus far very helpful and educational, and
hope that the discussion will continue.
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya at dowling.edu
=====================================================
On Thu, 10 Mar 2011, Bollinger, John C wrote:
>
> On Thursday, March 10, 2011 3:22 AM, Matthew Towler wrote:
>
>> I will agree with many of the points made by Peter. I believe the
>> decision on the byte order markings (BOM) should be made having
>> considered what type of format CIF should be. As I see it there are
>> two options.
>>
>> 1) An easily human editable, text based format, as CIF 1.1 is
>> presently. [...]
>>
>> 2) A machine editable or non-text format, such as XML or PDF or a text
>> file with non-standard encoding. [...]
>
> [...]
>
>> In summary, I feel that creating a non-standard-standard will impede
>> the usage of the new files, and therefore the best choice is to use
>> standard Unicode files.
>
> I would like to point out that the DDLm technical subcommittee devoted
> considerable time and energy to character encoding and related topics,
> to the extent that we prevailed upon IUCr to provide a discussion list
> specifically for that contentious debate. You will find the early part
> of the discussion among the archives of the main DDLm list
> (http://www.iucr.org/__data/iucr/lists/ddlm-group/), and you will find
> the later, larger part of the discussion, including the genesis of our
> ultimate compromise, in the archives of the cif2-encoding list
> (http://www.iucr.org/__data/iucr/lists/cif2-encoding/).
>
> A specification documenting the differences between CIF 1.1 and CIF 2.0
> (http://www.iucr.org/__data/assets/pdf_file/0004/47434/cif2_syntax_changes_jrh20101115.pdf)
> was previously approved by COMCIFS. Inasmuch as the CIF 2.0 syntax
> discussion continues, however, the changes already approved could yet be
> modified. I encourage those interested in the topic of character
> encoding to read the "Change 2" section of the changes document to find
> how CIF 2.0, as currently constituted, will address those issues. To
> summarize, however, the approved CIF 2.0 changes attempt to address the
> text-based historical legacy of CIF -- recognizing that "text" is a
> poorly-defined and system-dependent term -- while simultaneously looking
> forward to Unicode. The only CIF 2.0 mechanisms currently supported for
> including literal characters that have no ASCII mapping are (1) to
> encode the whole document in UTF-8 with or without a UTF-8 BOM, or (2)
> to encode the whole document in UTF-16 with a UTF-16 BOM.
>
> I am certain that COMCIFS would be interested in hearing from anyone who
> believes the compromise to be flawed or unreasonable, or that it would
> hinder adoption of CIF 2.0. I do hope to avoid repeating the debate
> that the DDLm group already conducted on the topic, however.
>
>
> Regards,
>
> John
>
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> comcifs mailing list
> comcifs at iucr.org
> http://scripts.iucr.org/mailman/listinfo/comcifs
>