Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains
Herbert J. Bernstein
yaya at bernstein-plus-sons.com
Fri Mar 4 15:30:30 GMT 2011
There is a misunderstanding here. CIF2 documents are _not_
required to use UTF-8 in all cases. The current draft proposal is
written in terms of Unicode, but the proposal explicitly says:
"For compatibility with CIF1 behaviour, there is no formal
restriction on the encoding of CIF2 files, providing they contain
only code points from the ASCII range. If a CIF2 file contains
characters equivalent to Unicode code points greater than U+007F
(127 decimal), then the particular encoding used
must either be UTF8 or algorithmically identifiable from the CIF2 file itself.
Acceptable identification algorithms will be published as necessary
as annexes to this standard (see description of magic code and
encoding disambiguation in Change 1). Annexes notwithstanding, (i) a
CIF2 file containing characters outside the ASCII range with no BOM
and no disambiguation signature will be a UTF8 file, and (ii) a CIF2
file containing characters outside the ASCII range with a valid UTF8
or UTF16 BOM and no disambiguation signature, will be a Unicode file
written in the indicated encoding."
We have not yet been able to come to agreement on the "disambiguation
signatures" to be used. We have space reserved on the first line.
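To make the two fallback rules (i) and (ii) in the quoted text concrete,
here is a minimal Python sketch. The function name guess_cif2_encoding is
my own invention and is not part of any CIF software. Since the
disambiguation signatures are not yet agreed, the sketch deliberately
ignores the first-line signature and looks only at the bytes and the BOM.

```python
import codecs


def guess_cif2_encoding(data: bytes) -> str:
    """Return a Python codec name following rules (i) and (ii) of the draft.

    Hypothetical helper: it does not look for a disambiguation
    signature, because none has been agreed on yet.
    """
    # Only bytes in the ASCII range: the draft places no formal
    # restriction on the encoding, so plain ASCII decoding is enough.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Rule (ii): a valid UTF-8 or UTF-16 BOM names the encoding.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the byte order from the BOM
    # Rule (i): no BOM and no signature means the file is UTF-8.
    return "utf-8"
```

A caller would then do something like data.decode(guess_cif2_encoding(data)).
The sketch works on raw bytes, so a BOM-less UTF-16 file containing only
ASCII characters would be reported as "ascii"; a real reader would need
the agreed signature to handle such cases.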
At 2:07 PM +0000 3/4/11, Peter Murray-Rust wrote:
>On Fri, Mar 4, 2011 at 11:47 AM, James Hester
><jamesrhester at gmail.com> wrote:
>Thanks Peter for your comments. While you may not be a voting member
>of COMCIFS, you and other COMCIFS members fulfill an important
>advisory role and I would encourage everybody to take the opportunity
>to provide their perspectives.
>I assume you have no particular disagreement with the principles that
>you haven't commented on explicitly?
>None at all - it's just that I haven't been as heavily engaged in
>CIF recently and so wouldn't have meaningful comments.
>I've added some comments in response to your comments, inserted below:
> > I found the original ASCII escapes difficult/tedious for some code points
> > and would urge full Unicode support (with numeric values).
>I perhaps wasn't clear that we have already taken this step. The
>current CIF2 draft envisions full Unicode support using UTF-8
>encoding. Some provision has been made for allowing other encodings
>in the future. The point of the example was to show how this decision
>to adopt Unicode was justifiable in terms of these principles.
>It's really important to manage encoding. I am completely
>supportive of UTF-8 but we don't mandate it in CML as XML can manage
>different encodings. The problem comes when non-conformant tools are
>used and this is particularly common with Microsoft tools which use
>CP-1252. This means that for any code points above 127 a
>cut-and-paste is likely to corrupt characters.
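The corruption described here is easy to reproduce. The following Python
sketch is illustrative only; the helper names misread_as_cp1252 and
is_valid_utf8 are made up for this example. It shows both directions of
the problem: UTF-8 bytes read as CP-1252, and CP-1252 bytes handed to a
UTF-8 reader.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were CP-1252."""
    # errors="replace" covers the few byte values CP-1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII text is unaffected, which is why the problem is easy to overlook.
print(misread_as_cp1252("abc"))
# A single accented letter turns into two unrelated characters.
print(misread_as_cp1252("é"))
# The reverse case: a CP-1252 byte for "é" is not valid UTF-8 at all.
print(is_valid_utf8("é".encode("cp1252")))
```

Text restricted to code points below 128 passes through both paths
unchanged, so the damage only shows up once a file contains diacritics or
other non-ASCII characters.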
>So if I have understood correctly all CIF documents MUST use UTF-8
>and I'd strongly support this. It might be useful to announce this
>in the document (similarly to XML's <?xml version="1.0" encoding="UTF-8"?>). This is
>so that non-CIF tools can recognise the encoding.
>It does put requirements on the toolchain. If an author receives a
>CIF with high codepoints, pastes bits of it into (say) Windows and
>re-saves there is a good chance that characters will become
>corrupted. Anglophones often do not realise this as they do not have
>diacritics and high code points. (I applaud the removal of the
>separate escaped diacritic that CIF originally had).
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
yaya at dowling.edu