Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

Fri Mar 4 14:07:29 GMT 2011

On Fri, Mar 4, 2011 at 11:47 AM, James Hester <jamesrhester at gmail.com> wrote:

> Thanks Peter for your comments.  While you may not be a voting member
> of COMCIFS, you and other COMCIFS members fulfill an important
> advisory role and I would encourage everybody to take the opportunity
> to provide their perspectives.
>
> I assume you have no particular disagreement with the principles that
> you haven't commented on explicitly?
>

None at all - it's just that I haven't been as heavily engaged in CIF
recently and so wouldn't have meaningful comments.

>
> I've added some comments in response to your comments, inserted below:
> >
> > I found the original ASCII escapes difficult/tedious for some code points
> > and would urge full Unicode support (with numeric values).
>
> I perhaps wasn't clear that we have already taken this step.  The
> current CIF2 draft envisions full Unicode support using UTF-8
> encoding.  Some provision has been made for allowing other encodings
> in the future.  The point of the example was to show how this decision
> to adopt Unicode was justifiable in terms of these principles.
>
>
It's really important to manage encoding. I am completely supportive of
UTF-8, but we don't mandate it in CML because XML can manage different
encodings. The problem comes when non-conformant tools are used, which is
particularly common with Microsoft tools that use CP-1252. This means that
for any code point above 127 a cut-and-paste is likely to corrupt
characters.
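
As a rough illustration of the corruption described above, here is a minimal
Python sketch (not taken from any CIF or CML tool; the variable names are
purely illustrative). It shows what happens when UTF-8 bytes are read as
CP-1252, and when CP-1252 bytes are handed to a strict UTF-8 reader:

```python
# Illustrative only: one character above code point 127.
original = "\u00e9"                      # e with acute accent, U+00E9

# Correctly written as UTF-8, the character becomes two bytes.
utf8_bytes = original.encode("utf-8")    # b'\xc3\xa9'

# A tool that assumes CP-1252 turns those two bytes into two characters.
misread = utf8_bytes.decode("cp1252")    # U+00C3 followed by U+00A9

# The opposite mistake: CP-1252 bytes passed to a strict UTF-8 decoder.
cp1252_bytes = original.encode("cp1252") # b'\xe9'
try:
    cp1252_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# Pure ASCII text is identical in both encodings, so it survives either way.
ascii_survives = "data_block".encode("utf-8").decode("cp1252") == "data_block"
```

Because ASCII-only text is unaffected, a file can pass through such a tool
many times before anyone notices a problem.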

So if I have understood correctly, all CIF documents MUST use UTF-8, and I'd
strongly support this. It might be useful to announce this in the document
(similarly to XML's <?xml version="1.0" encoding="UTF-8"?>), so that non-CIF
tools can recognise the encoding.

It does put requirements on the toolchain. If an author receives a CIF with
high code points, pastes bits of it into (say) a Windows application and
re-saves, there is a good chance that characters will become corrupted.
Anglophones often do not realise this as they do not use diacritics and
high code points. (I applaud the removal of the separate escaped diacritics
that CIF originally had.)
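
One way a toolchain might guard against this is to check a file's bytes
before accepting or re-saving them. The following Python sketch is only an
assumption about how such a check could look; the function names
find_invalid_utf8 and non_ascii_characters are hypothetical and do not come
from any existing CIF software:

```python
from typing import List, Optional, Tuple


def find_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def non_ascii_characters(text: str) -> List[Tuple[int, str]]:
    """List (index, 'U+XXXX') for every code point above 127."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ord(ch) > 127]


# Hypothetical sample value containing one character above code point 127.
sample = "Z\u00fcrich"

good_bytes = sample.encode("utf-8")    # what a conformant tool would write
bad_bytes = sample.encode("cp1252")    # what a CP-1252 tool would write

good_offset = find_invalid_utf8(good_bytes)   # None: valid UTF-8
bad_offset = find_invalid_utf8(bad_bytes)     # 1: the byte 0xFC is rejected
high_points = non_ascii_characters(sample)    # [(1, 'U+00FC')]
```

A check like this cannot repair a damaged file, but it can flag the first
offending byte so that the author knows which tool in the chain to suspect.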

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069