Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains
pm286 at cam.ac.uk
Fri Mar 4 14:07:29 GMT 2011
On Fri, Mar 4, 2011 at 11:47 AM, James Hester <jamesrhester at gmail.com> wrote:
> Thanks Peter for your comments. While you may not be a voting member
> of COMCIFS, you and other COMCIFS members fulfill an important
> advisory role and I would encourage everybody to take the opportunity
> to provide their perspectives.
> I assume you have no particular disagreement with the principles that
> you haven't commented on explicitly?
None at all - it's just that I haven't been as heavily engaged in CIF
recently and so wouldn't have meaningful comments.
> I've added some comments in response to your comments, inserted below:
> > I found the original ASCII escapes difficult/tedious for some code points
> > and would urge full Unicode support (with numeric values).
> I perhaps wasn't clear that we have already taken this step. The
> current CIF2 draft envisions full Unicode support using UTF-8
> encoding. Some provision has been made for allowing other encodings
> in the future. The point of the example was to show how this decision
> to adopt Unicode was justifiable in terms of these principles.
It's really important to manage encoding. I am completely supportive of
UTF-8 but we don't mandate it in CML as XML can manage different encodings.
The problem comes when non-conformant tools are used, and this is
particularly common with Microsoft tools which use CP-1252. This means that
for any code points above 127 a cut-and-paste is likely to corrupt them.
So if I have understood correctly all CIF documents MUST use UTF-8 and I'd
strongly support this. It might be useful to announce this in the document
(similarly to XML's <?xml version="1.0" encoding="UTF-8"?>). This is so that non-CIF tools
can recognise the encoding.
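
To illustrate the kind of corruption described above, here is a minimal Python sketch. It is not part of any CIF tooling; the helper names misread_as_cp1252 and is_valid_utf8 are hypothetical and chosen only for this example. It shows what happens when UTF-8 bytes are interpreted as CP-1252, and how a strict UTF-8 decode can flag text that was saved in CP-1252 instead.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were CP-1252.

    This mimics a tool that assumes CP-1252 when opening a UTF-8 file.
    """
    return text.encode("utf-8").decode("cp1252")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "caf\u00e9" is "cafe" with an acute accent on the final letter (U+00E9).
sample = "caf\u00e9"

# The two UTF-8 bytes of U+00E9 (0xC3 0xA9) are shown as two separate
# CP-1252 characters, so the word is silently garbled.
print(misread_as_cp1252(sample))

# Pure ASCII text (code points up to 127) survives unchanged, which is
# why the problem is easy to miss.
print(misread_as_cp1252("cafe"))

# Text saved as CP-1252 is not valid UTF-8, so a strict reader can reject it.
print(is_valid_utf8(sample.encode("cp1252")))
print(is_valid_utf8(sample.encode("utf-8")))
```

A strict check of this kind only detects one direction of the problem (CP-1252 bytes offered as UTF-8); UTF-8 bytes misread as CP-1252 usually decode without error, which is one argument for declaring the encoding in the file itself.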
It does put requirements on the toolchain. If an author receives a CIF with
high code points, pastes bits of it into (say) Windows and re-saves, there is
a good chance that characters will become corrupted. Anglophones often do
not realise this as they do not have diacritics and high code points. (I
applaud the removal of the separate escaped diacritic that CIF originally
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK