Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

Peter Murray-Rust pm286 at cam.ac.uk
Fri Mar 4 16:50:41 GMT 2011


On Fri, Mar 4, 2011 at 3:30 PM, Herbert J. Bernstein <yaya at bernstein-plus-sons.com> wrote:

> Dear Peter,
>
>    There is a misunderstanding here.  CIF2 documents are _not_
> required to use UTF-8.  The current draft proposal is written
> in terms of Unicode, but the proposal explicitly says:
>

My apologies again.

>
> "For compatibility with CIF1 behaviour, there is no formal
> restriction on the encoding of CIF2 files, providing they contain
> only code points from the ASCII range. If a CIF2 file contains
> characters equivalent to Unicode code points greater than U+007F
> (127 decimal), then the particular encoding used
> must either be UTF8 or algorithmically identifiable from the CIF2 file
> itself.
> Acceptable identification algorithms will be published as necessary
> as annexes to this standard (see description of magic code and
> encoding disambiguation in Change 1). Annexes notwithstanding, (i) a
> CIF2 file containing characters outside the ASCII range with no BOM
> and no disambiguation signature will be a UTF8 file, and (ii) a CIF2
> file containing characters outside the ASCII range with a valid UTF8
> or UTF16 BOM and no disambiguation signature, will be a Unicode file
> written in the indicated encoding."
>

This seems reasonable. I interpret it as meaning that a CIF1 document uses
only characters from U+0020 to U+007F, so it is compatible with any
encoding. Presumably processing software may then create higher Unicode
code points from appropriate escape sequences? In that case it should label
the output document with the given encoding.
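
As a rough sketch of how software might apply the default rules quoted
above (my own illustration in Python, not part of the draft): the function
names are hypothetical, and the "disambiguation signature" is ignored
entirely because its form has not been agreed. Only the UTF-8 and UTF-16
BOMs named in the quoted text are checked.

```python
import codecs
from typing import Tuple

# BOMs named in the quoted draft text (UTF-8 and UTF-16 only).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_cif2_encoding(data: bytes) -> Tuple[str, int]:
    """Return (encoding name, BOM length in bytes) for raw CIF2 bytes.

    Assumption: no disambiguation signature is present, so only the
    BOM and the "no BOM means UTF-8" defaults are applied.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Rule (ii): a valid BOM selects the indicated encoding.
            return name, len(bom)
    if all(byte < 0x80 for byte in data):
        # Only ASCII code points: no formal restriction on the encoding.
        return "ascii", 0
    # Rule (i): non-ASCII content with no BOM is taken to be UTF-8.
    return "utf-8", 0


def decode_cif2(data: bytes) -> str:
    """Decode raw bytes using the guessed encoding, dropping any BOM."""
    name, bom_length = guess_cif2_encoding(data)
    return data[bom_length:].decode(name)
```

A file that claims nothing and is not valid UTF-8 raises
UnicodeDecodeError here, which seems preferable to silently guessing.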

>
> We have not yet been able to come to agreement on the "disambiguation
> signatures to be used".  We have space reserved on the first line.
> Any suggestions?
>

I would suggest that encodings should only be taken from
http://www.iana.org/assignments/character-sets, and that the encoding
(including UTF-8) should be recorded in the first line of the file using
only ASCII characters so that other software can recognise it. I haven't
followed the discussions on syntax but would suggest

encoding="FooBar1234"

as being compatible with XML and therefore most easily human-interpretable.
I would not rely on the BOM as I expect that cut-and-paste will often
destroy it.
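
To make the suggestion concrete, here is a minimal Python sketch of how a
reader might pick up such a declaration. Everything in it is an assumption
for illustration: the function name is hypothetical, the exact first-line
layout is undecided, and Python's codec registry stands in for the IANA
list (it recognises many IANA names and aliases but is not the same
thing). It also assumes the first line is stored as plain ASCII bytes.

```python
import codecs
import re
from typing import Optional

# Hypothetical declaration form, modelled on encoding="FooBar1234".
_DECLARATION = re.compile(rb'encoding="([A-Za-z0-9._:+-]+)"')


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the codec name declared on the first line, or None.

    None is returned both when no declaration is found and when the
    declared name is unknown to Python's codec registry.
    """
    first_line = data.split(b"\n", 1)[0]
    match = _DECLARATION.search(first_line)
    if match is None:
        return None
    name = match.group(1).decode("ascii")
    try:
        # codecs.lookup() normalises aliases, e.g. "UTF-8" -> "utf-8".
        return codecs.lookup(name).name
    except LookupError:
        return None
```

Because the declaration is ordinary ASCII text rather than invisible
bytes, it should survive the cut-and-paste that tends to lose a BOM.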

I found http://www.opentag.com/xfaq_enc.htm quite a useful resource...

P.

>
>   Regards,
>      Herbert
>
>
> At 2:07 PM +0000 3/4/11, Peter Murray-Rust wrote:
> >On Fri, Mar 4, 2011 at 11:47 AM, James Hester
> ><jamesrhester at gmail.com> wrote:
> >
> >Thanks Peter for your comments.  While you may not be a voting member
> >of COMCIFS, you and other COMCIFS members fulfill an important
> >advisory role and I would encourage everybody to take the opportunity
> >to provide their perspectives.
> >
> >I assume you have no particular disagreement with the principles that
> >you haven't commented on explicitly?
> >
> >
> >None at all - it's just that I haven't been as heavily engaged in
> >CIF recently and so wouldn't have meaningful comments.
> >
> >
> >I've added some comments in response to your comments, inserted below:
> >
> >  >
> >  > I found the original ASCII escapes difficult/tedious for some code
> >  > points and would urge full Unicode support (with numeric values).
> >
> >I perhaps wasn't clear that we have already taken this step.  The
> >current CIF2 draft envisions full Unicode support using UTF-8
> >encoding.  Some provision has been made for allowing other encodings
> >in the future.  The point of the example was to show how this decision
> >to adopt Unicode was justifiable in terms of these principles.
> >
> >
> >It's really important to manage encoding. I am completely
> >supportive of UTF-8 but we don't mandate it in CML as XML can manage
> >different encodings. The problem comes when non-conformant tools are
> >used and this is particularly common with Microsoft tools which use
> >CP-1252. This means that for any code points above 127 a
> >cut-and-paste is likely to corrupt characters.
> >
> >So if I have understood correctly, all CIF documents MUST use UTF-8
> >and I'd strongly support this. It might be useful to announce this
> >in the document (similarly to XML's <?xml version="1.0"
> >encoding="UTF-8"?>). This is so that non-CIF tools can recognise
> >the encoding.
> >
> >It does put requirements on the toolchain. If an author receives a
> >CIF with high code points, pastes bits of it into (say) Windows and
> >re-saves, there is a good chance that characters will become
> >corrupted. Anglophones often do not realise this as they do not have
> >diacritics and high code points. (I applaud the removal of the
> >separate escaped diacritic that CIF originally had).
> >
> >P.
> >
> >
> >--
> >Peter Murray-Rust
> >Reader in Molecular Informatics
> >Unilever Centre, Dep. Of Chemistry
> >University of Cambridge
> >CB2 1EW, UK
> >+44-1223-763069
> >
> >_______________________________________________
> >comcifs mailing list
> >comcifs at iucr.org
> >http://scripts.iucr.org/mailman/listinfo/comcifs
>
>
> --
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya at dowling.edu
> =====================================================



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069