Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains.

Matthew Towler towler at ccdc.cam.ac.uk
Thu Mar 10 09:22:18 GMT 2011


I agree with many of the points made by Peter.  I believe the decision on the byte order mark (BOM) should be made after considering what type of format CIF should be.  As I see it, there are two options.

1) An easily human-editable, text-based format, as CIF 1.1 is at present.  Users can quickly edit files using the text editor of their choice.  Tools such as enCIFer that validate the content are helpful to users, but entirely optional.  For this situation to continue with the new format, both the BOM and the encoding of Unicode characters need to be something standard that is already supported by a number of text editors.  IMO this means it must be a standard Unicode BOM (as described in the Unicode standard and at http://en.wikipedia.org/wiki/Byte_order_mark), with the file encoded as either UTF-8 or UTF-16.

2) A machine-editable or non-text format, such as XML or PDF or a text file with a non-standard encoding.  As an aside, I do realise that it is entirely practicable to edit XML by hand, but it is certainly more difficult than editing a CIF.  This would be the situation with a non-standard BOM or encoding, and would imply that users need special tools to edit the files.  I feel that the use of numeric encodings similar to HTML entity encodings (e.g. &#1234;) also falls into this category, because for a non-HTML file standard text editors will not understand the encoding scheme.
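
To illustrate what reading a file with a standard Unicode BOM involves under option (1), here is a minimal Python sketch.  It is not taken from any CIF specification or from enCIFer; the function names sniff_encoding and decode_text are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption made only for this example.

```python
import codecs

# Standard Unicode BOMs for the two encodings discussed above.
# UTF-32 is deliberately left out of this sketch, so a UTF-32 file
# would not be recognised correctly here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),    # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16"),   # FF FE
    (codecs.BOM_UTF16_BE, "utf-16"),   # FE FF
]


def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return a Python codec name based on a leading BOM, if any."""
    for bom, codec_name in _BOMS:
        if raw.startswith(bom):
            return codec_name
    # Assumption for this sketch: no BOM means plain UTF-8.
    return default


def decode_text(raw: bytes) -> str:
    """Decode raw file bytes; the chosen codecs strip the BOM themselves."""
    return raw.decode(sniff_encoding(raw))
```

A caller would read the file in binary mode and pass the bytes to decode_text.  Because the "utf-8-sig" and "utf-16" codecs consume the BOM, the returned string starts with the first real character of the file.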

A major disadvantage of (2) is that it will create a chicken-and-egg situation for the adoption of the new format.  Users will not be able to create the new files as there will be no tools, whilst providers of tools will be less inclined to develop them as the format is not in widespread use.  I expect a few enthusiasts would in this case produce some tools to get the community going, but these will likely be less fully featured than those already existing for CIF 1.1, imposing a barrier to adoption.
A custom format will also raise the likelihood of users using the wrong type of editor to adjust files, resulting in more syntactically incorrect files or in non-ASCII characters being corrupted by the use of different encodings.  Such errors will not help efforts to automatically store and curate crystallographic data.
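
The kind of corruption meant here can be reproduced in a few lines of Python.  This is only an illustrative sketch: the helper name misread is hypothetical, and Windows-1252 is assumed as the encoding the "wrong" editor uses.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


original = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE, often used for angstrom

# A UTF-8 file opened by an editor that assumes Windows-1252:
# the two UTF-8 bytes C3 85 are shown as two unrelated characters.
garbled = misread(original, "utf-8", "cp1252")

# If the editor then saves the file as UTF-8, the damage is written to disk:
# one character (2 bytes) has become two characters (5 bytes).
resaved = garbled.encode("utf-8")
```

The damage is reversible only while every step is known; once a file has passed through several editors with different assumptions, the original characters generally cannot be recovered automatically.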

In summary, I feel that creating a non-standard-standard will impede the usage of the new files, and therefore the best choice is to use standard Unicode files.

Matt

In case my signature does not make it apparent: although I work for the CCDC and have been involved in the development of enCIFer, these views are entirely my own as a scientific software developer, and have been neither endorsed nor approved by the CCDC.
