Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains
pm286 at cam.ac.uk
Fri Mar 4 08:25:10 GMT 2011
I add some comments arising out of my own experience with XML/CML, which may
be useful. I don't think I am a full member of COMCIFS, so feel free to
ignore any or all of them. I comment after the significant paragraphs.
On Fri, Mar 4, 2011 at 6:03 AM, James Hester <jamesrhester at gmail.com> wrote:
> 1. A feature should only be added to CIF syntax if all of the
> following are satisfied:
> (i) implementation or use of equivalent behaviour at dictionary level
> is either significantly more cumbersome or not possible;
> (ii) the feature provides significant new functionality that is widely
> applicable to most scientific domains
> (iii) reliable transfer and archiving of data is not compromised
> (iv) there is no simpler way of achieving the desired behaviour
I would add:
* a feature should only be added if it has been shown possible to implement
it with "reasonable ease" (the IETF maxim: "rough consensus and running code").
> Example 2: Unicode support in CIF2. This is broadly useful, given the
> international nature of science and range of symbols used in
> scientific papers. It could have been implemented in dictionaries
> using ASCII escapes, but this would have been cumbersome to use, so it
> satisfies Principle 1. We have adopted Unicode (rather than created
> our own international character set) and copied the XML character
> ranges (Principle 2)
I found the original ASCII escapes difficult/tedious for some code points
and would urge full Unicode support (with numeric values).
> Example 3: Space-separated lists in CIF2. Lists, especially matrices,
> are important in science and cumbersome to implement in dictionaries
> (but possible) so lists satisfy principle 1. Using space separators
> is probably less mainstream than using commas - if we had chosen to
> use both we would have definitely satisfied rule 2. I think rule 2
> would argue that we should allow both space and comma, but principle
> 1(iv) would argue choosing one or the other.
We use whitespace-separated strings (i.e. separated by spaces, newlines,
tabs, etc.) by default in CML for numeric arrays and matrices. It works well.
However, for lists of general strings, dates, etc. we allow the author to
choose a delimiter which they know is not present in the strings.
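To make the two conventions concrete, here is a minimal Python sketch. It is
not CML's actual parser; the function names and the sample data are
hypothetical, and the only behaviour assumed is what is described above:
numeric arrays split on any run of whitespace, and string lists split on a
delimiter the author has chosen.

```python
def parse_numeric_array(text):
    """Split on any run of whitespace (space, tab, newline) and convert to floats."""
    # str.split() with no argument treats consecutive whitespace as one separator.
    return [float(token) for token in text.split()]


def parse_string_list(text, delimiter):
    """Split a list of general strings on an author-chosen delimiter."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    return [item.strip() for item in text.split(delimiter)]


# A 3x3 matrix written with a mixture of spaces, tabs and newlines.
matrix_text = "1.0 0.0 0.0\n0.0\t1.0 0.0\n0.0 0.0 1.0"
values = parse_numeric_array(matrix_text)

# The strings contain commas and spaces, so the author picks "|" instead.
names = parse_string_list("Smith, J.|Jones, A. B.|Brown, C.", "|")
```

The point of the second function is that the author, not the format, is
responsible for picking a delimiter that cannot occur in the data.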
Some locales (e.g. DE) use a comma as the decimal separator, and this is often
applied by the operating system. Thus 1.23,3.45 could be emitted as 1,23,3,45.
It's possible but tedious to refactor code to always use a period as the
decimal point.
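The ambiguity can be illustrated with a small Python sketch. The emit()
function below is hypothetical and only simulates a locale that writes a
decimal comma; it does not touch the real system locale.

```python
def emit(values, decimal_point=".", separator=","):
    """Write floats as text, simulating a locale-specific decimal point."""
    # repr() of a Python float always uses "." as the decimal point,
    # so we substitute the simulated locale's character afterwards.
    return separator.join(repr(v).replace(".", decimal_point) for v in values)


values = [1.23, 3.45]

unambiguous = emit(values)                   # period decimal, comma separator
ambiguous = emit(values, decimal_point=",")  # decimal comma, comma separator

# A reader splitting on commas can no longer recover the two numbers.
tokens = ambiguous.split(",")

# With a whitespace separator the decimal comma no longer collides.
whitespace_separated = emit(values, decimal_point=",", separator=" ")
```

Here two values turn into four tokens once the decimal comma collides with
the list separator, whereas the whitespace-separated form still splits into
two items.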
I would also support the use of dictionaries for extending human and machine
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK