Accent escape sequences

Tue Mar 6 15:41:22 GMT 2007

James Hester wrote:
> Thinking about the mechanics of implementing these suggestions, it would
> make sense to define different types of text field using the
> _item_type_list.code DDL2 attribute.  Currently mmCIF appears to have
> only 'text' for multiline data, and imgCIF has 'binary' in addition to
> this.  A new type (e.g. 'mime') could specify a regex that matches a
> mime header, something like what is done for the imgCIF 'binary' type.  
Should these follow standard MIME types for better standardization, or
maybe just have an optional subtype code specific to CIF to allow for
better specialization? I would go with the latter.
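
To make the header idea concrete, here is a minimal sketch of how a reader might sniff an optional leading "Content-Type:" line in a text field and fall back to plain text otherwise. The function name sniff_type and the fallback default are hypothetical and appear in no CIF dictionary; the pattern only illustrates the kind of regex a 'mime' type code could carry.

```python
import re

# Matches a leading "Content-Type: type/subtype" line.
# MIME header names are case-insensitive, hence IGNORECASE.
_HEADER = re.compile(r"\AContent-Type:[ \t]*([\w.+-]+)/([\w.+-]+)", re.IGNORECASE)


def sniff_type(text, default=("text", "plain")):
    """Return (type, subtype) from a leading header, or the default if absent.

    Hypothetical helper: assumes the header, when present, is the first
    non-empty line of the text field.
    """
    match = _HEADER.match(text.lstrip("\r\n"))
    if match is None:
        return default
    return (match.group(1).lower(), match.group(2).lower())
```

A field with no header is simply treated as plain text, so existing data files would not need to change.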

> 
> A variation on this would be to define a larger number of
> _item_type_list.codes corresponding to the various text formats of
> interest, for example 'ascii_markup','tex','html','mathml'.  This would
> mean that the format of a given data item would be determined at
> dictionary writing time if a single type code is given in the
> dictionary.  While this might work and be quite useful when writing
> dictionaries, it is probably too onerous when producing data files. So
> the data dictionary would specify a list of possible text type codes,
> and a magic number or mime header would be useful in the data item text
> field in order to disambiguate.
Why is that too onerous for text fields? CIF already has the problem
that everything is plain text in the absence of a dictionary, yet there
is no flag to mark a value as numeric. CIF would be much more
self-describing if there were one, but the current design is to base
everything on a dictionary.

In general, a generic CIF parser should be able to handle all of the
text fields unprocessed, and leave it to the reader to make sense of
them. That is why multipart/alternative is good; even the raw form is
readable. The only caveat added by MIME is that the regions between the
multipart boundaries may happen to contain the "<eol>;" end mark, which
would end the CIF text field prematurely.
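
As a rough illustration of that caveat, a writer could check a MIME payload before embedding it in a text field. This is only a sketch: the name first_unsafe_line is made up, and it conservatively flags any line that starts with a semicolon, assuming the payload begins on the line after the opening semicolon.

```python
import re

# A CIF text field is closed by a semicolon in the first column of a line,
# so any payload line that begins with ';' would end the field early.
_TERMINATOR = re.compile(r"^;", re.MULTILINE)


def first_unsafe_line(payload):
    """Return the 1-based number of the first line starting with ';', or None."""
    match = _TERMINATOR.search(payload)
    if match is None:
        return None
    # Count the newlines before the match to turn the offset into a line number.
    return payload.count("\n", 0, match.start()) + 1
```

A payload that fails the check would have to be escaped or re-encoded before it could go into the data file.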

> 
> Regarding the suggestion that there be several representations of the
> same text using a mime multipart approach, I think caution is warranted
> insofar as this might relate to dictionary data items (as opposed to
> data file data items), in that all of the parts should be kept
> synchronised, entailing more work, and work which involves specialised
> knowledge.
Perhaps alternative parts need to include a flag as to which form is the
authoritative representation. Then, if it is changed and the other
form(s) are not, the alternatives must be deleted or marked "invalid"
until they can be remade properly.
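
A minimal sketch of that bookkeeping, assuming each alternative carries two flags. The Part class and its field names are hypothetical, not part of any existing CIF or MIME library.

```python
from dataclasses import dataclass


@dataclass
class Part:
    content_type: str            # e.g. "text/plain"
    body: str
    authoritative: bool = False  # True for the one reference form
    valid: bool = True           # False once the reference form has moved on


def update_authoritative(parts, new_body):
    """Replace the authoritative body and mark every other part invalid."""
    sources = [p for p in parts if p.authoritative]
    if len(sources) != 1:
        raise ValueError("exactly one part must be authoritative")
    if sources[0].body == new_body:
        return  # nothing changed, so the alternatives stay valid
    sources[0].body = new_body
    for p in parts:
        if not p.authoritative:
            p.valid = False
```

An invalid part stays in place until someone remakes it, so a reader can still see that an alternative once existed.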

However, it is certainly worth avoiding excessive use. In general,
things like equations should be edited only by the author(s), whereas
most other users will handle them in "read only" form.

This could be used for something like internationalization of CIF
dictionaries as well. In that case, I assume that English would be the
primary reference, and other contributors can add translations. If the
English part changes, the translations could be flagged as out-of-date
until a language contributor can update the translation.

Of course, my native language is American English, so it is not a big
deal for me. What do non-English crystallographers think? I have
wondered about the possibility of having US and UK alternatives for
words like metre. Or should we just declare US spelling as wrong?

Joe
