Proposal to regulate markup in CIF files

Tue Oct 31 05:39:07 GMT 2017

There having been no further comment regarding the proposal below, I will
move it forward by:
(1) giving the CIF syntax Vol G authors the go-ahead to include the
character encoding conventions as part of the CIF syntax chapter;
(2) adding the 'Marked-up text' type (or alternative name) to the list of
possible types in DDLm;
(3) suggesting to IUCr journals that they develop a markup specification.
Let me know if you would like to be part of that discussion.

James.

On 21 September 2017 at 10:59, James Hester <jamesrhester at gmail.com> wrote:

> Dear All,
>
> I've turned the proposal into a discussion document, separating non-ASCII
> character markup and other markup as per John's comments. The best way
> forward for 'other markup' is not clear to me, so I've asked a few
> questions at the end of the document that I invite you to consider.
>
> I've also inserted replies to John's comments inline below.
>
> James.
>
> On 14 September 2017 at 01:08, Bollinger, John C <John.Bollinger at stjude.org> wrote:
>
>> Dear Colleagues,
>>
>>
>>
>> I think the proposed approach of allowing the markup convention to be
>> specified in data files could be useful.  Moreover, the proposal would
>> provide a mechanism for formalizing the specification of which items are
>> subject to markup in the first place.  Supposedly, that is determined by
>> items’ definitions, but in practice, few, if any, current dictionaries or
>> definitions actually address the topic explicitly.
>>
>>
>>
>> Considerations:
>>
>>
>>
>> 1. To the extent that the proposal envisions data files being enabled to
>> self-specify a particular markup convention from among several choices, it
>> seems to violate the principle that the meaning of an item should not
>> depend on the value of a different item.
>>
>
> In my opinion, this principle needs to be clarified.  For example, any
> non-key data name 'depends' on the value of key data names in its category,
> or the meaning of a fractional coordinate 'depends' on the values of the
> cell parameters. We need a precise rephrasing of the principle, e.g. "no
> new key data names may be added to a pre-existing category" or "where
> possible, the meaning of data names should depend only on data names that
> are identifiers".  Once this is clarified we might be in a better position
> to judge when a proposal is suspect.  Does anyone know the underlying basis
> for this principle?
>
>>
>>
>> 2. Since the codes for various supported markup conventions would be
>> defined in domain dictionaries, it seems we might be setting ourselves up
>> for future issues in the event that dictionary maintainers want to support
>> new markup conventions, for then we will need to change existing
>> definitions (which, granted, we have lately afforded ourselves some freedom
>> to do).
>>
>
> The concern, as I understand it, is that software written with one set of
> markup conventions in mind would be caught unawares by data values written
> according to a new markup convention. However, the idea is that
> _publ.markup_convention is used to flag the use of a new convention and
> allow such software to act appropriately.  Adding new enumerated values is
> a normal way of developing dictionaries and I don't think it is
> controversial.  In any case, I have removed this option from the proposal
> for now.
>
>>
>>
>> 3. Additionally, the proposal seems to imply that each domain dictionary
>> would need to either specify its own analogue(s) of
>> `_publ.markup_convention`, or else to explicitly depend on the Core item.
>> I am uncertain whether this is a good or bad thing.
>>
>
> Yes, this is true.  All of our dictionaries currently depend on the core
> dictionary in any case and this is unlikely to change. A completely
> different domain would need to define its own analogue.
>
>>
>>
>> 4. The proposal does not preclude individual items from being defined to
>> have whatever content type is desired for them, as long as the definitions
>> are not flagged as carrying `Marked-up` data.  That’s good, but I can
>> imagine hypothetical cases in which it would be confusing to dictionary
>> authors.  For example, if an item were defined to contain HTML markup, it
>> would be necessary for its definition to specify that its values were NOT
>> `Marked-up`.
>>
>
> This is true.
>
>>
>>
>> 5. Although we call the existing CIF conventions “markup” – and that’s
>> fitting – for the most part they don’t serve the same purpose as Markdown,
>> reStructuredText, etc.  Almost all of our markup is aimed at encoding
>> specific characters, whereas the other markup conventions mentioned are
>> focused on document structure and styling.  These are largely orthogonal
>> considerations, so perhaps we should approach them separately.
>>
>
> I have done this in the revised proposal.
>
>>
>>
>> 6. The discussion accompanying the proposal distinguishes between items
>> that are intended to be machine-actionable and those intended purely for
>> human consumption, with the assertion that only the latter kind should be
>> subject to markup.  I would accept that if it referred to structural markup
>> only, but it is not so clear that such a rule is appropriate for markup
>> that serves to encode characters.  Consider, for example,
>> `_atom_site.label`.  As a key data name, it certainly has machine
>> significance, but its values are also meant to identify atoms to humans.
>> Should CIFs, then, be forbidden from using `Cα` (i.e. `C\a`) as or in
>> atom labels?  Forbidding use of markup would prevent literal `Cα` from
>> being expressed in an atom label in a CIF 1.1 document, but not in a CIF
>> 2.0 document, so that sets up a pathway wherein markup gets introduced into
>> data values through format transliteration from CIF2 to CIF1.  If only the
>> original document were considered valid in such cases, then that would
>> constitute a rather nasty trap to set for ourselves.
>>
>>
>>
> Agreed, hopefully the rewritten proposal addresses this.
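
To make the transliteration trap in John's point 6 concrete, here is a
minimal sketch. It assumes a tiny illustrative subset of the Greek-letter
escapes (the real table is larger), and the names UNICODE_TO_CIF1 and
to_cif1 are hypothetical, not part of any existing CIF software.

```python
# Illustrative subset of the CIF Greek-letter escapes (assumption: the
# full table is larger and would come from the CIF syntax description).
UNICODE_TO_CIF1 = {"α": r"\a", "β": r"\b", "γ": r"\g", "δ": r"\d"}


def to_cif1(value: str) -> str:
    """Replace each mapped non-ASCII character with its backslash escape."""
    return "".join(UNICODE_TO_CIF1.get(ch, ch) for ch in value)


cif2_label = "Cα"                  # atom label as written in a CIF2 file
cif1_label = to_cif1(cif2_label)   # the same label transliterated to CIF 1.1

print(cif1_label)                  # C\a
print(cif1_label == cif2_label)    # False: the raw strings differ
```

A program that matched atom labels on the raw strings would treat the two
files as referring to different atoms, which is why the escaped and
unescaped forms need to be declared equivalent.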
>
>>
>>
>> Initial analysis:
>>
>>
>>
>> As presented, the proposal’s largest impact would probably be to provide
>> for DDLm dictionaries to specify which items are primarily (or exclusively)
>> intended for human consumption: those that are defined to have `Marked-up`
>> content, regardless of whether any markup is actually present in their
>> values.  Initially, at least, this would establish in which values to
>> interpret the standard CIF markup conventions.  That would be worthwhile.
>>
>>
>>
>> I am less certain about the prospects for or usefulness of enabling data
>> files to select alternative markup conventions.  Perhaps that could be used
>> to good effect to support revisions to the markup conventions.  Perhaps it
>> would be more broadly applicable.  But perhaps we should avoid making any
>> items’ meanings depend on other items’ values.
>>
>
> I have rewritten the proposal to remove the option of specifying
> particular types of markup, and indeed largely removed discussion of markup
> pending some feedback from this group.
>
> James.
> ================================================================
> Revised discussion paper for regulating markup of CIF text items
> ================================================================
>
> Summary
> =======
>
> 1. Data values containing backslash escapes for indicating non-ASCII
>    characters are to be considered entirely equivalent to values
>    obtained after all such substitutions and escapes have been applied.
>
> 2. Other mark-up (superscripts, subscripts, italic etc.) is given a
>    new type, but is otherwise unspecified and needs to be discussed.
>
> Introduction
> ============
>
> From the very first publication describing CIF, markup conventions
> have been provided in order to extend the range of characters and font
> effects representable in ASCII.  Which data values these conventions
> might apply to, and whether or not this is more properly a CIF syntax
> or dictionary (semantic) issue, has been left implicit.
>
> Marked-up text according to the ad-hoc definitions described in Vol G
> appears both in CIF data files and in dictionary definitions. While
> COMCIFS has control over the conventions applying within dictionaries,
> it has far less control over data values in data files, which are
> produced both by dedicated software, such as publCIF, and hand-editing
> or local ad-hoc solutions.  Marked-up text in data files plays an
> important role in the publication workflow.
>
> Vol G (First Edition) notes in section 2.2.5.3: "It is hoped that in
> future different types of such markup may be permitted so long as the
> data values affected can be tagged with an indication of their content
> type that allows the appropriate content handlers to be invoked". It is
> not, however, clear that multiple alternative markups are desirable.
>
> Moving forward
> ==============
>
> The markup in use can be divided into two classes: 'character encoding'
> and 'font effects'.  Under this proposal, each class is treated
> differently.
>
> 1. Character encoding
>
>    Character encoding markup represents non-ASCII letters using a
>    backslash followed by one or more ASCII characters, for example,
>    '\a' is 'greek letter alpha'.  This is a format-specific method of
>    allowing access to the full character set used by the DDL textual
>    types. From the point of view of the (format-agnostic) dictionary
>    data model, how a particular format wishes to encode characters is
>    irrelevant. Therefore, the set of character escapes is most
>    appropriately documented as part of the description of CIF syntax,
>    not within a DDL dictionary.  In other words, CIF1/2 data values with
>    backslash character escapes are semantically identical to CIF2 data
>    values where those escapes have had their Unicode equivalents
>    substituted.
>
> 2. Font effects
>
>    Font effects differ from character encodings in not having a DDLm
>    type that they are a concrete realisation of. By analogy, then, we
>    could create a DDLm type 'Marked-up text', whose contents contain
>    marked-up text.  Particular implementations and syntaxes might then
>    specify what particular convention(s) 'Marked-up text' should
>    conform to.
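
The equivalence described under 'Character encoding' above can be sketched
as follows. This is only an illustration under stated assumptions: it
handles a small subset of the Greek-letter escapes, ignores accents and
any other escapes, and the names CIF_ESCAPES, decode_escapes and
same_value are hypothetical.

```python
import re

# Illustrative subset of the backslash escapes (assumption: the complete
# table would be documented with the CIF syntax, not in a dictionary).
CIF_ESCAPES = {"a": "α", "b": "β", "g": "γ", "d": "δ"}

# Matches a backslash followed by one of the escape letters listed above.
_ESCAPE_RE = re.compile(r"\\([" + "".join(CIF_ESCAPES) + r"])")


def decode_escapes(value: str) -> str:
    """Substitute each recognised backslash escape with its Unicode character."""
    return _ESCAPE_RE.sub(lambda m: CIF_ESCAPES[m.group(1)], value)


def same_value(left: str, right: str) -> bool:
    """Compare two data values after escape substitution."""
    return decode_escapes(left) == decode_escapes(right)


print(decode_escapes(r"C\a"))      # Cα
print(same_value(r"C\a", "Cα"))    # True
```

Comparing values only after substitution is what makes a CIF1 value with
escapes and a CIF2 value with literal Unicode characters interchangeable.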
>
> Notes
> =====
>
> 1. An important function of the 'Marked-up text' type is to
>    designate data values that are not intended to be machine-actionable.
>    No DDLm functions or attributes are envisioned for manipulating the
>    markup. The type could alternatively be something like 'Rich text'.
>
> 2. Enumerated values and identifiers must not be of type 'Marked-up text'.
>
> 3. The 'Marked-up text' data value is obtained from a CIF syntax
>    file after backslash character codes have been substituted.
>
> Open questions
> ==============
>
> The above proposal does not specify a particular markup convention.
> Leaving anything unspecified is dangerous for a standard, as it invites
> the appearance of multiple, incompatible solutions.  We should as a
> matter of urgency answer the following questions:
>
> (1) Should alternative markup conventions be possible?
> (2) If yes, should the markup convention in use be
>     (i) per dataname?
>     (ii) per datavalue (maybe via an embedded flag)?
>     (iii) per data block?
>     (iv) per dictionary?
>     (v) per syntax? (e.g. CIF/CIF-JSON/HDF5 etc.)
> (3) If no, the current convention is the only possible one for reasons of
>     backward compatibility. Should it be:
>     (i) a feature of CIF syntax?
>     (ii) a feature of CIF syntax when combined with a DDLm dictionary?
>     (iii) defined in DDLm?
>
> My answers to these questions would be:
> (1) No alternatives should be possible, in order to simplify publishing
>     workflows and maintain the publCIF investment
> (3) Option (ii)
>
> Some explanation regarding (3)(ii), which possibly sounds a bit
> abstruse. A CIF syntax file can be used (in theory) with an
> alternative dictionary language and associated data model. Likewise,
> DDLm dictionaries can be used to describe non-CIF files.  In each
> case, the way in which syntactical data values are constructed to
> match the dictionary types may differ (for example, numbers may be a
> text string or binary).  Each combination of syntax and dictionary
> must explicitly state how each dictionary type is represented in that
> syntax. So I am suggesting that, for the combination 'CIF + DDLm', we
> specify the current markup conventions to represent type 'Marked-up
> text'.
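
One way to picture 'each combination of syntax and dictionary must state
how each dictionary type is represented' is a lookup keyed on that
combination. The sketch below is hypothetical throughout: the registry
REPRESENTATIONS, the function interpret and the two entries shown are
illustrative assumptions, not an existing API or an agreed specification.

```python
from typing import Callable, Dict, Tuple

# Key: (syntax, dictionary language, dictionary type).
Key = Tuple[str, str, str]

# Hypothetical registry stating how a raw syntactic value becomes a value
# of the given dictionary type. Only two illustrative entries are shown.
REPRESENTATIONS: Dict[Key, Callable[[str], object]] = {
    # In CIF a number arrives as text (s.u. handling is omitted here).
    ("CIF", "DDLm", "Real"): float,
    # Placeholder: the text is passed through with its markup uninterpreted.
    ("CIF", "DDLm", "Marked-up text"): str,
}


def interpret(syntax: str, dictionary_language: str, type_name: str, raw: str) -> object:
    """Convert a raw value using the rule stated for this combination."""
    key = (syntax, dictionary_language, type_name)
    if key not in REPRESENTATIONS:
        raise LookupError(f"no representation stated for {key}")
    return REPRESENTATIONS[key](raw)


print(interpret("CIF", "DDLm", "Real", "1.5"))               # 1.5
print(interpret("CIF", "DDLm", "Marked-up text", "H~2~O"))   # H~2~O
```

A different syntax paired with DDLm would need its own entries, and a
missing entry is reported rather than guessed.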
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
