Standardising inter-block linking

Tue Apr 3 01:23:57 BST 2018

Dear James,

I am no longer a member of the IUCr Executive Committee and am therefore
not the rep for COMCIFS.  Please update the email list.

Best wishes,

Mitchell

Professor Emeritus Mitchell Guss
School of Life & Environmental Sciences
University of Sydney
NSW 2006
Australia

Phone: +61 (0)2 9351 4302
Fax:   +61 (0)2 9351 5858

On 3 April 2018 at 10:19, James Hester <jamesrhester at gmail.com> wrote:

> Dear COMCIFS members,
>
> The 6-week discussion period having elapsed, the inter-block linking
> proposal is now accepted.  I will inform dictionary authors of the new tool
> at their disposal.
>
> James Hester (chair).
>
>
> On 19 February 2018 at 16:37, James Hester <jamesrhester at gmail.com> wrote:
>
>> Dear COMCIFS members
>>
>> One inadequately resolved issue that arises in dealing with the more
>> complex dictionaries is how links between data blocks are handled
>> (msCIF/pdCIF/magCIF). The end of Section 2.2.6 in Vol G, 1st edition also
>> alludes to a need for resolution of this problem. The following proposal
>> has therefore been developed as a way of standardising linking between data
>> blocks, and has the pleasant side-effect that it also explains the context
>> in which dREL methods operate when presented with multiple data blocks.
>> The dictionary most reliant on data block linking is the modulated
>> structure CIF, and Gotzon Madariaga, the original author, has confirmed
>> that this proposal would meet the needs of that dictionary.
>>
>> I've posted the proposal at
>> https://github.com/COMCIFS/comcifs.github.io/blob/master/InterBlockLinkProposal.rst
>> and have also included it as ASCII at the end of this email for the
>> record, but strongly urge you to go to the above link for a more nicely
>> formatted version.
>>
>> The proposal remains open for discussion for 6 weeks. If all issues have
>> been resolved by that time, it will be considered accepted.
>>
>> James.
>>
>> Proposal to Improve Inter Block Linking
>> ***************************************
>>
>> Introduction
>> ============
>>
>> A number of current CIF dictionaries rely on pointers to other
>> data blocks. In particular:
>>
>> * A powder sample can consist of multiple components (perhaps
>>   confusingly called "phases").  Each of these materials is described
>>   in a separate data block and then referenced from the block in which
>>   the overall fit to the data is described.  Powder diffraction patterns
>>   may also be referenced in the same way.
>>
>> * The modulated structures dictionary (msCIF) allows for composite
>>   structures, with the space group and cell of each member of the
>>   composite described in separate data blocks.
>>
>> * The new magnetic structures dictionary envisions that alternative
>>   descriptions of the same magnetic structure could be described in
>>   separate data blocks.
>>
>> In addition, cif core defines the ``_audit_link.*`` data names, which
>> allow listing of datablock identifiers together with some plain text
>> description of the nature of the relationship between the blocks.
>>
>> Each scheme is different, expects different contents for the linked
>> data blocks, and varies in the degree to which computers can use the
>> linking information.
>>
>> In addition, work on 'CheckCIF for raw data' requires some way of
>> working with the many different ways of arranging image frames into
>> files and directories.
>>
>> This proposal therefore suggests formalising data block linkages in a
>> way that allows arbitrary expansion as well as making the precise
>> nature of the relationships between data blocks machine-readable in
>> all situations.
>>
>> General idea
>> ============
>>
>> First some definitions:
>>
>> **key**
>>    the set of datanames whose values together define a unique
>>    row
>>
>> **projection**
>>    choosing a single value for each key data name, and then
>>    selecting only those rows of categories that correspond to those
>>    particular values, ignoring any categories for which that data name
>>    is irrelevant.
>>
>> Any data collection can be split by projecting over the individual
>> values of a data name that forms part of a key, and putting the result
>> for each projection into a separate block.  These projected blocks do
>> not have to include the projected dataname within loops, as the
>> dataname is constant within a data block.
>>
>> If multiple blocks belonging to the same data collection can be
>> described as projections over one or more data names, we have
>> completely defined their relationship.
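
A minimal sketch of projection in Python, assuming a category is held as a
list of rows (one dict per row). The function name ``project`` and the
layout of the returned blocks are illustrative assumptions, not part of the
proposal; the data names and values come from the msCIF example at the end
of the proposal.

```python
from collections import defaultdict


def project(rows, key_name):
    """Split looped rows into one notional data block per value of key_name.

    Hypothetical helper: each returned block states the projected key once
    (it is constant within the block) and omits it from the remaining rows.
    """
    grouped = defaultdict(list)
    for row in rows:
        remainder = {name: value for name, value in row.items() if name != key_name}
        grouped[row[key_name]].append(remainder)
    return {
        value: {"constant": {key_name: value}, "rows": block_rows}
        for value, block_rows in grouped.items()
    }


# The _cell_subsystem loop from the example, one dict per row.
subsystems = [
    {"_cell_subsystem.code": "NbS2", "_cell_subsystem.description": "1st subsystem"},
    {"_cell_subsystem.code": "LaS", "_cell_subsystem.description": "2nd subsystem"},
]

blocks = project(subsystems, "_cell_subsystem.code")
for value, block in blocks.items():
    print(value, block)
```

Each projected block carries the key value once, which mirrors the remark
that the projected dataname need not appear within loops.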
>>
>> Proposal for a Universal System for Linking Data Blocks
>> =======================================================
>>
>> In order to completely define data block interrelationships, data
>> blocks resulting from projection over a key value must:
>>
>> (1) be linked, and
>> (2) indicate the key data name used for the projection, with the
>>     correct value assigned.
>>
>> Linkage
>> -------
>>
>> This is accomplished by including a dataname which has the same,
>> unique value in all linked blocks.  I propose calling this dataname
>> ``_audit_dataset.id``.
>>
>> Key values
>> ----------
>>
>> Each data block must explicitly state the value of the parent key
>> against which it has been projected, by providing the parent key's
>> data name and value.
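
The two requirements can be sketched as the inverse operation: collecting
linked blocks and rebuilding the notional loop. This is only an
illustration under stated assumptions: a data block is modelled as a flat
dict of single-valued items, each block has a single dataset identifier,
and ``assemble`` and the ``some-other-dataset`` value are made up for the
example.

```python
def assemble(blocks, dataset_id, key_name):
    """Rebuild the notional loop over key_name from linked data blocks.

    Hypothetical helper. Blocks with a different dataset identifier are
    not linked; linked blocks that do not state key_name (for example
    common data) contribute no row.
    """
    rows = []
    for block in blocks:
        if block.get("_audit_dataset.id") != dataset_id:
            continue
        if key_name not in block:
            continue
        rows.append(
            {name: value for name, value in block.items() if name != "_audit_dataset.id"}
        )
    return sorted(rows, key=lambda row: row[key_name])


DATASET = "1997-07-24|LaSNbS2|G.M."
blocks = [
    # Common data: linked, but the key is not relevant here.
    {"_audit_dataset.id": DATASET},
    {
        "_audit_dataset.id": DATASET,
        "_cell_subsystem.code": "NbS2",
        "_cell_subsystem.description": "1st subsystem",
    },
    {
        "_audit_dataset.id": DATASET,
        "_cell_subsystem.code": "LaS",
        "_cell_subsystem.description": "2nd subsystem",
    },
    # Made-up block from an unrelated dataset; it must be ignored.
    {"_audit_dataset.id": "some-other-dataset", "_cell_subsystem.code": "X"},
]

loop = assemble(blocks, DATASET, "_cell_subsystem.code")
print(loop)
```

Software working on ``loop`` does not need to know how the rows were
distributed over data blocks.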
>>
>>
>> Effect on existing dictionaries
>> ===============================
>>
>> The following examines changes or enhancements required to implement
>> this approach in those dictionaries that already use inter-block
>> references. In no case do existing data names require redefinition or
>> removal; the new system can exist alongside the old system.
>>
>> Comparison for msCIF
>> --------------------
>>
>> The current msCIF arrangement splits the information for composite
>> structures over a number of data blocks. The 'master' block contains a
>> loop indexed by ``_cell_subsystem.code``, the value of which is used to
>> find structural data in a separate data block that has an
>> ``_audit_link.block_code`` identifier composed as
>> ``<arbitrary>REFRNCE_<code>``. ``<arbitrary>`` is a string common to all
>> linked blocks and ``<code>`` is the value of ``_cell_subsystem.code``
>> appearing in a row of the ``_cell_subsystem`` loop. If the word MOD is
>> used instead of REFRNCE, the data block contains a description of the
>> modulated structure of this component. If no trailing ``<code>`` is
>> present, the data block contains either common structural features
>> (REFRNCE) or the whole modulated structure (MOD). If neither REFRNCE nor
>> MOD is present, the data block contains common data items. In this
>> approach, the links between data blocks are created through
>> ``_audit_link.block_code`` value matching, and the nature of the link is
>> signalled by the reserved words MOD and REFRNCE.
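
For comparison, the string matching that the current convention requires
of reading software might look like the following sketch. The regular
expression is an assumption based on the description above and on the
block codes in the example at the end of the proposal (which place an
underscore before the reserved word); ``parse_block_code`` and
``BlockCode`` are hypothetical names. It would misparse an ``<arbitrary>``
string that itself contains MOD or REFRNCE.

```python
import re
from typing import NamedTuple, Optional


class BlockCode(NamedTuple):
    prefix: str               # the <arbitrary> string shared by linked blocks
    kind: Optional[str]       # "REFRNCE", "MOD", or None for common data
    subsystem: Optional[str]  # the <code> value, or None if absent


# Assumed layout: <arbitrary>[_](REFRNCE|MOD)[_<code>]
_PATTERN = re.compile(r"(?P<prefix>.*?)_?(?P<kind>REFRNCE|MOD)(?:_(?P<subsystem>.+))?")


def parse_block_code(code):
    """Split a legacy block code into its prefix, reserved word and subsystem."""
    match = _PATTERN.fullmatch(code)
    if match is None:
        # Neither reserved word present: the block holds common data items.
        return BlockCode(code, None, None)
    return BlockCode(match["prefix"], match["kind"], match["subsystem"])


for code in (
    "1997-07-24|LaSNbS2|G.M.|",
    "1997-07-24|LaSNbS2|G.M.|_REFRNCE",
    "1997-07-24|LaSNbS2|G.M.|_MOD_NbS2",
):
    print(parse_block_code(code))
```

The nature of each link is recovered only by convention, which is the
limitation the proposed approach is intended to remove.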
>>
>> In the proposed approach (see example at end):
>>
>> 1. ``<arbitrary>`` becomes the value of ``_audit_dataset.id``
>> 2. ``_cell_subsystem.code`` should be given in each data block for which
>>    it is relevant
>> 3. the loop over ``_cell_subsystem.code`` in the "master" block should
>>    either be removed, or else a new child data name defined for use
>>    in projection (choice of dictionary authors).
>> 4. The msCIF dictionary should include child data names of
>>    ``_cell_subsystem.code`` in the ``space_group``, ``cell`` and all
>>    ``_atom_site_*`` categories
>> Comparison for pdCIF
>> --------------------
>>
>> pdCIF defines a ``_pd.block_id`` and then links to blocks via this ID
>> in three distinct ways (Vol G p 124):
>>
>> (1) To refer to a block containing a diffraction measurement
>> (2) To refer to a block containing a structure description
>> (3) To refer to a block containing calibration information
>>
>> In the proposed approach:
>>
>> 1. ``_audit_dataset.id`` can be used instead of, or alongside,
>>    ``_pd.block_id``. Whereas ``_pd.block_id`` is different for every
>>    block, ``_audit_dataset.id`` is identical for blocks belonging to a
>>    single dataset.
>> 2. A new (key) data name, e.g. ``_pd_diffractogram.id``, is created and
>>    stated in any data blocks containing diffractograms.
>> 3. The value of ``_pd.phase_id`` is stated in any data blocks containing
>>    structures.
>> 4. New child key data names are added to any categories that could appear
>>    in the separate ``_pd.phase_id`` or ``_pd_diffractogram.id`` blocks.
>> 5. Calibration: an external calibration dataset is a combination of a
>>    diffractogram and a phase. A ``_pd_calib_std_external.phase_id`` and
>>    ``_pd_calib_std_external.diffractogram_id`` should be additionally
>>    defined. (1),(2) and (3) are carried out as relevant for the
>>    calibration dataset and phase. If calibrations involve more than
>>    powder diffraction measurements, further data names describing
>>    these measurements should be defined.
>>
>> magCIF
>> ------
>>
>> The magnetic structures dictionary wishes to link to alternative
>> descriptions of the same magnetic structure in separate data blocks.
>> In this case:
>>
>> 1. ``_audit_dataset.id`` is set to be identical in all relevant data
>>    blocks
>> 2. a dataname along the lines of ``_magn_structure_transform.id`` is set
>>    in each of these data blocks
>> 3. Child data names of ``_magn_structure_transform.id`` are added to all
>>    categories that might be used in describing an alternative structure.
>>
>> Advantages
>> ==========
>>
>> 1. To a large extent, data can be added to datasets by simply creating
>>    a new data block with the same ``_audit_dataset.id``.  For example,
>>    an extra measurement on a new sample of the same compound will
>>    automatically be (semantically) incorporated into a dataset simply
>>    by becoming present, whether in a separate file or an appended block.
>> 2. dREL methods can be written in complete ignorance of the way in
>>    which data have been distributed over data blocks. In effect, a dREL
>>    method operates in the context of all data available for a given value
>>    of ``_audit_dataset.id``.
>> 3. The effects of unexpected looping over 'Set' datanames that
>>    ``_audit.schema`` addresses can be reduced by using separate data
>>    blocks. So the choice exists to split multiple crystals, multiple
>>    space-groups etc. over multiple data blocks, without changing the
>>    underlying semantics.
>> 4. Formats which collate many files to form the dataset are easy to
>>    describe in this paradigm: for example, image frames in separate
>>    files are simply assigned to the same dataset, with each file
>>    including the value of the image identifier data name used to
>>    'project' the data file from the notional loop of images.
>> 5. The system is open-ended in terms of allowing disparate items of
>>    information to be collated together with well-defined relationships.
>>    This means it can essentially cover all ways of aggregating data into
>>    datasets.
>> 6. The old block linkage systems can remain in place and can be used to
>>    provide double-checking where possible.
>>
>> Disadvantages
>> =============
>>
>>
>> 1. Flexibility in how data from complex datasets is distributed over
>>    data blocks may cause unnecessary work for data reading software
>>    programmers attempting to cover all situations.  This could be
>>    remedied by individual dictionaries recommending particular
>>    approaches.
>>
>> Interaction with ``_audit.schema``
>> ==================================
>>
>> We have recently defined a data name, ``_audit.schema``, that signals
>> when 'Set' categories have become looped in a data block. The present
>> proposal allows 'Set' categories to be always single-valued in a
>> single data block, yet take multiple values for the dataset as a
>> whole.  We must therefore choose between alternative meanings of
>> ``_audit.schema``: does it mean that 'Set' categories are looped
>> semantically, or both semantically and syntactically?  (Obviously, if
>> Set categories are looped syntactically, i.e. within a single data
>> block, then they are also semantically looped.)  I propose that, even
>> if all data blocks conform to the default schema, at least some values
>> in related data blocks are likely to be materially significant for
>> interpretation of one another (for example, multiple crystal
>> measurements feed into final values of I_meas) and so ``_audit.schema``
>> should indicate semantic looping, i.e.
>>
>> * ``_audit.schema`` **must** take a non-default value where Set categories
>>   can take multiple values **and** a data block contains loops over
>>   these Set categories.
>>
>> * ``_audit.schema`` **must** take the appropriate non-default value if
>>   information for a dataset has been spread over several data blocks.
>>
>> * ``_audit.schema`` **must** only take the default value if the dataset
>>   consists of a single block conforming to the core CIF dictionary.
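
Read together, the three rules amount to a single condition under which
the default value is permitted. The following is a sketch of that reading
only; the function name and its arguments are hypothetical, and how a
program would actually detect looped Set categories or core conformance is
left open.

```python
def default_schema_allowed(blocks_in_dataset, looped_set_categories, conforms_to_core):
    """Return True only when ``_audit.schema`` may keep its default value.

    blocks_in_dataset: number of data blocks sharing the dataset identifier
    looped_set_categories: names of Set categories looped within a block
    conforms_to_core: whether the block conforms to the core CIF dictionary
    """
    single_block = blocks_in_dataset == 1
    no_syntactic_looping = len(looped_set_categories) == 0
    return single_block and no_syntactic_looping and bool(conforms_to_core)


# A lone core CIF block may use the default schema.
print(default_schema_allowed(1, [], True))
# A dataset spread over several blocks may not, even if no block has loops.
print(default_schema_allowed(3, [], True))
# A single block that loops a Set category may not either.
print(default_schema_allowed(1, ["cell"], True))
```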
>>
>> On datasets
>> ===========
>>
>> Note that a single data block can belong to multiple data sets. For
>> example, calibration information may be relevant to multiple data
>> collections, or a single measurement may be relevant to different
>> modelling exercises (e.g. joint or single refinement of X-ray and
>> neutron data) and therefore have different dataset identifiers in each
>> case.
>>
>> Discussion
>> ==========
>>
>> This approach is close in spirit to the work of Nick Spadaccini and
>> Syd Hall in creating DDLm Ref-loops, which were projections of specified
>> Set categories into save frames. The current proposal removes the
>> syntactical element, exposes the behaviour of the keys, and adopts a
>> global relational view of the underlying semantics.
>>
>> Note that ``_audit_dataset.id`` is a grouping mechanism. The particular
>> value taken by this data name is only relevant if the dataset is
>> referred to externally to that dataset. Therefore, if a data format
>> allows data blocks to be grouped in some other way (e.g. files in a
>> directory, nodes in a hierarchy) there is no need to explicitly assign a
>> value to this dataname during data block creation.
>>
>> Example
>> =======
>>
>> The following example shows part of a CIF for a modulated structure
>> composed of two components, LaS and NbS2 (based on `Example 3, p 271,
>> IT Vol G <http://it.iucr.org/Ga/ch4o3v0001/Catom_site_displace_Fourier.html>`_)
>> ::
>>
>>     # Common data
>>     data_LaSNbS2
>>     # The common dataset identifier
>>     _audit_dataset.id  1997-07-24|LaSNbS2|G.M.
>>     # Signal which categories are split across datablocks
>>     _audit.schema      'Modulated'
>>     # Signal the type of calculations used
>>     _audit.formalism   'Modulated Single Crystal'
>>     # The actual dictionary that this conforms to
>>     _audit_conform.dict_name 'msCIF.dic'
>>     # Old linkage data may be kept. Not all following blocks included in
>>     # this example for brevity
>>     loop_
>>              _audit_link_block_code
>>              _audit_link_block_description
>>     1997-07-24|LaSNbS2|G.M.|
>>                       'common experimental and publication data'
>>     1997-07-24|LaSNbS2|G.M.|_REFRNCE
>>                              'reference structure (common data)'
>>     1997-07-21|LaSNbS2|G.M.|_MOD
>>                              'modulated structure (common data)'
>>     1997-07-24|LaSNbS2|G.M.|_MOD_NbS2
>>                            'modulated structure (1st subsystem)'
>>     1997-07-24|LaSNbS2|G.M.|_REFRNCE_LaS
>>                            'reference structure (2nd subsystem)'
>>     1997-07-21|LaSNbS2|G.M.|_MOD_LaS
>>                            'modulated structure (2nd subsystem)'
>>
>>     _cell_subsystems_number                  2
>>     # The following loop is now split across data blocks
>>     # or retained with a child data name used for projection
>>     #loop_
>>     #     _cell_subsystem_code
>>     #     _cell_subsystem_description
>>     #     _cell_subsystem_matrix_W_1_1
>>     #     _cell_subsystem_matrix_W_1_4
>>     #     _cell_subsystem_matrix_W_2_2
>>     #     _cell_subsystem_matrix_W_3_3
>>     #     _cell_subsystem_matrix_W_4_1
>>     #     _cell_subsystem_matrix_W_4_4
>>     #             NbS2            '1st subsystem'  1 0 1 1 0 1
>>     #             LaS             '2nd subsystem'  0 1 1 1 1 0
>>
>>     # Common experimental and publication data elided ...
>>
>>     # Items concerning the modulated structure of the first
>>     # subsystem
>>
>>     data_LaSNbS2_MOD_NbS2
>>          # Old block identifier
>>          _audit_block_code         1997-07-24|LaSNbS2|G.M.|_MOD_NbS2
>>          # Common dataset identifier
>>          _audit_dataset.id         1997-07-24|LaSNbS2|G.M.
>>          # Signal which categories are split across datablocks
>>          _audit.schema      'Modulated'
>>          # Signal the type of calculations used
>>          _audit.formalism   'Modulated Single Crystal'
>>          # The actual dictionary that this conforms to
>>          _audit_conform.dict_name 'msCIF.dic'
>>          # Projected key data name
>>          _cell_subsystem_code      NbS2
>>          # Projected information for value = NbS2 of key data name
>>          _cell_subsystem_description  '1st subsystem'
>>          _cell_subsystem_matrix_W_1_1   1
>>          _cell_subsystem_matrix_W_1_4   0
>>          _cell_subsystem_matrix_W_2_2   1
>>          _cell_subsystem_matrix_W_3_3   1
>>          _cell_subsystem_matrix_W_4_1   0
>>          _cell_subsystem_matrix_W_4_4   1
>>
>>          loop_
>>              _atom_site_Fourier_wave_vector_seq_id
>>              _atom_site_Fourier_wave_vector_x
>>              _atom_site_Fourier_wave_vector_description
>>                   1      0.568     'First harmonic'
>>                   2      1.136     'Second harmonic'
>>
>>          loop_
>>              _atom_site_displace_Fourier_id
>>              _atom_site_displace_Fourier_atom_site_label
>>              _atom_site_displace_Fourier_axis
>>              _atom_site_displace_Fourier_wave_vector_seq_id
>>                   Nb1z1   Nb1     z       1
>>                   Nb1x2   Nb1     x       2
>>                   Nb1y2   Nb1     y       2
>>                   S1x1    S1      x       1
>>                   S1y1    S1      y       1
>>                   S1z1    S1      z       1
>>                   S1x2    S1      x       2
>>                   S1y2    S1      y       2
>>                   S1z2    S1      z       2
>>
>>     #### End of modulated structure first subsystem data ######
>>
>>     # Items concerning the modulated structure of the second
>>     # subsystem
>>
>>     data_LaSNbS2_MOD_LaS
>>          # Old block identifier
>>          _audit_block_code         1997-07-24|LaSNbS2|G.M.|_MOD_LaS
>>          # Common dataset identifier
>>          _audit_dataset.id         1997-07-24|LaSNbS2|G.M.
>>          # Signal which categories are split across datablocks
>>          _audit.schema      'Modulated'
>>          # Signal the type of calculations used
>>          _audit.formalism   'Modulated Single Crystal'
>>          # The actual dictionary that this conforms to
>>          _audit_conform.dict_name 'msCIF.dic'
>>          # Projected key data name
>>          _cell_subsystem_code      LaS
>>          # Projected information for value = LaS of key data name
>>          _cell_subsystem_description  '2nd subsystem'
>>          _cell_subsystem_matrix_W_1_1   0
>>          _cell_subsystem_matrix_W_1_4   1
>>          _cell_subsystem_matrix_W_2_2   1
>>          _cell_subsystem_matrix_W_3_3   1
>>          _cell_subsystem_matrix_W_4_1   1
>>          _cell_subsystem_matrix_W_4_4   0
>>
>>          loop_
>>              _atom_site_Fourier_wave_vector_seq_id
>>              _atom_site_Fourier_wave_vector_x
>>              _atom_site_Fourier_wave_vector_z
>>              _atom_site_Fourier_wave_vector_description
>>                   1      1.761   0.5   'First harmonic'
>>                   2      3.522   1.0   'Second harmonic'
>>
>>          loop_
>>              _atom_site_displace_Fourier_id
>>              _atom_site_displace_Fourier_atom_site_label
>>              _atom_site_displace_Fourier_axis
>>              _atom_site_displace_Fourier_wave_vector_seq_id
>>                   La1x1   La1     x       1
>>                   La1y1   La1     y       1
>>                   La1z1   La1     z       1
>>                   La1x2   La1     x       2
>>                   La1y2   La1     y       2
>>                   La1z2   La1     z       2
>>                   S2x1    S2      x       1
>>                   S2y1    S2      y       1
>>                   S2z1    S2      z       1
>>                   S2x2    S2      x       2
>>                   S2y2    S2      y       2
>>                   S2z2    S2      z       2
>>
>>     ### End of modulated structure second subsystem data ######
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>>
>
>
> _______________________________________________
> comcifs mailing list
> comcifs at iucr.org
> http://mailman.iucr.org/cgi-bin/mailman/listinfo/comcifs
>
>