Standardising inter-block linking

James Hester jamesrhester at gmail.com
Tue Apr 3 01:19:56 BST 2018


Dear COMCIFS members,

The 6-week discussion period having elapsed, the inter-block linking
proposal is now accepted.  I will inform dictionary authors of the new tool
at their disposal.

James Hester (chair).


On 19 February 2018 at 16:37, James Hester <jamesrhester at gmail.com> wrote:

> Dear COMCIFS members
>
> One inadequately resolved issue that arises in dealing with the more
> complex dictionaries (msCIF/pdCIF/magCIF) is how links between data
> blocks are handled. The end of Section 2.2.6 in Vol G, 1st edition also
> alludes to a need for resolution of this problem. The following proposal
> has therefore been developed as a way of standardising linking between data
> blocks, and has the pleasant side-effect that it also explains the context
> in which dREL methods operate when presented with multiple data blocks.
> The dictionary most reliant on data block linking is the modulated
> structure CIF, and Gotzon Madariaga, the original author, has confirmed
> that this proposal would meet the needs of that dictionary.
>
> I've posted the proposal at
> https://github.com/COMCIFS/comcifs.github.io/blob/master/InterBlockLinkProposal.rst
> and have also included it as ASCII at the end of this email for the record,
> but strongly urge you to go to the above link for a more nicely formatted
> version.
>
> The proposal remains open for discussion for 6 weeks. If all issues have
> been resolved by that time, it will be considered accepted.
>
> James.
>
> Proposal to Improve Inter Block Linking
> ***************************************
>
> Introduction
> ============
>
> A number of current CIF dictionaries rely on pointers to other
> data blocks. In particular:
>
> * A powder sample can consist of multiple components (perhaps
>   confusingly called "phases").  Each of these materials is described
>   in a separate data block and then referenced from the block in which
>   the overall fit to the data is described.  Powder diffraction patterns
>   may also be referenced in the same way.
>
> * The modulated structures dictionary (msCIF) allows for composite
>   structures, with the space group and cell of each member of the
>   composite described in separate data blocks
>
> * The new magnetic structures dictionary envisions that alternative
>   descriptions of the same magnetic structure could be described in
>   separate data blocks.
>
> In addition, CIF core defines the ``_audit_link.*`` data names, which
> allow listing of data block identifiers together with some plain text
> description of the nature of the relationship between the blocks.
>
> Each scheme is different, expects different contents for the linked
> data blocks, and varies in the degree to which computers can use the
> linking information.
>
> In addition, work on 'CheckCIF for raw data' requires some way of
> working with the many different ways of arranging image frames into
> files and directories.
>
> This proposal therefore suggests formalising data block linkages in a
> way that allows arbitrary expansion as well as making the precise
> nature of the relationships between data blocks machine-readable in
> all situations.
>
> General idea
> ============
>
> First some definitions:
>
> **key**
>    the set of datanames whose values together define a unique
>    row
>
> **projection**
>    choosing a single value for each key data name, and then
>    selecting only those rows of categories that correspond to those
>    particular values, ignoring any categories for which that data name
>    is irrelevant.
>
> Any data collection can be split by projecting over the individual
> values of a data name that forms part of a key, and putting the result
> for each projection into a separate block.  These projected blocks do
> not have to include the projected dataname within loops, as the
> dataname is constant within a data block.
>
> If multiple blocks belonging to the same data collection are able to
> be described as projections of one or more data names we have
> completely defined their relationship.
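
A minimal Python sketch of the projection idea described above, for illustration only. The row layout, the `project` function and the short column names `label` and `axis` are assumptions made for this example and are not part of the proposal.

```python
from collections import defaultdict


def project(rows, key_name):
    """Split the rows of one looped category into one block per key value."""
    grouped = defaultdict(list)
    for row in rows:
        # The key value is constant within a block, so it is dropped
        # from the rows that remain inside the loop.
        rest = {name: value for name, value in row.items() if name != key_name}
        grouped[row[key_name]].append(rest)
    # Each projected block states the key value once, outside the loop.
    return {value: {key_name: value, "rows": loop_rows}
            for value, loop_rows in grouped.items()}


# Hypothetical rows keyed on (subsystem code, atom label).
atom_rows = [
    {"_cell_subsystem.code": "NbS2", "label": "Nb1", "axis": "z"},
    {"_cell_subsystem.code": "NbS2", "label": "S1", "axis": "x"},
    {"_cell_subsystem.code": "LaS", "label": "La1", "axis": "x"},
]
projected = project(atom_rows, "_cell_subsystem.code")
```

Here `projected` holds one entry per subsystem code, and the rows inside each entry no longer carry the subsystem code themselves.
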
>
> Proposal for a Universal System for Linking Data Blocks
> =======================================================
>
> In order to completely define data block interrelationships, data
> blocks resulting from projection over a key value must be:
>
> (1) linked, and
> (2) labelled with the key data name used for the projection, with the
>     correct value assigned.
>
> Linkage
> -------
>
> This is accomplished by including a dataname which has the same,
> unique value in all linked blocks.  I propose calling this dataname
> ``_audit_dataset.id``
>
> Key values
> ----------
>
> Each data block must explicitly state the value of the parent key
> against which it has been projected, by providing the parent key's
> data name and value.
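
The two requirements above (a shared ``_audit_dataset.id`` plus an explicit key value in each block) are enough for software to reassemble the notional loop. The following Python sketch illustrates this under stated assumptions: blocks are modelled as plain dictionaries, and `collect_dataset`, the `rows` entry and the identifiers `ds1` and `other` are hypothetical names invented for the example.

```python
def collect_dataset(blocks, dataset_id, key_name):
    """Rebuild the notional loop for key_name from all blocks of one dataset."""
    merged = []
    for block in blocks:
        if block.get("_audit_dataset.id") != dataset_id:
            continue  # block belongs to a different dataset
        if key_name not in block:
            continue  # key is irrelevant here, e.g. a common-data block
        for row in block.get("rows", []):
            # Restore the projected key value as an explicit column.
            merged.append({key_name: block[key_name], **row})
    return merged


blocks = [
    {"_audit_dataset.id": "ds1", "_cell_subsystem.code": "NbS2",
     "rows": [{"label": "Nb1"}, {"label": "S1"}]},
    {"_audit_dataset.id": "ds1", "_cell_subsystem.code": "LaS",
     "rows": [{"label": "La1"}]},
    {"_audit_dataset.id": "ds1"},  # common data, no key value stated
    {"_audit_dataset.id": "other", "_cell_subsystem.code": "X",
     "rows": [{"label": "Q"}]},  # unrelated dataset
]
loop = collect_dataset(blocks, "ds1", "_cell_subsystem.code")
```

Only blocks that share the dataset identifier and state the key contribute rows, so the order and file layout of the blocks do not matter to the result.
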
>
>
> Effect on existing dictionaries
> ===============================
>
> The following examines changes or enhancements required to implement
> this approach in those dictionaries that already use inter-block
> references. In no case do existing data names require redefinition or
> removal; the new system can exist alongside the old system.
>
> Comparison for msCIF
> --------------------
>
> The current msCIF arrangement splits the information for composite
> structures over a number of data blocks. The 'master' block contains a
> loop indexed by ``_cell_subsystem.code``, the value of which is used to
> find structural data in a separate data block that has an
> ``_audit_link.block_code`` identifier composed as
> ``<arbitrary>REFRNCE_<code>``. ``<arbitrary>`` is a string common to all
> linked blocks and ``<code>`` is the value of ``_cell_subsystem.code``
> appearing in a row of the ``_cell_subsystem`` loop. If the word MOD is
> used instead of REFRNCE, the data block contains a description of the
> modulated structure of this component. If no trailing ``<code>`` is
> present, the data block contains either common structural features
> (REFRNCE) or the whole modulated structure (MOD). If neither REFRNCE nor
> MOD is present, the data block contains common data items. In this
> approach, the links between data blocks
> are created through the ``_audit_link.block_code`` value matching, and
> the nature of the link is signalled by the reserved words MOD and
> REFRNCE.
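
To show what software has to do under the current msCIF convention, here is a rough Python sketch that decodes a block code of the form ``<arbitrary>REFRNCE_<code>`` or ``<arbitrary>MOD_<code>``. The function name and regular expression are assumptions made for illustration. In particular, the sketch returns everything before the reserved word unchanged as the prefix (including any trailing separator characters), and it would misread an ``<arbitrary>`` string that itself contains MOD or REFRNCE.

```python
import re

# Reserved word, optionally followed by "_<code>", at the end of the string.
_BLOCK_CODE = re.compile(
    r"^(?P<prefix>.*?)(?P<kind>REFRNCE|MOD)(?:_(?P<code>.+))?$"
)


def parse_block_code(block_code):
    """Split an old-style msCIF block code into prefix, kind and subsystem code."""
    match = _BLOCK_CODE.match(block_code)
    if match is None:
        # Neither reserved word present: the block holds common data items.
        return {"prefix": block_code, "kind": None, "code": None}
    return match.groupdict()


subsystem = parse_block_code("1997-07-24|LaSNbS2|G.M.|_MOD_NbS2")
whole = parse_block_code("1997-07-24|LaSNbS2|G.M.|_REFRNCE")
common = parse_block_code("1997-07-24|LaSNbS2|G.M.|")
```

The need for string matching of this kind is what the proposed approach replaces with explicit data names.
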
>
> In the proposed approach (see example at end):
>
> 1. ``<arbitrary>`` becomes the value of ``_audit_dataset.id``
> 2. ``_cell_subsystem.code`` should be given in each data block for which
>    it is relevant
> 3. the loop over ``_cell_subsystem.code`` in the "master" block should
>    either be removed, or else a new child data name defined for use
>    in projection (choice of dictionary authors).
> 4. The msCIF dictionary should include child data names of
>    ``_cell_subsystem.code`` in the ``space_group``, ``cell`` and all
>    ``_atom_site_*`` categories
>
> Comparison for pdCIF
> --------------------
>
> pdCIF defines a ``_pd.block_id`` and then links to blocks via this ID
> in three distinct ways (Vol G p 124):
>
> (1) To refer to a block containing a diffraction measurement
> (2) To refer to a block containing a structure description
> (3) To refer to a block containing calibration information
>
> In the proposed approach:
>
> 1. ``_audit_dataset.id`` can be used instead of, or alongside,
>    ``_pd.block_id``. Whereas ``_pd.block_id`` is different for every
>    block, ``_audit_dataset.id`` is identical for blocks belonging to a
>    single dataset.
> 2. A new (key) data name, e.g. ``_pd_diffractogram.id``, is created and
>    stated in any data blocks containing diffractograms.
> 3. The value of ``_pd.phase_id`` is stated in any data blocks containing
>    structures.
> 4. New child key data names are added to any categories that could appear
>    in the separate ``_pd.phase_id`` or ``_pd_diffractogram.id`` blocks.
> 5. Calibration: an external calibration dataset is a combination of a
>    diffractogram and a phase. A ``_pd_calib_std_external.phase_id`` and
>    ``_pd_calib_std_external.diffractogram_id`` should be additionally
>    defined. (1),(2) and (3) are carried out as relevant for the
>    calibration dataset and phase. If calibrations involve more than
>    powder diffraction measurements, further data names describing
>    these measurements should be defined.
>
> magCIF
> ------
>
> The magnetic structures dictionary wishes to link to alternative
> descriptions of the same magnetic structure in separate data blocks. In
> this case:
>
> 1. ``_audit_dataset.id`` is set to be identical in all relevant data
>    blocks
> 2. a data name along the lines of ``_magn_structure_transform.id`` is set
>    in each of these data blocks
> 3. Child data names of ``_magn_structure_transform.id`` are added to all
>    categories that might be used in describing an alternative structure.
>
> Advantages
> ==========
>
> 1. To a large extent, data can be added to datasets by simply creating
>    a new data block with the same ``_audit_dataset.id``.  For example,
>    an extra measurement on a new sample of the same compound will
>    automatically be (semantically) incorporated into a dataset simply
>    by becoming present, whether in a separate file or an appended block
> 2. dREL methods can be written in complete ignorance of the way in
>    which data have been distributed over data blocks. In effect, a dREL
>    method operates in the context of all data available for a given value
>    of ``_audit_dataset.id``.
> 3. The effects of unexpected looping over 'Set' datanames that
>    ``_audit.schema`` addresses can be reduced by using separate data
>    blocks. So the choice
>    exists to split multiple crystals, multiple space-groups etc. over
>    multiple data blocks, without changing the underlying semantics.
> 4. Formats which collate many files to form the dataset are easy to
>    describe in this paradigm: for example, image frames in separate
>    files are simply assigned to the same dataset, with each file
>    including the value of the image identifier data name used to
>    'project' the data file from the notional loop of images.
> 5. The system is open-ended in terms of allowing disparate items of
>    information to be collated together with well-defined relationships.
>    This means it can essentially cover all ways of aggregating data into
>    datasets.
> 6. The old block linkage systems can remain in place and can be used to
>    provide double-checking where possible.
>
> Disadvantages
> =============
>
>
> 1. Flexibility in how data from complex datasets is distributed over
>    data blocks may cause unnecessary work for data reading software
>    programmers attempting to cover all situations.  This could be
>    remedied by individual dictionaries recommending particular
>    approaches.
>
> Interaction with ``_audit.schema``
> ==================================
>
> We have recently defined a data name, ``_audit.schema``, that signals
> when 'Set' categories have become looped in a data block. The present
> proposal allows 'Set' categories to be always single-valued in a
> single data block, yet take multiple values for the dataset as a
> whole.  We must therefore choose between alternative meanings of
> ``_audit.schema``: does it mean that 'Set' categories are looped
> semantically or both semantically and syntactically (obviously if Set
> categories are looped in a single data block (syntactically) then they
> are also semantically looped)?  I propose that, even if all data
> blocks conform to the default schema, at least some values in related
> data blocks are likely to be materially significant for interpretation
> of one another (for example, multiple crystal measurements feed into
> final values of I_meas) and so ``_audit.schema`` should indicate
> semantic looping, i.e.
>
> * ``_audit.schema`` **must** take a non-default value where Set categories
>   can take multiple values **and** a data block contains loops over
>   these Set categories.
>
> * ``_audit.schema`` **must** take the appropriate non-default value if
>   information for a dataset has been spread over several data blocks.
>
> * ``_audit.schema`` **may** take the default value **only** if the dataset
>   consists of a single block conforming to the core CIF dictionary.
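
The three rules above reduce to a single condition on when the default value is allowed. A small Python sketch of that condition follows. The function and parameter names are hypothetical, and deciding which non-default value applies is left to the relevant dictionary.

```python
def default_schema_allowed(n_blocks, loops_over_set, conforms_to_core=True):
    """Return True only when _audit.schema may take its default value.

    n_blocks         -- number of data blocks making up the dataset
    loops_over_set   -- True if any block loops over a 'Set' category
    conforms_to_core -- True if the block conforms to the core CIF dictionary
    """
    return n_blocks == 1 and not loops_over_set and conforms_to_core


single_core_block = default_schema_allowed(1, False)
spread_over_blocks = default_schema_allowed(3, False)
looped_set_category = default_schema_allowed(1, True)
```

Under this reading, a dataset spread over several blocks always signals a non-default schema, even when each block on its own has single-valued 'Set' categories.
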
>
> On datasets
> ===========
>
> Note that a single data block can belong to multiple datasets: for example,
> calibration information may be relevant to multiple data collections, or a
> single measurement may be relevant to different modelling exercises (e.g.
> joint or single refinement of X-ray and neutron data) and therefore have
> different dataset identifiers in each case.
>
> Discussion
> ==========
>
> This approach is close in spirit to the work of Nick Spadaccini and
> Syd Hall in creating DDLm Ref-loops, which were projections of specified
> Set categories into save frames. The current proposal removes the
> syntactical element, exposes the behaviour of the keys, and adopts a
> global relational view of the underlying semantics.
>
> Note that ``_audit_dataset.id`` is a grouping mechanism. The particular
> value taken by this data name is only relevant if the dataset is referred to
> externally to that dataset. Therefore, if a data format allows data blocks
> to be grouped in some other way (e.g. files in a directory, nodes in a
> hierarchy) there is no need to explicitly assign a value to this dataname
> during data block creation.
>
> Example
> =======
>
> The following example shows part of a CIF for a modulated structure
> composed of two components, LaS and NbS2 (based on `Example 3, p 271,
> IT Vol G <http://it.iucr.org/Ga/ch4o3v0001/Catom_site_displace_Fourier.html>`_)
> ::
>
>     # Common data
>     data_LaSNbS2
>     # The common dataset identifier
>     _audit_dataset.id  1997-07-24|LaSNbS2|G.M.
>     # Signal which categories are split across datablocks
>     _audit.schema      'Modulated'
>     # Signal the type of calculations used
>     _audit.formalism   'Modulated Single Crystal'
>     # The actual dictionary that this conforms to
>     _audit_conform.dict_name 'msCIF.dic'
>     # Old linkage data may be kept. Not all following blocks included in
>     # this example for brevity
>     loop_
>              _audit_link_block_code
>              _audit_link_block_description
>     1997-07-24|LaSNbS2|G.M.|
>                       'common experimental and publication data'
>     1997-07-24|LaSNbS2|G.M.|_REFRNCE
>                              'reference structure (common data)'
>     1997-07-21|LaSNbS2|G.M.|_MOD
>                              'modulated structure (common data)'
>     1997-07-24|LaSNbS2|G.M.|_MOD_NbS2
>                            'modulated structure (1st subsystem)'
>     1997-07-24|LaSNbS2|G.M.|_REFRNCE_LaS
>                            'reference structure (2nd subsystem)'
>     1997-07-21|LaSNbS2|G.M.|_MOD_LaS
>                            'modulated structure (2nd subsystem)'
>
>     _cell_subsystems_number                  2
>     # The following loop is now split across data blocks
>     # or retained with a child data name used for projection
>     #loop_
>     #     _cell_subsystem_code
>     #     _cell_subsystem_description
>     #     _cell_subsystem_matrix_W_1_1
>     #     _cell_subsystem_matrix_W_1_4
>     #     _cell_subsystem_matrix_W_2_2
>     #     _cell_subsystem_matrix_W_3_3
>     #     _cell_subsystem_matrix_W_4_1
>     #     _cell_subsystem_matrix_W_4_4
>     #             NbS2            '1st subsystem'  1 0 1 1 0 1
>     #             LaS             '2nd subsystem'  0 1 1 1 1 0
>
>     # Common experimental and publication data elided ...
>
>     # Items concerning the modulated structure of the first
>     # subsystem
>
>     data_LaSNbS2_MOD_NbS2
>          # Old block identifier
>          _audit_block_code         1997-07-24|LaSNbS2|G.M.|_MOD_NbS2
>          # Common dataset identifier
>          _audit_dataset.id         1997-07-24|LaSNbS2|G.M.
>          # Signal which categories are split across datablocks
>          _audit.schema      'Modulated'
>          # Signal the type of calculations used
>          _audit.formalism   'Modulated Single Crystal'
>          # The actual dictionary that this conforms to
>          _audit_conform.dict_name 'msCIF.dic'
>          # Projected key data name
>          _cell_subsystem_code      NbS2
>          # Projected information for value = NbS2 of key data name
>          _cell_subsystem_description  '1st subsystem'
>          _cell_subsystem_matrix_W_1_1   1
>          _cell_subsystem_matrix_W_1_4   0
>          _cell_subsystem_matrix_W_2_2   1
>          _cell_subsystem_matrix_W_3_3   1
>          _cell_subsystem_matrix_W_4_1   0
>          _cell_subsystem_matrix_W_4_4   1
>
>          loop_
>              _atom_site_Fourier_wave_vector_seq_id
>              _atom_site_Fourier_wave_vector_x
>              _atom_site_Fourier_wave_vector_description
>                   1      0.568     'First harmonic'
>                   2      1.136     'Second harmonic'
>
>          loop_
>              _atom_site_displace_Fourier_id
>              _atom_site_displace_Fourier_atom_site_label
>              _atom_site_displace_Fourier_axis
>              _atom_site_displace_Fourier_wave_vector_seq_id
>                   Nb1z1   Nb1     z       1
>                   Nb1x2   Nb1     x       2
>                   Nb1y2   Nb1     y       2
>                   S1x1    S1      x       1
>                   S1y1    S1      y       1
>                   S1z1    S1      z       1
>                   S1x2    S1      x       2
>                   S1y2    S1      y       2
>                   S1z2    S1      z       2
>
>     #### End of modulated structure first subsystem data ######
>
>     # Items concerning the modulated structure of the second
>     # subsystem
>
>     data_LaSNbS2_MOD_LaS
>          # Old block identifier
>          _audit_block_code         1997-07-24|LaSNbS2|G.M.|_MOD_LaS
>          # Common dataset identifier
>          _audit_dataset.id         1997-07-24|LaSNbS2|G.M.
>          # Signal which categories are split across datablocks
>          _audit.schema      'Modulated'
>          # Signal the type of calculations used
>          _audit.formalism   'Modulated Single Crystal'
>          # The actual dictionary that this conforms to
>          _audit_conform.dict_name 'msCIF.dic'
>          # Projected key data name
>          _cell_subsystem_code      LaS
>          # Projected information for value = LaS of key data name
>          _cell_subsystem_description  '2nd subsystem'
>          _cell_subsystem_matrix_W_1_1   0
>          _cell_subsystem_matrix_W_1_4   1
>          _cell_subsystem_matrix_W_2_2   1
>          _cell_subsystem_matrix_W_3_3   1
>          _cell_subsystem_matrix_W_4_1   1
>          _cell_subsystem_matrix_W_4_4   0
>
>          loop_
>              _atom_site_Fourier_wave_vector_seq_id
>              _atom_site_Fourier_wave_vector_x
>              _atom_site_Fourier_wave_vector_z
>              _atom_site_Fourier_wave_vector_description
>                   1      1.761   0.5   'First harmonic'
>                   2      3.522   1.0   'Second harmonic'
>
>          loop_
>              _atom_site_displace_Fourier_id
>              _atom_site_displace_Fourier_atom_site_label
>              _atom_site_displace_Fourier_axis
>              _atom_site_displace_Fourier_wave_vector_seq_id
>                   La1x1   La1     x       1
>                   La1y1   La1     y       1
>                   La1z1   La1     z       1
>                   La1x2   La1     x       2
>                   La1y2   La1     y       2
>                   La1z2   La1     z       2
>                   S2x1    S2      x       1
>                   S2y1    S2      y       1
>                   S2z1    S2      z       1
>                   S2x2    S2      x       2
>                   S2y2    S2      y       2
>                   S2z2    S2      z       2
>
>     ### End of modulated structure second subsystem data ######
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148