Variants

Herbert J. Bernstein yaya at bernstein-plus-sons.com
Thu Nov 26 02:13:09 GMT 2009


Dear Colleagues,

   Please look at what has been proposed.  The proposal currently put 
forward by David is for 2 variants of the wavelength in the same CIF:


>               loop_
>                   _diffrn_radiation_wavelength_id
>                   _diffrn_radiation_wavelength
>                   _diffrn_radiation_wavelength_determination
>                      1   1.23456   fundamental
>                      2   1.25      estimated

This is what was proposed by David based on Nick's suggestion: 2 
wavelengths in one CIF data block. The defect in the proposal is that it 
is then clumsy to couple the two wavelengths to other values that change 
when the wavelength changes, such as cell dimensions.  Use of variants 
helps to resolve that loose end.

Now, to the substance of James' objections to variants:

> (1) It seems to me that the closer a given CIF file is to the raw data, the
> more useful recording of variants is, as the best path forward has not yet
> been identified and so keeping different variations is useful; conversely,

I beg to differ.  Final, released PDB entries have alternate conformers
and multiple versions of nomenclature.  As a combination of noisy 
observations with uncertain models, our science tends to produce multiple
equally valid results (look at any NMR entry), and variants are a clean
way to deal with them all the way through an experiment, keeping track
of what was done instead of losing it.

> Taking the wavelength proposal as an example: once someone has refined a
> wavelength from a standard material, what the nominal wavelength was is no
> longer scientifically relevant, and so there is no reason to keep it and all
> derived values in the file (which is why Nick's alternative wavelength
> proposal is preferable, as only one wavelength is in the file).

Certainly, in a perfect world with perfect science, the derivation of a
wavelength would be a reliable, one-step process, but things go wrong,
and different things get tried with different values used to derive
other values (such as cells), and by failing to track and label things
as the experiment progresses, we create the non-trivial risk of coupling
the cell from one wavelength with the wavelength for another.  Worse,
in some cases, the choice of space group can flop around depending on
refinements of wavelengths and cells.  Yes, an author may decide to
publish only one resulting final determination, but both the author
and people trying to reproduce the results are likely to do better
science if there is an option to preserve an audit trail of how the
original author got to their "final" result.  Someone else might
decide to look at things differently.  I am not saying everyone has to
use variants, any more than the PDB will tell an author they have
to present alternate conformers for some fuzzy density, but it really
is useful to have the option.
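
As a minimal sketch of what such an audit trail could look like in
software, the following Python follows the _variant_variant_of links from
one variant back to the default null variant.  It assumes the variant
loop has already been parsed into a list of dicts keyed by data name; the
function name audit_trail and that in-memory layout are hypothetical, not
part of any existing CIF library.  The values are the ones from the
example quoted later in this message.

```python
def audit_trail(variant_rows, start):
    """Follow _variant_variant_of links from `start` back towards the null variant.

    Returns the variant ids in order, starting with `start`. The walk stops
    at a null parent ("."), an unknown id, or a repeated id (cycle guard).
    """
    parent = {
        row["_variant_variant"]: row.get("_variant_variant_of", ".")
        for row in variant_rows
    }
    trail = []
    seen = set()
    current = start
    while current in parent and current not in seen:
        trail.append(current)
        seen.add(current)
        current = parent[current]
    return trail


# Assumed in-memory form of the variant loop (hypothetical layout).
variants = [
    {"_variant_variant": "final", "_variant_role": "preferred",
     "_variant_timestamp": "2007-08-04T01:17:28",
     "_variant_variant_of": "prelim", "_variant_details": "refined"},
    {"_variant_variant": "prelim", "_variant_role": ".",
     "_variant_timestamp": "2007-08-03T23:20:00",
     "_variant_variant_of": ".", "_variant_details": "."},
]

trail = audit_trail(variants, "final")
print(" <- ".join(trail))
```

Because the same variant identifier tags both the wavelength and the
cell, each step of the trail can be used to pull out the matching pair of
values rather than mixing a cell from one step with a wavelength from
another.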

> (2) Introducing variants means that multiple values for simple items such as
> cell parameters could be present in a single datablock, and CIF reading
> software must be rewritten to recognise which of those instances of cell
> parameters it needs to care about (not to mention all those programs which
> expect unlooped cell parameters...).  This is a very serious issue for small
> molecule CIF, where many programs already exist.  I don't expect that this
> is so serious for imgCIF, where (unfortunately) imgCIF applications are thin
> on the ground.

Inasmuch as what is proposed conforms to all existing CIF specifications, 
it is not the CIF reading software that would need to be rewritten, but 
the false assumption in some minds that there is only one right answer to 
a question that needs to be rewritten.  It would be trivial to create a 
filter program to return one best variant (the one with the "preferred" 
role) if that was needed, but I would hope that at least some CIF users 
would not exercise such tunnel vision.
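
A minimal sketch of such a filter, in Python, is given here under stated
assumptions: the loops are taken to be already parsed into lists of dicts
keyed by data name, and the function names preferred_variant_id and
select_variant are hypothetical rather than drawn from any existing CIF
library.  The values are the ones from the example quoted later in this
message.

```python
def preferred_variant_id(variant_rows):
    """Return the id of the first variant whose role is 'preferred', else None."""
    for row in variant_rows:
        if row.get("_variant_role") == "preferred":
            return row["_variant_variant"]
    return None


def select_variant(rows, variant_key, variant_id):
    """Keep only rows tagged with `variant_id` and drop the variant tag itself,
    so the result looks like the single-valued data older software expects."""
    return [
        {name: value for name, value in row.items() if name != variant_key}
        for row in rows
        if row.get(variant_key) == variant_id
    ]


# Assumed in-memory form of the three loops (hypothetical layout).
variants = [
    {"_variant_variant": "final", "_variant_role": "preferred"},
    {"_variant_variant": "prelim", "_variant_role": "."},
]
wavelengths = [
    {"_diffrn_radiation_wavelength_variant": "final",
     "_diffrn_radiation_wavelength": "1.23456"},
    {"_diffrn_radiation_wavelength_variant": "prelim",
     "_diffrn_radiation_wavelength": "1.25"},
]
cells = [
    {"_cell_variant": "final", "_cell_length_a": "22.5"},
    {"_cell_variant": "prelim", "_cell_length_a": "22.3"},
]

best = preferred_variant_id(variants)
wavelength = select_variant(
    wavelengths, "_diffrn_radiation_wavelength_variant", best)
cell = select_variant(cells, "_cell_variant", best)
print(best, wavelength, cell)
```

Selecting by one shared identifier is what keeps the returned wavelength
and cell consistent with each other.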

> (3) What are our use cases for this change?  What is the motivation? 
> Perhaps Herb could speak to this.

I just did.

> (4) Introduction of DDLm and dREL may change the variant scheme such that
> only a limited set of variant values would need to be made available in any
> CIF data file, and a dREL engine could then calculate out the corresponding
> alternative derived values (and all combinations...).  But again, for
> published data, we expect the author to have done this already and chosen
> the best result.

This is backwards.  DDLm and dREL make it easier to manage precisely the 
relationships among variants, but you need to actually tag the variants to 
be able to use it.  As for the last comment:

> But again, for published data, we expect the author to have done this 
> already and chosen the best result.

This is unrealistic and inconsistent with the ripply bottom we all have 
to deal with in non-linear least-squares refinements for model fitting, 
with the reality of NMR studies, and the reality of alternate conformers 
and micro-heterogeneity.

My apologies if this seems a brusque reply -- charge it to the aches, 
pains and headache that come with my broken nose, but try to keep an open 
mind on this one.  It will get used by some people, inasmuch as it is part 
of the latest CBF dictionary, and even if people further downstream don't 
want to use it in their CIFs, it would pay to understand it.

I see an unfortunate divergence developing between what imgCIF/CBF users
need and the way in which CIF itself is headed.  If everyone else is
OK with a somewhat different CIF dialect for images, I can live with that,
but I hope that we can at least make clear definitions of the interfaces
among dialects.  Variants are part of that interface.

Regards,
   Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya at dowling.edu
=====================================================

On Thu, 26 Nov 2009, James Hester wrote:

> I'll make some opening comments regarding the idea of variants:
> 
> (1) It seems to me that the closer a given CIF file is to the raw data, the
> more useful recording of variants is, as the best path forward has not yet
> been identified and so keeping different variations is useful; conversely,
> when publishing what is supposed to be the final ("correct") result, the
> interest of the wider community will primarily be in this result, rather
> than alternative results that are considered by the author to be inferior. 
> Taking the wavelength proposal as an example: once someone has refined a
> wavelength from a standard material, what the nominal wavelength was is no
> longer scientifically relevant, and so there is no reason to keep it and all
> derived values in the file (which is why Nick's alternative wavelength
> proposal is preferable, as only one wavelength is in the file).
> 
> (2) Introducing variants means that multiple values for simple items such as
> cell parameters could be present in a single datablock, and CIF reading
> software must be rewritten to recognise which of those instances of cell
> parameters it needs to care about (not to mention all those programs which
> expect unlooped cell parameters...).  This is a very serious issue for small
> molecule CIF, where many programs already exist.  I don't expect that this
> is so serious for imgCIF, where (unfortunately) imgCIF applications are thin
> on the ground.
> 
> (3) What are our use cases for this change?  What is the motivation? 
> Perhaps Herb could speak to this.
> 
> (4) Introduction of DDLm and dREL may change the variant scheme such that
> only a limited set of variant values would need to be made available in any
> CIF data file, and a dREL engine could then calculate out the corresponding
> alternative derived values (and all combinations...).  But again, for
> published data, we expect the author to have done this already and chosen
> the best result.
> 
> 
> On Thu, Nov 26, 2009 at 10:59 AM, James Hester <jamesrhester at gmail.com>
> wrote:
>       I'm reposting Herbert's message in a new thread to aid
>       organisation.  Herbert wrote:
>
>       ----
>       Dear Colleagues,
>
>        While you are revisiting this item, I would suggest you
>       consider the more complete (and, I believe, more elegant and
>       general) solution of defining "variants", that we have
>       introduced into the imgCIF dictionary to handle quantities that
>       may be determined in different ways.
>
>        Instead of adding
>
>        _diffrn_radiation_wavelength_determination
>
>       you would add
>
>        _diffrn_radiation_wavelength_variant
>
>       and a new variant category
>
>              _variant_variant
>              _variant_role
>              _variant_timestamp
>              _variant_variant_of
>              _variant_details
> 
> which would allow you with complete generality to manage any number
> of refined or redefined quantities, such as wavelengths.  This would
> then allow you to use the same variant identifier for, say, cell
> dimensions, which could be expected to change in a coupled manner
> with the changes in wavelength.
> 
>  If you are interested in this more complete approach, I can provide
> you with the full item definitions, but the short form is:
> 
>        _variant_variant
> 
>              The value of _variant_variant must uniquely identify
>              each variant for the given diffraction experiment and/or
>              entry
> 
>        _variant_role
> 
>              The value of _variant_role  specifies a role
>              for this variant.  Possible roles are null, "preferred",
>              "raw data", and "unsuccessful trial".
> 
>        _variant_timestamp
> 
>              The date and time identifying a variant.  This is not
>              necessarily the precise time of the measurement or
>              calculation of the individual related data items, but a
>              timestamp that reflects the order in which the variants
>              were defined.
> 
>        _variant_variant_of
> 
>              The value of _variant_variant_of gives the variant
>              from which this variant was derived.  If this value is
>              not given, the variant is assumed to be derived from the
>              default null variant.
> 
>        _variant_details
> 
>              A description of special aspects of the variant
> 
>
>       An example of how this might be used is:
>
>               loop_
>                   _diffrn_radiation_wavelength_id
>                   _diffrn_radiation_wavelength
>                   _diffrn_radiation_wavelength_determination
>                      1   1.23456   fundamental
>                      2   1.25      estimated
> 
> 
> 
> would become
> 
>          loop_
>              _diffrn_radiation_wavelength_variant
>              _diffrn_radiation_wavelength
>                 final   1.23456
>                 prelim  1.25
>          loop_
>              _variant_variant
>              _variant_role
>              _variant_timestamp
>              _variant_variant_of
>              _variant_details
>              final preferred 2007-08-04T01:17:28 prelim refined
>              prelim .        2007-08-03T23:20:00 . .
> 
>          loop_
>             _cell_variant
>             _cell_length_a
>             _cell_length_b
>             _cell_length_c
>             _cell_angle_alpha
>             _cell_angle_beta
>             _cell_angle_gamma
>             final  22.5 22.5 22.5 90. 90. 90.
>             prelim 22.3 22.3 22.3 90. 90. 90.
> 
> 
>  Regards,
>    Herbert
> 
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
> 
>                 +1-631-244-3035
>                 yaya at dowling.edu
> =====================================================
> 
> 
> 
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>

