Purely calculated structural data in CIF

Antanas Vaitkus antanas.vaitkus90 at gmail.com
Sun Jun 26 23:09:36 BST 2022


Dear all,

the Crystallography Open Database (COD) maintainers have also encountered
a similar problem of identifying and marking purely calculated (theoretical)
entries that accidentally make it into the COD. Our approach is similar to
the one proposed by John -- we use a set of heuristics to semi-automatically
identify potentially theoretical entries and then manually mark them using
the '_cod_struct_determination_method' data item from the COD CIF
dictionary. This data item currently takes one of three enumerated values
['single crystal', 'powder diffraction', 'theoretical'], so in a sense it
can be viewed as a rudimentary, COD-specific version of the '_exptl.method'
data item. Having a more standardised approach would be extremely helpful.

Actually, the latest DDLm version of the CIF CORE dictionary [1] already
contains the '_exptl.method' data item that John mentioned, although in a
slightly different form than the one in the mmCIF dictionary [2]. The main
difference is that the CIF CORE version is a free-form text field, while
the mmCIF version is an enumerated set with 13 different values such as
"X-RAY DIFFRACTION" and "ELECTRON MICROSCOPY", one of which is
"THEORETICAL MODEL". I think that converting the CIF CORE version to an
enumerated set would also make sense, especially for the application
discussed in this thread.
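
To illustrate why an enumerated set would help here: with fixed values,
a tool could recognise a calculated structure by a plain comparison instead
of searching free text. Below is a minimal Python sketch. The function names
are hypothetical, KNOWN_METHODS holds only the three values quoted above
(not the full mmCIF enumeration), and ignoring case and surrounding quotes
is my own assumption rather than anything the dictionaries prescribe.

```python
# Partial list: only the values quoted in this thread. The mmCIF
# dictionary [2] defines the full enumeration.
KNOWN_METHODS = {"X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "THEORETICAL MODEL"}


def normalise_method(value):
    """Strip CIF quoting, collapse whitespace and upper-case the value."""
    unquoted = value.strip().strip("'\"")
    return " ".join(unquoted.split()).upper()


def is_theoretical(value):
    """Return True if an '_exptl.method' value marks a calculated structure."""
    return normalise_method(value) == "THEORETICAL MODEL"


if __name__ == "__main__":
    for raw in ("'theoretical model'", "X-RAY DIFFRACTION"):
        print(raw, "->", is_theoretical(raw))
```

With a free-form text field the same check would have to guess at spellings
such as "DFT" or "calculated", which is the problem Mike described.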

Several alternative approaches could also be explored, for example:
a) Introduce a new data item that marks a structure as theoretical
    (e.g. with yes/no values).
b) Introduce a new data item that specifies the *theoretical* method
    that was used (e.g. with values such as "*Ab initio* optimization",
    "Geometric modelling", "Molecular dynamics", etc.). This data item
    would only appear in theoretically calculated data files and, in
    combination with the '_exptl.method' data item, would make it possible
    to describe various situations such as "theoretical X-ray diffraction
    data calculated using geometric modelling", "powder diffraction
    experiment calculated using the Monte-Carlo method", etc. If I am not
    mistaken, a similar approach is adopted by the ICSD (see paper [3],
    especially section 4.2).

Also, I would like to add to the list of heuristics that John proposed.
Often in the files of theoretically calculated structures:
* Lattice parameters are provided with a very precise decimal part
  (more than 4 digits) and without standard uncertainties (no trailing
  parentheses with the s.u. values).
* The Z number ('_cell_formula_units_Z') is not provided.
* Atomic displacement parameters are either not provided at all or
  all values are set to 0 ('_atom_site_U_iso_or_equiv',
  'ATOM_SITE_ANISO' loop).
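
As a rough illustration, the three checks above could be sketched in Python
as follows. This is only a sketch and not the code used by the COD: the
function names are hypothetical, the data block is assumed to be already
parsed into a plain dict (single values as strings, looped values as lists
of strings), and only U-type displacement parameters are inspected.

```python
import re

# Plain CIF number with an optional s.u. in parentheses, e.g. "5.4310(2)".
_NUMBER = re.compile(r"^[+-]?\d+(?:\.(?P<dec>\d+))?(?:\((?P<su>\d+)\))?$")

CELL_ITEMS = (
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
)


def _is_missing(value):
    """True for an absent item or the CIF placeholders '?' and '.'."""
    return value is None or value in ("?", ".")


def _is_zero(value):
    """True if the value is a plain CIF number equal to zero."""
    match = _NUMBER.match(value)
    return bool(match) and float(value.split("(")[0]) == 0.0


def theoretical_hints(block):
    """Return the heuristics from the list above that a data block triggers."""
    hints = []

    # 1. Any lattice parameter with more than 4 decimals and no s.u.
    for name in CELL_ITEMS:
        value = block.get(name)
        if _is_missing(value):
            continue
        match = _NUMBER.match(value)
        if match and match.group("su") is None and len(match.group("dec") or "") > 4:
            hints.append("precise cell parameter without s.u.")
            break

    # 2. The Z number is not provided.
    if _is_missing(block.get("_cell_formula_units_Z")):
        hints.append("Z not provided")

    # 3. Displacement parameters are absent or all set to zero.
    adps = [
        value
        for name, values in block.items()
        if name == "_atom_site_U_iso_or_equiv" or name.startswith("_atom_site_aniso_U_")
        for value in values
        if not _is_missing(value)
    ]
    if not adps or all(_is_zero(value) for value in adps):
        hints.append("ADPs absent or all zero")

    return hints


if __name__ == "__main__":
    calculated = {
        "_cell_length_a": "5.431020",
        "_cell_angle_alpha": "90",
        "_atom_site_U_iso_or_equiv": ["0.000", "0.000"],
    }
    print(theoretical_hints(calculated))
```

None of these hints is conclusive on its own, so in practice they would only
be used to shortlist entries for manual inspection, as we do in the COD.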

Finally, I invite you to take a look at the theoretical structures that
are already in the COD to expand the set of heuristics even further.
Note that the COD files have undergone some curation, so some of
the strange features might have been stripped out. However, all of
the files contain references to the original publication in case you
would like to take a more purist approach. The full list of theoretical
structures can be retrieved using the following MySQL query:

mysql -u cod_reader -h www.crystallography.net cod -e 'SELECT `file` FROM `data` WHERE `method`="theoretical"';

Feel free to write me a personal email in case you need further
advice on retrieving data from the COD.

[1] https://github.com/COMCIFS/cif_core/blob/master/cif_core.dic, commit 306cd53
[2] https://mmcif.wwpdb.org/dictionaries/ascii/mmcif_pdbx_v50.dic
[3] https://journals.iucr.org/j/issues/2019/05/00/in5024/index.html

Sincerely,
Antanas Vaitkus

On Fri, 25 Feb 2022 at 19:00, Bollinger, John C via coreDMG <
coredmg at iucr.org> wrote:

> Dear Mike,
>
>
>
> As far as I am aware, we have no convention for this in Core CIF, but in
> mmCIF, it appears that one would be expected to use …
>
>
>
> _exptl.method 'theoretical model'
>
>
>
> … to flag a computed structure.  Other values of that data name supported
> by mmCIF provide for identifying various kinds of diffraction and NMR
> experiments by which the associated structure was determined.  We could
> consider adding a corresponding item to Core CIF to support such marking
> going forward, but of course that does not help with recognizing existing
> CIFs describing computed structures.
>
>
>
> As for identifying existing core CIFs describing structures determined ab
> initio or from molecular modeling, I don’t see a better approach than
> heuristics such as you describe already using.  Additional characteristics
> that such heuristics might check, especially in the context of checkCIF,
> would be absence of non-null values for substantially all data names in the
> _diffrn*, _exptl*, _refine*, _refln* and _reflns* categories.  Exceptions
> that  might be expected to be present include the proposed _exptl_method
> item; *_details items; and a handful of items, such as
> _exptl_crystal_absorpt_coefficient_mu, that are actually computed from the
> structure rather than being measured.
>
>
>
> Best regards,
>
>
>
> John Bollinger
>
>
>
>
>
> --
>
> John C. Bollinger, Ph.D., RHCSA
>
> Computing and X-Ray Scientist
>
> Department of Structural Biology
>
> St. Jude Children's Research Hospital
>
> John.Bollinger at StJude.org
>
> (901) 595-3166 [office]
>
> www.stjude.org
>
> *From:* coreDMG <coredmg-bounces at iucr.org> *On Behalf Of *Mike Hoyland
> via coreDMG
> *Sent:* Thursday, February 24, 2022 11:03 AM
> *To:* coredmg at iucr.org
> *Cc:* Mike Hoyland <mh at iucr.org>
> *Subject:* Purely calculated structural data in CIF
>
>
> Dear All,
>
> We are currently working on improving the checkCIF handling of powder
> diffraction CIFs, and have coincidentally fallen across an issue with
> handling purely calculated structural data, e.g. by DFT calculation. So far
> we have relied on finding the use of "DFT" within various datanames, e.g.
>
> _computing_structure_solution
> _diffrn_measurement_device_type
>
> There is no guarantee of course that it would be present in this form.
>
> Therefore, I would like to ask if anyone has any thoughts about how we
> would be able to simply identify or mark a particular structural datablock
> as containing calculated rather than experimental data.
>
> With thanks for any thoughts or suggestions,
>
> Mike Hoyland
> Systems Developer
> IUCr, Chester
>


-- 
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology,
room C521, Saulėtekio al. 7,
LT-10257 Vilnius, Lithuania