_database.dataset_doi - any problems if this might be a DOI for raw data?

Mon Jun 6 07:23:51 BST 2022

Hi Natalie,

Good to know that we wouldn't be stepping on database toes by using
_database.dataset_doi. I understand where Ian is coming from, and I for one
agree that we should move away from using "DOI" in data names. Might it be
sufficient to say "URI" and let the URI standard cover the various
identifiers in use, rather than trying to maintain a complete list of the
different identifier types?
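
As a rough illustration of the URI idea, identifiers of different types
could be reduced to plain URIs along the following lines. This is only a
sketch: the function name, the table of resolver prefixes and the example
DOI are assumptions made for the example, and nothing here is defined in a
CIF dictionary.

```python
# Hypothetical sketch: express typed identifiers as plain URIs.
# RESOLVER_PREFIXES and to_uri are illustrative names, not part of CIF.
from urllib.parse import urlsplit

# Resolver prefixes for identifier types that are not URIs on their own.
RESOLVER_PREFIXES = {
    "DOI": "https://doi.org/",
    "Handle": "https://hdl.handle.net/",
}


def to_uri(identifier: str, identifier_type: str) -> str:
    """Return a URI for an identifier of the given type."""
    identifier = identifier.strip()
    # Anything that already carries a URI scheme (https:, urn:, ark:, ...)
    # is passed through unchanged.
    if urlsplit(identifier).scheme:
        return identifier
    try:
        return RESOLVER_PREFIXES[identifier_type] + identifier
    except KeyError:
        raise ValueError(
            f"no URI mapping known for identifier type {identifier_type!r}"
        ) from None


# "10.1234/example-raw-data" is a made-up DOI used only for illustration.
print(to_uri("10.1234/example-raw-data", "DOI"))
print(to_uri("https://example.org/raw/1", "URL"))
```

With this approach a single data name holding a URI would be enough, and
the list of identifier types would only matter to software that converts
bare identifiers into URIs.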

The bigger problem from my point of view is that simply pointing to a
dataset doesn't explain how it slots into the CIF schema. What we have
recently done in imgCIF is to point to external images, and explain how the
contents that are pointed to should be incorporated into the CIF schema, so
that it is effectively as if those images had appeared within the CIF file.
This essentially means listing where values for data names defined in CIF
are found in the external file.
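
To make "listing where values are found" a little more concrete, here is a
minimal sketch. It assumes the external file is hierarchical (a nested dict
stands in for something like an HDF5 file), and the paths, the values and
the second data name are invented for the example; this is not how imgCIF
actually encodes the information.

```python
# Hypothetical sketch: a table stating where CIF data names are found in an
# external file. The nested dict is a stand-in for a hierarchical file.
from typing import Any, Dict, Mapping

external_file = {
    "entry": {
        "instrument": {
            "wavelength": 0.7107,
            "detector": {"distance_mm": 120.0},
        }
    }
}

# CIF data name -> path of its value inside the external file.
# The second data name is a placeholder, not a real dictionary entry.
LOCATIONS = {
    "_diffrn_radiation_wavelength.wavelength": "entry/instrument/wavelength",
    "_example.detector_distance": "entry/instrument/detector/distance_mm",
}


def lookup(tree: Mapping[str, Any], path: str) -> Any:
    """Follow a '/'-separated path down a nested mapping."""
    node: Any = tree
    for key in path.split("/"):
        node = node[key]
    return node


def as_cif_items(tree: Mapping[str, Any],
                 locations: Mapping[str, str]) -> Dict[str, Any]:
    """Collect the values of all listed data names from the external file."""
    return {name: lookup(tree, path) for name, path in locations.items()}


for name, value in as_cif_items(external_file, LOCATIONS).items():
    print(name, value)
```

Once such a table exists, software can treat the external values as if
they had been written in the CIF file under those data names.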

all the best,
James.

On Thu, 2 Jun 2022 at 18:31, Natalie Johnson <njohnson at ccdc.cam.ac.uk>
wrote:

> Hi all,
>
>
>
> The CCDC currently does not use this attribute at all. For CCDC DOIs (a
> DOI to the data entry in the CSD), audit_block_doi is used. I have looked
> in the CSD CIF Archive and found no CIFs containing information in this
> field (up to v5.43 release, October 2021).
>
>
>
> I have discussed this with colleagues at CCDC, in particular Ian Bruno,
> and he has offered this further comment:
>
>
>
>
> *DOI is not the only type of persistent identifier and other types of
> persistent identifier may be associated with raw data. To accommodate this,
> COMCIFs might want to consider, for example, _database.dataset_identifier
> which is accompanied by _database.dataset_identifier_type. _identifier_type
> could be a controlled vocabulary or free text. The DataCite schema
> specification recognises the following types: ARK, arXiv, bibcode, DOI,
> EAN13, EISSN, Handle, IGSN, ISBN, ISSN, ISTC, LISSN, LSID, PMID, PURL, UPC,
> URL, URN, w3id. *
>
> *Not all of these identifier types will be relevant to raw data but ARKs,
> Handles, PURLs or URLs for example could be reported. The list of types
> above is related to the Related Identifier element of the DataCite schema
> that is used to define "Identifiers of related resources" which could be
> samples, articles, or indeed other datasets. These are qualified by a
> controlled list of relation types of which there are many. CCDC uses this
> element to capture the association between structure and raw data
> identifiers using the "IsDerivedFrom" relation type. It may be overkill for
> CIF to replicate such a general solution but perhaps there could be a more
> specific generalisation for capturing connections between different
> datasets associated with a diffraction experiment. Rather than relation
> type, perhaps capture the type of the dataset so there is something like:*
>
> *_related_dataset_type*
>
> *_related_dataset_identifier*
>
> *_related_dataset_identifier_type*
>
> *The question would be what types of dataset might one envisage. There
> could be an argument for capturing relation type also if dataset_type isn't
> enough to infer the relationship. The current DataCite schema can be found
> at
> https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf.*
>
>
>
>
>
> *From:* coreDMG <coredmg-bounces at iucr.org> *On Behalf Of *James H via
> coreDMG
> *Sent:* 01 June 2022 04:39
> *To:* Horst Puschmann <horst.puschmann at gmail.com>
> *Cc:* James H <jamesrhester at gmail.com>; Forum for CIF software developers
> <cif-developers at iucr.org>; Distribution list of the IUCr COMCIFS Core
> Dictionary Maintenance Group <coredmg at iucr.org>
> *Subject:* Re: _database.dataset_doi - any problems if this might be a
> DOI for raw data?
>
>
>
> Thanks for the feedback, Horst. It was a bit broader than I needed, but
> you raise some interesting points.
>
>
>
> We are making slow progress in CIF towards having all forms of data
> properly described and incorporated into the CIF "scheme", regardless of
> the format that they are preserved in.  One prong of this work has been to
> ensure that the DDLm language used for data definition in CIF is not
> format-specific, that is, data names defined in our dictionaries describe
> data that might be stored in any format - but of course somebody still has
> to actually do the specification that says "in files of this format you can
> find CIF data item x_y_z at this location". Another prong of this work is
> to rigorously allow for multi-data-block data (this is now done). Putting
> those two prongs together means you can have some zip file or directory
> full of files in various formats, and have the relationships between the
> data stored in the various files rigorously specifiable by CIF
> dictionaries. The missing work here is the specification (and software that
> uses that specification) of where data name Y is found in format X (and
> given typical conformance issues "...at facility Z".)
>
>
>
> Another prong going in a bit of a different direction has been to define
> "pointer" data names that describe how to interpret externally-stored image
> data in CIF terms (see
> https://github.com/yayahjb/cbflib/doc/cif_img_1.8.6.dic and search for
> _array_data_external_data). This allows the bulky image data to be stored
> outside the CIF file while the CIF file itself contains the metadata. A
> couple of examples of this in action are in preparation, or you can have
> fun with https://github.com/jamesrhester/ImgCIFHandler.jl which takes
> such imgCIF files and loads the externally-located HDF5/CBF/ADSC image from
> optionally tar/gzipped archives.
>
>
>
> To circle back to your comments, DOIs that point to files that have no
> properly (in the sense of CIF data names for the contents of the files)
> defined roles are less useful than DOIs that can be resolved to an object
> that has an exact slot in a CIF-based description of a data collection. The
> latter can be transparently auto-processed with no human intervention.
> Given that DOIs resolve to landing pages they are currently not that useful
> for machine-based data processing, but perhaps DOI standards will evolve to
> allow the actual data files to be specified in a DOI.
>
>
>
> And finally, while the tsc files might be too bulky for storing in CIF
> syntax, the contents can eventually be brought into the CIF data world.
> Here are the steps:
>
> 1. Define CIF data names for the relevant quantities appearing in the
> files. These would logically go in a new dictionary, as NoSpherA2 refinement
> is different to the refinement model of the core dictionary.
>
> 2. Define where these data names are located in .tsc files
>
>
>
> Step 2 is a brave new world unfortunately, as there are no general
> machine-readable standards that I know of, or that we have developed, for
> specifying where "something" is located in a file with arbitrary format -
> if anyone reading this knows of such a thing, get in touch.  The
> _array_data_external_data work I linked to above does contain some ideas
> about how this might work in practice.
>
>
>
> anyway, thanks again for your thoughts.
>
> James.
>
>
>
> On Fri, 27 May 2022 at 22:59, Horst Puschmann <horst.puschmann at gmail.com>
> wrote:
>
> Hello James,
>
>
>
> I think providing a DOI in the CIF pointing at large amounts of data would
> be fantastic. I must say that I don't quite understand that wording here --
> what does the CSD or PDB have to do with it? Surely, all hkl data is now
> included in the CIF as standard -- but of course: it would also be
> excellent if the hkl could be handled via an 'hkl DOI'.
>
>
>
> I take 'raw data' to mean the diffraction images (frames) -- possibly
> together with a recipe for how they were processed. Having easy access to
> frames 'by default' would be fantastic.
>
>
>
> But there are other kinds of files, and some might be required to repeat
> the actual refinement. I am thinking of our own NoSpherA2 refinement, which
> requires a '.tsc' file -- a file which is clearly too large to be embedded
> in the CIF. A DOI for this kind of file would be really useful (right now,
> we just include the hkl and the exact parameters to repeat the generation
> of the '.tsc' file -- which is clearly not ideal).
>
>
>
> And then there might be a third kind of DOI -- one where people can
> deposit 'random' files -- like videos, images, descriptions of special
> setups, scripts, etc.
>
>
>
> I am not sure whether this is the sort of feedback you were looking for,
> but there you are.
>
>
>
> Greetings
>
> Horst
>
>
>
> On Fri, 27 May 2022 at 04:09, James H via coreDMG <coredmg at iucr.org>
> wrote:
>
> (cross-posted to cif-developers and core DMG, apologies for cross-posting)
>
>
>
> Dear CIF Developers and core DMG,
>
>
>
> IUCr Journals are looking at using _database.dataset_doi to indicate the
> DOI of a raw data set associated with a data block. The meaning of
> "dataset" is not clear here; for example, it might have been intended to
> refer to hkl listings.
>
>
>
> So, please give feedback on any problems your software/database might
> encounter if this DOI might resolve to a raw dataset.
>
>
>
> The current definition:
>
>
>
> " The digital object identifier (DOI) registered to identify
>     a data set publication associated with the structure
>     described in the current data block. This should be used
>     for a dataset obtained from a curated database such as
>     CSD or PDB. "
>
>
>
> thanks,
>
> James.
>
> --
>
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
> _______________________________________________
> coreDMG mailing list
> coreDMG at iucr.org
> http://mailman.iucr.org/cgi-bin/mailman/listinfo/coredmg
>
>
>
>
> *Natalie Johnson*
> Data Integrity Research Scientist
>
