[Cif2-encoding] Splitting of imgCIF and other sub-topics

James Hester jamesrhester at gmail.com
Fri Sep 10 08:51:56 BST 2010


Thanks Herbert for this detailed information, which is a great help to
me in forming an opinion.  Please understand that we are not even
close to considering excluding imgCIF from CIF.  Rather, I am
collecting information in order to form an opinion and work with
everybody to find a solution which then goes back to the DDLm group
and then on to COMCIFS regarding CIF2.  Speculation about potential
consequences for imgCIF is just part of the information-gathering
process.  In general terms, CIF is now a 'framework', which I think
will make bringing XML and HDF5 developments under the CIF umbrella
relatively simple.

Please also understand that my comments about the usefulness of CBFlib
were in the context of a typical beamline user wishing to handle their
data, rather than from a programmer's point of view.  I was not
casting aspersions on CBFlib, rather seeking more information (which
you have provided).

I am afraid that terminology here may be confusing me: I would like to
talk about imgCIF as a pure ASCII format (e.g. IT Vol G p 40 para 15)
and CBF as the binary equivalent.  However, your previous statements
indicate that imgCIF could also be written in UTF16 encoding.  So:
when you speak of the Dectris detector output as 'imgCIF', what
encoding is used?

The point you make about embedding imgCIF into a text-only format (in
this case XML) is, I agree, a use-case that we have to consider.  I
see merit in the position that 'CIF2 content' inside a container is
not constrained by encoding, in those cases where the container is
able to specify the encoding itself.  This is *pedantically* true
already in that the 'header' of the container file as a whole is *not*
the CIF2 magic header.  So: what does everyone think of the following
statement being included in the standard?

"Note that a CIF2-conformant character stream that forms part of a
larger stream is not constrained to be in UTF8 encoding if the
encoding of the CIF2 stream is specified in a standards-conformant
manner within the enclosing stream.  For example, CIF2 content within
an XML file is not constrained to be UTF8-encoded as standard XML
attributes can be used to manage encoding."
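
A minimal sketch of the container case described in the proposed statement,
assuming the "standard XML attributes" are taken to mean the encoding
declaration at the top of the XML file.  The element names `container` and
`cif2` and the data names in the CIF fragment are hypothetical, chosen only
for illustration; Python's standard `xml.etree.ElementTree` module is used
to write the container in UTF-16 and read the CIF2 text back unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIF2 fragment containing a non-ASCII character.
cif2_text = "data_example\n_cell.length_a 5.431\n_chemical.name_common 'α-quartz'\n"

# Hypothetical container layout: one <cif2> element holding the CIF2 text.
root = ET.Element("container")
ET.SubElement(root, "cif2").text = cif2_text

# Serialise the whole container as UTF-16; the XML declaration that is
# written at the start of the output records the encoding.
xml_bytes = ET.tostring(root, encoding="utf-16", xml_declaration=True)

# A reader relies on the container's own declaration (and byte-order mark),
# not on any CIF2 rule, to decode the embedded text.
recovered = ET.fromstring(xml_bytes).find("cif2").text

print(recovered == cif2_text)
```

In this arrangement the embedded CIF2 text never needs a UTF8 byte
representation of its own, since the enclosing XML stream identifies the
encoding.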

(Perhaps John B, who has shown superior wordsmithing capabilities,
could polish this up a bit?)

On Fri, Sep 3, 2010 at 11:10 PM, Herbert J. Bernstein
<yaya at bernstein-plus-sons.com> wrote:
> Here is more detail on the use of CBFlib.
>
> I know for sure that CBFlib is used directly by mosflm and adxv.  While XDS
> uses code that was prototyped in the Fortran part of CBFlib, they work with
> their own versions.  However, Kay Diederichs has also used the CBFlib C code
> for work on simulations.  Paul Ellis started HKL2000 off with CBFlib, but I
> don't know if they stayed with it.
>
> As a practical matter, whether or not someone uses CBFlib itself, it is an
> essential part of the documentation that people use to understand how the
> various compression schemes work, and they use the utility cif2cbf from the
> package both as an external converter and as a validator and as a debugger
> when they don't want to put all the functionality in their own code.  If you
> have a funny CBF in any of the semi-infinite number of representations,
> cif2cbf allows you to check it, get a hex dump of it or convert it to a
> specific compression scheme or format that some other program needs to
> process that file.
>
> In other words, CBFlib on its own _is_ useful.
>
> Sorry about not giving you a list re imgCIF use, I thought you were asking
> me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M
> produces imgCIF as the default.  This had been a byte-offset compressed
> binary with a mini-header.  Dectris has now moved up to writing a full
> header.  There were some beamlines with some of the older smaller Dectris
> detectors that were producing TIFF, but all currently delivered Dectris
> detectors of all sizes produce imgCIF as the default.
>
> All the major detector manufacturers now offer CBF as an option except for
> Bruker which is debugging an optional CBF output.  When I checked at the ACA
> meeting in July they all also said that their processing packages can accept
> CBF as an input.
>
> On the XML use, I would suggest a more broad-minded attitude. Judging from
> the workshop I was at in January at ESRF, it has much broader support than
> just from Diamond, especially for spectra which have smaller data volume
> than images. HDF5 is the most widely accepted scientific binary data format
> for the physics community, and XML is the easiest and most reliable way to
> port smaller HDF5 datasets from site to site. The problem with XML is that
> for large files such as crystallographic images ordinary straight-text XML
> produces huge, impractical files.  binutf allows for a compromise in which
> you have a true XML UCS-2 file but with the binary having only a 7%
> overhead.
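
An overhead of about 7% is what results if each 16-bit code unit of the
text carries 15 bits of binary payload.  The sketch below only illustrates
that arithmetic: the 15-in-16 packing is an assumption made for the
example, not a description of how binutf actually works, and base64 is
included purely as a familiar point of comparison.

```python
import base64


def overhead(payload_bits: int, carrier_bits: int) -> float:
    """Fractional size increase when payload_bits of data occupy carrier_bits."""
    return carrier_bits / payload_bits - 1.0


# Assumption: 15 payload bits are stored in every 16-bit code unit.
packed_16 = overhead(15, 16)

# base64 stores 6 payload bits per 8-bit character; measure it directly.
raw = bytes(range(256)) * 12          # 3072 bytes, divisible by 3 (no padding)
b64 = base64.b64encode(raw)
base64_measured = len(b64) / len(raw) - 1.0

print(f"15-in-16 packing : {packed_16:.1%}")
print(f"base64 (measured): {base64_measured:.1%}")
```

Under these assumptions the 16-bit packing costs roughly one fifteenth
extra, against one third extra for base64.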
>
> I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2
> binary sections.  If COMCIFS repeats the unfortunate decision of 1997 of
> saying that what the synchrotron community needs can't be called CIF, we'll
> just go back to calling it imgNCIF (which is an acronym for image-not-CIF),
> but we will still have to produce it for the community. In 1998 after we had
> a face-to-face discussion at a BNL workshop, that decision was reversed and
> what the synchrotron community needed was folded under the CIF umbrella, and
> imgNCIF became imgCIF.  I hope we can have discussions now to avoid the need
> for a pointless schism.
>
> Your proposal on the relationship between CIF2 and imgCIF sounds like a
> replay of the discussions we had in 1997, with CIF headers following one
> standard and binary sections following another. You can make that work, but
> it is clumsy and hard for users to work with.  It is better if we have one
> simple, comprehensible standard for the files they work with as a whole.
>
> Let me be clear -- imgCIF is produced worldwide and used for thousands of
> images daily.  These older "legacy" imgCIF images will be around for a long
> time to come, and whatever new imgCIF (or if you force us to it, imgNCIF)
> images we produce will need to be, and will be, supported by software that
> handles both the legacy and the new images and has a clean interface to HDF5
> and XML as well.  I would greatly prefer that this be coordinated with
> COMCIFS and done in a way that helps the community to understand the
> relationship between CIF and imgCIF, but if COMCIFS feels a need to return
> to its 1997 position and exclude the data we work with from its charge, then
> imgCIF can return to being imgNCIF.
>
> If we are to resolve this, then, as in 1998, we need a meeting or e-meeting.
>  Once you have a web-cam, I would suggest you and I have a skype meeting to
> frame the issues in dispute and organize a wider meeting.
>
> -- Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya at dowling.edu
> =====================================================
>
> On Fri, 3 Sep 2010, James Hester wrote:
>
>>> On Fri, 3 Sep 2010, James Hester wrote:
>>>
>>>> Thanks Herbert for providing the imgCIF perspective.
>>>>
>>>> I am unfortunately severely restricted in my ability to attend
>>>> overseas meetings at present, for family and work reasons.  I am also
>>>> keen to have our discussions written down and available for perusal by
>>>> those that will come later.
>>>
>>> How about an e-meeting?
>>
>> OK, I think we need to try online as my carefully crafted arguments
>> seem to be misunderstood more often than not.
>> Let me buy a web cam first!
>>
>>>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if
>>>> imgCIF is going to influence our decisionmaking.  Some questions for
>>>> Herbert to answer for the record:
>>>>
>>>> 1. How widely used are non-CBF forms of imgCIF at present?  By "widely
>>>> used" I mean both
>>>>  (a) supported by software packages that allow one to do "useful
>>>> work", most obviously to extract diffraction spots
>>>
>>> I assume by "non-CBF" you mean the forms that do the binary sections
>>> in something that is not pure binary -- all software that uses CBFlib
>>> supports them automatically for reading.  For writing, most software
>>> chooses one representation for writing, usually byte-offset or
>>> packed binary, except when we have to debug -- then the ascii
>>> forms, esp. the hexdump form are very useful.
>>
>> You are correct in interpreting what I mean by "non-CBF".
>>
>> I understand that CBFlib supports everything, but CBFlib on its own is
>> not useful. Do you know approximately what programs use CBFlib?  I
>> know only of rasmol, but you presumably know of many more.
>>
>>>>  (b) provided as an output format (even optionally) by beamlines or
>>>> detector manufacturers
>>
>>> See above
>>
>> I see nothing in your reply on the availability of imgCIF files from
>> detectors or instruments.
>>
>>>> 2. What is the advantage of having "pure text" image files?  Why isn't
>>>> a format like CBF more appropriate?
>>>
>>> While I agree, when we deal with people who like XML e.g. the NeXus
>>> form of imgCIF, then we have no choice -- no binary is allowed, so
>>> UCS-2 becomes important.  Don't ask me to defend XML.  It is simply a
>>> fact of life.
>>
>> I am guessing that this NeXuS-XML requirement is coming from Diamond,
>> and if this is what they want I can see why you are keen to integrate
>> imgCIF into HDF5, so that HDF5-XML conversion can be carried out the
>> standard HDF5 way, rather than encapsulating the entire imgCIF file as
>> a NeXuS-XML dataset.  OK: so apart from this relatively recent and
>> frankly crazy-weird use case, is there any other use-case for
>> pure-text imgCIF?  Can we regard the "Diamond" case as a
>> bureaucratically-driven kluge that will be resolved via your HDF5
>> work, leaving no other reason to create a space-efficient CIF2 version
>> of imgCIF?
>>
>>>> 3. What is the problem with a scenario where "pure text" imgCIF
>>>> remains in its current CIF1 form, and CIF2 advances are incorporated
>>>> into the CIF sections of CBF?
>>>
>>> I don't understand this question, nor the assumptions behind it.
>>
>> Let me be less obtuse:
>> I envision a CBF2 format, which is a CBF file with CIF2 instead of
>> CIF1 syntax.  A corresponding imgCIF2 format exists. We *do not care*
>> about the space-efficiency of these imgCIF2 files. We recommend that
>> all new crystallographic image-handling applications should target
>> CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant.
>> Legacy applications, of which there are very few, will be restricted
>> to the original imgCIF, which is very rarely produced in any case
>> (anticipating your answers to my above questions).
>>
>> What are your (Herbert's, anybody else's) thoughts on such a plan?
>>
>>>> Herbert: your work merging a DDL2-based version with DDLm-like
>>>> features in HDF5 format sounds interesting.  Are you planning to
>>>> present a motivation and/or discussion of this work at some stage?
>>>
>>> This is the subject of some grant applications, so not appropriate for
>>> detailed open discussion in this forum at this time.  The motivations
>>> are simple -- to satisfy the demands of several major facilities for
>>> easy integration of crystallographic synchrotron images into HDF5-based
>>> data management systems while preserving access to metadata, and to
>>> extend HDF5 with relational meta-data access.  This second aspect is an
>>> increasingly critical need and will go forward in any case.  If we have
>>> a meeting or e-meeting, I can explain better.
>>
>> OK, I think reading between the lines I see where this is coming from
>> (read your CACM article as well, BTW).  It'd be good to discuss some
>> of these plans at some stage.
>>
>>>>
>>>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein
>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>
>>>>> Dear James,
>>>>>
>>>>>  I have not been at all reticent -- imgCIF will be very poorly
>>>>> supported by CIF2 as currently proposed.  Of necessity, imgCIF changes
>>>>> encodings internally -- that is why it uses MIME -- same problem as
>>>>> email with images, same solution.
>>>>>
>>>>>  Any purely text version has at least a 7% overhead as compared to
>>>>> pure binary.  Restricting to UTF-8 increases the overhead to at least
>>>>> 50%.  We may get away with the 7% (UTF-16).  The 50% version (UTF-8)
>>>>> will be ignored by the community as unworkable.  The most likely to be
>>>>> used version will be the current DDL2-based version with embedded
>>>>> compressed binaries that I am augmenting with DDLm-like features and
>>>>> merging in with HDF5.
>>>>>
>>>>>  As I noted many months ago, the unfortunate reality is that the
>>>>> current CIF2 effort will not merge well with imgCIF.  If avoiding
>>>>> a split is important -- we need a meeting.  I would suggest
>>>>> involving Bob Sweet and holding it at BNL in conjunction with
>>>>> something relevant to NSLS-II.
>>>>>
>>>>>  Regards,
>>>>>    Herbert
>>>>>
>>>>> On Tue, 24 Aug 2010, James Hester wrote:
>>>>>
>>>>>> Hi Herbert: regarding imgCIF,  I agree that splitting it off is not a
>>>>>> desirable outcome.  I would like to get an idea of how well imgCIF can
>>>>>> be accommodated under the various encoding proposals currently
>>>>>> floating around, as you have been rather reticent to bring it up.  My
>>>>>> naive take on things is that a UTF8-only encoding scheme for CIF2
>>>>>> would not pose significant issues for imgCIF, and a decorated UTF16
>>>>>> encoding in the style of Scheme B would be even better, and quite
>>>>>> adequate, so imgCIF is not actually presenting any problems and so was
>>>>>> a red herring.
>>>>>>
>>>>>> I'm not sure that face-to-face or Skype discussions are necessarily
>>>>>> going to be more productive.  Writing things down, while slower,
>>>>>> allows me at least to collect my thoughts and those of other
>>>>>> participants, and hopefully make a reasoned contribution (my apologies
>>>>>> if I am too long-winded) and as an added bonus those thoughts are
>>>>>> recorded for later reference.  For example, where would I now find the
>>>>>> background on why a container format for imgCIF is such a bad idea?
>>>>>> Presumably that was all thrashed out in face to face discussions, and
>>>>>> no record now remains.
>>>>>>
>>>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein
>>>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>>>
>>>>>>> Dear Colleagues,
>>>>>>>
>>>>>>>   James' and John's last interchange is so voluminous, I doubt any of
>>>>>>> us has been able to fully appreciate the rich complexity of ideas
>>>>>>> contained therein.  For example, one of the suggestions far down in
>>>>>>> the text is:
>>>>>>>
>>>>>>> (James now)  Indeed.  My intent with this specification was to ensure
>>>>>>> that third parties would be able to recover the encoding. If imgCIF
>>>>>>> is going to cause us to make such an open-ended specification, it is
>>>>>>> probably a sign that imgCIF needs to be addressed separately.  For
>>>>>>> example, should we think about redefining it as a container format,
>>>>>>> with a CIF header and UTF16 body (but still part of the
>>>>>>> "Crystallographic Information Framework")?
>>>>>>>
>>>>>>> The idea of an imgCIF "header" in CIF format and an image in another
>>>>>>> is an old, well-established, thoroughly discussed, and mistaken idea,
>>>>>>> rejected in 1998.  The handling of multiple images in a single file
>>>>>>> (e.g. a jpeg thumbnail and crystal image and a full-size diffraction
>>>>>>> image) requires the ability to switch among encodings within the file
>>>>>>> -- something handled by the current DDL2 and MIME-based imgCIF format
>>>>>>> and which would be a serious problem in CIF2 as currently proposed,
>>>>>>> increasing the chances that we will have to move imgCIF entirely into
>>>>>>> HDF5 and abandon the CIF representation entirely, sharing only
>>>>>>> the dictionary and not the framework.
>>>>>>>
>>>>>>> If you look carefully, you will see a similar trend with mmCIF, in
>>>>>>> which an XML representation sharing the dictionary plays a much more
>>>>>>> important role than the CIF format.
>>>>>>>
>>>>>>> Is it really desirable to make the new CIF format so rigid and
>>>>>>> unadaptable that major portions of macromolecular crystallography
>>>>>>> end up migrating to very different formats, as they already are
>>>>>>> doing?  Yes, there is great value in having a common dictionary,
>>>>>>> but would there not be additional value in having a sufficiently
>>>>>>> flexible common format to allow for more software sharing than
>>>>>>> we now have?  Is it really desirable for us to continue in the
>>>>>>> direction of a single macromolecular experiment having to
>>>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data
>>>>>>> during collection, CCP4-style CIF representations during processing
>>>>>>> and deposition and legacy PDB and PDBML representations in subsequent
>>>>>>> community use?  If we could be a little bit more flexible, we might
>>>>>>> be able to reduce the data interchange software burdens a little.
>>>>>>> Right now, this discussion seems headed in the direction of simply
>>>>>>> adding yet another data representation (DDLm/CIF2) to the mix,
>>>>>>> increasing the chances of mistranslation and confusion, rather
>>>>>>> than reducing them.
>>>>>>>
>>>>>>> Please, step back a bit from the detailed discussion of UTF8 and
>>>>>>> look at the work-flow of doing and publishing crystallographic
>>>>>>> experiments and let us try to make a contribution that simplifies
>>>>>>> it, not one that makes it more complex than it needs to be.
>>>>>>>
>>>>>>> I suggest we need to meet and talk, either face-to-face, or by skype.
>>>>>>>
>>>>>>> Regards,
>>>>>>>   Herbert
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> cif2-encoding mailing list
>>>>>>> cif2-encoding at iucr.org
>>>>>>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>>>>>>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

