[Cif2-encoding] Splitting of imgCIF and other sub-topics

Herbert J. Bernstein yaya at bernstein-plus-sons.com
Fri Sep 10 12:25:21 BST 2010


Dear James,

> "Note that a CIF2-conformant character stream that forms part of a
> larger stream is not constrained to be in UTF8 encoding if the
> encoding of the CIF2 stream is specified in a standards-conformant
> manner within the enclosing stream.  For example, CIF2 content within
> an XML file is not constrained to be UTF8-encoded as standard XML
> attributes can be used to manage encoding."

is almost reasonable, but basically says that it will be easier to
handle CIF2 in almost any external container, rather than as itself.
I would suggest saying:

The description of a conformant CIF2 in terms of a UTF8 encoding
is intended to provide clarity in the description of a CIF2, not
to prevent use of CIF2 in terms of other encodings, such as UCS-2
Unicode or code-page-based encodings needed for editors on
particular systems, nor to prevent use of transformed CIF2 in other
containers such as HDF5 and XML or imgCIF/CBF, as long as the
decodings/encodings or other transformations that would be necessary to go
to and from a UTF8 CIF2 representation are clearly and unambiguously
defined.

This would bring us back essentially to where we have been for more than a 
decade with imgCIF/CBF and for nearly 2 decades with CIF1 itself.
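
The round-trip condition in the suggested wording can be illustrated with
a minimal Python sketch. It is an illustration only, not part of any CIF2
draft: the sample text, the helper names (to_utf8, from_utf8, round_trips)
and the choice of encodings are all assumptions made for the example.

```python
# Illustrative sketch only: the sample text and helper names are
# assumptions, not part of any CIF2 specification.
SAMPLE_CIF2 = "#\\#CIF_2.0\ndata_example\n_cell.length_a 5.4307\n_note 'α-quartz'\n"


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a named source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def from_utf8(raw_utf8: bytes, target_encoding: str) -> bytes:
    """Decode UTF-8 bytes and re-encode them in the named target encoding."""
    return raw_utf8.decode("utf-8").encode(target_encoding)


def round_trips(text: str, encoding: str) -> bool:
    """Check that text stored in `encoding` survives a trip through UTF-8."""
    original = text.encode(encoding)
    via_utf8 = to_utf8(original, encoding)
    # The UTF-8 form must equal the direct UTF-8 encoding of the text,
    # and converting back must reproduce the original bytes exactly.
    return via_utf8 == text.encode("utf-8") and from_utf8(via_utf8, encoding) == original


if __name__ == "__main__":
    for name in ("utf-16-le", "cp1253"):
        print(name, round_trips(SAMPLE_CIF2, name))
```

Here "utf-16-le" stands in for a UCS-2 style encoding and "cp1253" for a
code-page-based one. An encoding that cannot represent the text at all
raises an error rather than silently losing characters, which is the kind
of unambiguous behaviour the wording asks for.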

Regards,
   Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya at dowling.edu
=====================================================

On Fri, 10 Sep 2010, James Hester wrote:

> Thanks Herbert for this detailed information, which is a great help to
> me in forming an opinion.  Please understand that we are not even
> close to considering excluding imgCIF from CIF.  Rather, I am
> collecting information in order to form an opinion and work with
> everybody to find a solution which then goes back to the DDLm group
> and then on to COMCIFS regarding CIF2.  Speculation about potential
> consequences for imgCIF are just part of the information-gathering
> process.  In general terms, CIF is now a 'framework', which I think
> will make bringing XML and HDF5 developments under the CIF umbrella
> relatively simple.
>
> Please also understand that my comments about the usefulness of CBFlib
> were in the context of a typical beamline user wishing to handle their
> data, rather than from a programmer's point of view.  I was not
> casting aspersions on CBFlib, rather seeking more information (which
> you have provided).
>
> I am afraid that terminology here may be confusing me: I would like to
> talk about imgCIF as a pure ASCII format (eg IT Vol G p 40 para 15)
> and CBF as the binary equivalent.  However, your previous statements
> indicate that imgCIF could also be written in UTF16 encoding.  So:
> when you speak of the Dectris detector output as 'imgCIF', what
> encoding is used?
>
> The point you make about embedding imgCIF into a text-only format (in
> this case XML) is, I agree, a use-case that we have to consider.  I
> see merit in the position that 'CIF2 content' inside a container is
> not constrained by encoding, in those cases where the container is
> able to specify the encoding itself.  This is *pedantically* true
> already in that the 'header' of the container file as a whole is *not*
> the CIF2 magic header.  So: what does everyone think of the following
> statement being included in the standard?
>
> "Note that a CIF2-conformant character stream that forms part of a
> larger stream is not constrained to be in UTF8 encoding if the
> encoding of the CIF2 stream is specified in a standards-conformant
> manner within the enclosing stream.  For example, CIF2 content within
> an XML file is not constrained to be UTF8-encoded as standard XML
> attributes can be used to manage encoding."
>
> (Perhaps John B, who has shown superior wordsmithing capabilities,
> could polish this up a bit?)
>
> On Fri, Sep 3, 2010 at 11:10 PM, Herbert J. Bernstein
> <yaya at bernstein-plus-sons.com> wrote:
>> Here is more detail on the use of CBFlib.
>>
>> I know for sure that CBFlib is used directly by mosflm and adxv.  While XDS
>> uses code that was prototyped in the Fortran part of CBFlib, they work with
>> their own versions.  However, Kay Diederichs has also used the CBFlib C code
>> for work on simulations.  Paul Ellis started HKL2000 off with CBFlib, but I
>> don't know if they stayed with it.
>>
>> As a practical matter, whether or not someone uses CBFlib itself, it is an
>> essential part of the documentation that people use to understand how the
>> various compression schemes work, and they use the utility cif2cbf from the
>> package both as an external converter and as a validator and as a debugger
>> when they don't want to put all the functionality in their own code.  If you
>> have a funny CBF in any of the semi-infinite number of representations,
>> cif2cbf allows you to check it, get a hex dump of it or convert it to a
>> specific compression scheme or format that some other program needs to
>> process that file.
>>
>> In other words, CBFlib on its own _is_ useful.
>>
>> Sorry about not giving you a list re imgCIF use, I thought you were asking
>> me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M
>> produces imgCIF as the default.  This had been a byte-offset compressed
>> binary with a mini-header.  Dectris has now moved up to writing a full
>> header.  There were some beamlines with some of the older smaller Dectris
>> detectors that were producing TIFF, but all currently delivered Dectris
>> detectors of all sizes produce imgCIF as the default.
>>
>> All the major detector manufacturers now offer CBF as an option except for
>> Bruker which is debugging an optional CBF output.  When I checked at the ACA
>> meeting in July they all also said that their processing packages can accept
>> CBF as an input.
>>
>> On the XML use, I would suggest a more broad-minded attitude. Judging from
>> the workshop I was at in January at ESRF, it has much broader support than
>> just from Diamond, especially for spectra which have smaller data volume
>> than images. HDF5 is the most widely accepted scientific binary data format
>> for the physics community, and XML is the easiest and most reliable way to
>> port smaller HDF5 datasets from site to site. The problem with XML is that
>> for large files such as crystallographic images ordinary straight-text XML
>> produces huge, impractical files.  binutf allows for a compromise in which
>> you have a true XML UCS-2 file but with the binary having only a 7%
>> overhead.
>>
>> I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2
>> binary sections.  If COMCIFS repeats the unfortunate decision of 1997 of
>> saying that what the synchrotron community needs can't be called CIF, we'll
>> just go back to calling it imgNCIF (which is an acronym for image-not-CIF),
>> but we will still have to produce it for the community. In 1998 after we had
>> a face-to-face discussion at a BNL workshop, that decision was reversed and
>> what the synchrotron community needed was folded under the CIF umbrella, and
>> imgNCIF became imgCIF.  I hope we can have discussions now to avoid the need
>> for a pointless schism.
>>
>> Your proposal on the relationship between CIF2 and imgCIF sounds like a
>> replay of the discussions we had in 1997, with CIF headers following one
>> standard and binary sections following another. You can make that work, but
>> it is clumsy and hard for users to work with.  It is better if we have one
>> simple, comprehensible standard for the files they work with as a whole.
>>
>> Let me be clear -- imgCIF is produced worldwide and used for thousands of
>> images daily.  These older "legacy" imgCIF images will be around for a long
>> time to come, and whatever new imgCIF (or if you force us to it, imgNCIF)
>> images we produce will need to be, and will be, supported by software that
>> handles both the legacy and the new images and has a clean interface to HDF5
>> and XML as well.  I would greatly prefer that this be coordinated with
>> COMCIFS and done in a way that helps the community to understand the
>> relationship between CIF and imgCIF, but if COMCIFS feels a need to return
>> to its 1997 position and exclude the data we work with from its charge, then
>> imgCIF can return to being imgNCIF.
>>
>> If we are to resolve this, then, as in 1998, we need a meeting or e-meeting.
>>  Once you have a web-cam, I would suggest you and I have a skype meeting to
>> frame the issues in dispute and organize a wider meeting.
>>
>> -- Herbert
>>
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>   Dowling College, Kramer Science Center, KSC 121
>>        Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                 +1-631-244-3035
>>                 yaya at dowling.edu
>> =====================================================
>>
>> On Fri, 3 Sep 2010, James Hester wrote:
>>
>>>> On Fri, 3 Sep 2010, James Hester wrote:
>>>>
>>>>> Thanks Herbert for providing the imgCIF perspective.
>>>>>
>>>>> I am unfortunately severely restricted in my ability to attend
>>>>> overseas meetings at present, for family and work reasons.  I am also
>>>>> keen to have our discussions written down and available for perusal by
>>>>> those that will come later.
>>>>
>>>> How about an e-meeting?
>>>
>>> OK, I think we need to try online as my carefully crafted arguments
>>> seem to be misunderstood more often than not.
>>> Let me buy a web cam first!
>>>
>>>>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if
>>>>> imgCIF is going to influence our decisionmaking.  Some questions for
>>>>> Herbert to answer for the record:
>>>>>
>>>>> 1. How widely used are non-CBF forms of imgCIF at present?  By "widely
>>>>> used" I mean both
>>>>>  (a) supported by software packages that allow one to do "useful
>>>>> work", most obviously to extract diffraction spots
>>>>
>>>> I assume by "non-CBF" you mean the forms that do the binary sections
>>>> in something that is not pure binary -- all software that uses CBFlib
>>>> supports them automatically for reading.  For writing, most software
>>>> chooses one representation for writing, usually byte-offset or
>>>> packed binary, except when we have to debug -- then the ascii
>>>> forms, esp. the hexdump form are very useful.
>>>
>>> You are correct in interpreting what I mean by "non-CBF".
>>>
>>> I understand that CBFlib supports everything, but CBFlib on its own is
>>> not useful. Do you know approximately what programs use CBFlib?  I
>>> know only of rasmol, but you presumably know of many more.
>>>
>>>>>  (b) provided as an output format (even optionally) by beamlines or
>>>>> detector manufacturers
>>>
>>>> See above
>>>
>>> I see nothing in your reply on the availability of imgCIF files from
>>> detectors or instruments.
>>>
>>>>> 2. What is the advantage of having "pure text" image files?  Why isn't
>>>>> a format like CBF more appropriate?
>>>>
>>>> While I agree, when we deal with people who like XML e.g. the NeXus
>>>> form of imgCIF, then we have no choice -- no binary is allowed, so
>>>> UCS-2 becomes important.  Don't ask me to defend XML.  It is simply a
>>>> fact of life.
>>>
>>> I am guessing that this NeXuS-XML requirement is coming from Diamond,
>>> and if this is what they want I can see why you are keen to integrate
>>> imgCIF into HDF5, so that HDF5-XML conversion can be carried out the
>>> standard HDF5 way, rather than encapsulating the entire imgCIF file as
>>> a NeXuS-XML dataset.  OK: so apart from this relatively recent and
>>> frankly crazy-weird use case, is there any other use-case for
>>> pure-text imgCIF?  Can we regard the "Diamond" case as a
>>> bureaucratically-driven kluge that will be resolved via your HDF5
>>> work, leaving no other reason to create a space-efficient CIF2 version
>>> of imgCIF?
>>>
>>>>> 3. What is the problem with a scenario where "pure text" imgCIF
>>>>> remains in its current CIF1 form, and CIF2 advances are incorporated
>>>>> into the CIF sections of CBF?
>>>>
>>>> I don't understand this question, nor the assumptions behind it.
>>>
>>> Let me be less obtuse:
>>> I envision a CBF2 format, which is a CBF file with CIF2 instead of
>>> CIF1 syntax.  A corresponding imgCIF2 format exists. We *do not care*
>>> about the space-efficiency of these imgCIF2 files. We recommend that
>>> all new crystallographic image-handling applications should target
>>> CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant.
>>> Legacy applications, of which there are very few, will be restricted
>>> to the original imgCIF, which is very rarely produced in any case
>>> (anticipating your answers to my above questions).
>>>
>>> What are your (Herbert's, anybody else's) thoughts on such a plan?
>>>
>>>>> Herbert: your work merging a DDL2-based version with DDLm-like
>>>>> features in HDF5 format sounds interesting.  Are you planning to
>>>>> present a motivation and/or discussion of this work at some stage?
>>>>
>>>> This is the subject of some grant applications, so not appropriate for
>>>> detailed open discussion in this forum at this time.  The motivations
>>>> are simple -- to satisfy the demands of several major facilities for
>>>> easy integration of crystallographic synchrotron images into HDF5-based
>>>> data management systems while preserving access to metadata, and to
>>>> extend HDF5 with relational meta-data access.  This second aspect is an
>>>> increasingly critical need and will go forward in any case.  If we have
>>>> a meeting or e-meeting, I can explain better.
>>>
>>> OK, I think reading between the lines I see where this is coming from
>>> (read your CACM article as well, BTW).  It'd be good to discuss some
>>> of these plans at some stage.
>>>
>>>>>
>>>>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein
>>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>>
>>>>>> Dear James,
>>>>>>
>>>>>>  I have not been at all reticent -- imgCIF will be very poorly
>>>>>> supported by CIF2 as currently proposed.  Of necessity, imgCIF changes
>>>>>> encodings internally -- that is why it uses MIME -- same problem as
>>>>>> email with images, same solution.
>>>>>>
>>>>>>  Any purely text version has at least a 7% overhead as compared to
>>>>>> pure binary.  Restricting to UTF-8 increases the overhead to at least
>>>>>> 50%.  We may get away with the 7% (UTF-16).  The 50% version (UTF-8)
>>>>>> will be ignored by the community as unworkable.  The version most
>>>>>> likely to be used will be the current DDL2-based version with embedded
>>>>>> compressed binaries that I am augmenting with DDLm-like features
>>>>>> and merging in with HDF5.
>>>>>>
>>>>>>  As I noted many months ago, the unfortunate reality is that the
>>>>>> current CIF2 effort will not merge well with imgCIF.  If avoiding
>>>>>> a split is important -- we need a meeting.  I would suggest
>>>>>> involving Bob Sweet and holding it at BNL in conjunction with
>>>>>> something relevant to NSLS-II.
>>>>>>
>>>>>>  Regards,
>>>>>>    Herbert
>>>>>>
>>>>>> =====================================================
>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>
>>>>>>                 +1-631-244-3035
>>>>>>                 yaya at dowling.edu
>>>>>> =====================================================
>>>>>>
>>>>>> On Tue, 24 Aug 2010, James Hester wrote:
>>>>>>
>>>>>>> Hi Herbert: regarding imgCIF,  I agree that splitting it off is not a
>>>>>>> desirable outcome.  I would like to get an idea of how well imgCIF can
>>>>>>> be accommodated under the various encoding proposals currently
>>>>>>> floating around, as you have been rather reticent to bring it up.  My
>>>>>>> naive take on things is that a UTF8-only encoding scheme for CIF2
>>>>>>> would not pose significant issues for imgCIF, and a decorated UTF16
>>>>>>> encoding in the style of Scheme B would be even better, and quite
>>>>>>> adequate, so imgCIF is not actually presenting any problems and so was
>>>>>>> a red herring.
>>>>>>>
>>>>>>> I'm not sure that face-to-face or Skype discussions are necessarily
>>>>>>> going to be more productive.  Writing things down, while slower,
>>>>>>> allows me at least to collect my thoughts and those of other
>>>>>>> participants, and hopefully make a reasoned contribution (my apologies
>>>>>>> if I am too long-winded) and as an added bonus those thoughts are
>>>>>>> recorded for later reference.  For example, where would I now find the
>>>>>>> background on why a container format for imgCIF is such a bad idea?
>>>>>>> Presumably that was all thrashed out in face to face discussions, and
>>>>>>> no record now remains.
>>>>>>>
>>>>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein
>>>>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>>>>
>>>>>>>> Dear Colleagues,
>>>>>>>>
>>>>>>>>   James' and John's last interchange is so voluminous, I doubt any of
>>>>>>>> us has been able to fully appreciate the rich complexity of ideas
>>>>>>>> contained therein.  For example, one of the suggestions far down in
>>>>>>>> the text is:
>>>>>>>>
>>>>>>>> (James now)  Indeed.  My intent with this specification was to ensure
>>>>>>>> that third parties would be able to recover the encoding. If imgCIF is
>>>>>>>> going to cause us to make such an open-ended specification, it is
>>>>>>>> probably a sign that imgCIF needs to be addressed separately.  For
>>>>>>>> example, should we think about redefining it as a container format,
>>>>>>>> with a CIF header and UTF16 body (but still part of the
>>>>>>>> "Crystallographic Information Framework")?
>>>>>>>>
>>>>>>>> The idea of an imgCIF "header" in CIF format and an image in another
>>>>>>>> is an old, well-established, thoroughly discussed, and mistaken idea,
>>>>>>>> rejected in 1998.  The handling of multiple images in a single file
>>>>>>>> (e.g. a jpeg thumbnail and crystal image and a full-size diffraction
>>>>>>>> image) requires the ability to switch among encodings within the file
>>>>>>>> -- something handled by the current DDL2 and MIME-based imgCIF format
>>>>>>>> and which would be a serious problem in CIF2 as currently proposed,
>>>>>>>> increasing the chances that we will have to move imgCIF entirely into
>>>>>>>> HDF5 and abandon the CIF representation entirely, sharing only
>>>>>>>> the dictionary and not the framework.
>>>>>>>>
>>>>>>>> If you look carefully, you will see a similar trend with mmCIF, in
>>>>>>>> which an XML representation sharing the dictionary plays a much more
>>>>>>>> important role than the CIF format.
>>>>>>>>
>>>>>>>> Is it really desirable to make the new CIF format so rigid and
>>>>>>>> unadaptable that major portions of macromolecular crystallography
>>>>>>>> end up migrating to very different formats, as they already are
>>>>>>>> doing?  Yes, there is great value in having a common dictionary,
>>>>>>>> but would there not be additional value in having a sufficiently
>>>>>>>> flexible common format to allow for more software sharing than
>>>>>>>> we now have?  Is it really desirable for us to continue in the
>>>>>>>> direction of a single macromolecular experiment having to
>>>>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data
>>>>>>>> during collection, CCP4-style CIF representations during processing
>>>>>>>> and deposition and legacy PDB and PDBML representations in subsequent
>>>>>>>> community use?  If we could be a little bit more flexible, we might be
>>>>>>>> able to reduce the data interchange software burdens a little.
>>>>>>>> Right now, this discussion seems headed in the direction of simply
>>>>>>>> adding yet another data representation (DDLm/CIF2) to the mix,
>>>>>>>> increasing the chances of mistranslation and confusion, rather
>>>>>>>> than reducing them.
>>>>>>>>
>>>>>>>> Please, step back a bit from the detailed discussion of UTF8 and
>>>>>>>> look at the work-flow of doing and publishing crystallographic
>>>>>>>> experiments and let us try to make a contribution that simplifies
>>>>>>>> it, not one that makes it more complex than it needs to be.
>>>>>>>>
>>>>>>>> I suggest we need to meet and talk, either face-to-face, or by skype.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>   Herbert
>>>>>>>>
>>>>>>>> =====================================================
>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>
>>>>>>>>                  +1-631-244-3035
>>>>>>>>                  yaya at dowling.edu
>>>>>>>> =====================================================
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> cif2-encoding mailing list
>>>>>>>> cif2-encoding at iucr.org
>>>>>>>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> T +61 (02) 9717 9907
>>>>>>> F +61 (02) 9717 3145
>>>>>>> M +61 (04) 0249 4148

