[Cif2-encoding] Splitting of imgCIF and other sub-topics
James Hester
jamesrhester at gmail.com
Fri Sep 10 08:51:56 BST 2010
Thanks Herbert for this detailed information, which is a great help to
me in forming an opinion. Please understand that we are not even
close to considering excluding imgCIF from CIF. Rather, I am
collecting information in order to form an opinion and work with
everybody to find a solution which then goes back to the DDLm group
and then on to COMCIFS regarding CIF2. Speculation about potential
consequences for imgCIF is just part of the information-gathering
process. In general terms, CIF is now a 'framework', which I think
will make bringing XML and HDF5 developments under the CIF umbrella
relatively simple.
Please also understand that my comments about the usefulness of CBFlib
were in the context of a typical beamline user wishing to handle their
data, rather than from a programmer's point of view. I was not
casting aspersions on CBFlib, rather seeking more information (which
you have provided).
I am afraid that terminology here may be confusing me: I would like to
talk about imgCIF as a pure ASCII format (e.g. IT Vol G p 40 para 15)
and CBF as the binary equivalent. However, your previous statements
indicate that imgCIF could also be written in UTF16 encoding. So:
when you speak of the Dectris detector output as 'imgCIF', what
encoding is used?
The point you make about embedding imgCIF into a text-only format (in
this case XML) is, I agree, a use-case that we have to consider. I
see merit in the position that 'CIF2 content' inside a container is
not constrained by encoding, in those cases where the container is
able to specify the encoding itself. This is *pedantically* true
already in that the 'header' of the container file as a whole is *not*
the CIF2 magic header. So: what does everyone think of the following
statement being included in the standard?
"Note that a CIF2-conformant character stream that forms part of a
larger stream is not constrained to be in UTF8 encoding if the
encoding of the CIF2 stream is specified in a standards-conformant
manner within the enclosing stream. For example, CIF2 content within
an XML file is not constrained to be UTF8-encoded as standard XML
attributes can be used to manage encoding."
(Perhaps John B, who has shown superior wordsmithing capabilities,
could polish this up a bit?)
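
To make the proposed note concrete, here is a minimal Python sketch. It is an
illustration only, not part of any CIF or imgCIF specification. It assumes a
hypothetical XML container with made-up element names `container` and `cif2`,
and uses only Python's standard xml.etree.ElementTree module. The enclosing
XML stream announces its own encoding (byte-order mark plus XML declaration),
so the embedded CIF2 text comes back unchanged whether the container is
written as UTF-8 or UTF-16.

```python
import xml.etree.ElementTree as ET

# Tiny illustrative CIF2 fragment; the data name is made up for this sketch.
CIF2_TEXT = "#\\#CIF_2.0\ndata_example\n_example.value 'caf\u00e9'\n"


def wrap_in_xml(cif_text: str, encoding: str) -> bytes:
    """Embed CIF2 text in a hypothetical XML container serialised in `encoding`."""
    root = ET.Element("container")               # hypothetical element name
    ET.SubElement(root, "cif2").text = cif_text  # hypothetical element name
    return ET.tostring(root, encoding=encoding)


def extract_cif(xml_bytes: bytes) -> str:
    """Let the XML parser determine the encoding, then return the CIF2 text."""
    return ET.fromstring(xml_bytes).find("cif2").text


# The same CIF2 character stream carried by two differently encoded containers.
utf8_stream = wrap_in_xml(CIF2_TEXT, "utf-8")
utf16_stream = wrap_in_xml(CIF2_TEXT, "utf-16")
```

Calling extract_cif() on either stream should give back the original text,
even though the two byte streams differ. In other words, the container and
not the CIF2 content carries the encoding information.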
On Fri, Sep 3, 2010 at 11:10 PM, Herbert J. Bernstein
<yaya at bernstein-plus-sons.com> wrote:
> Here is more detail on the use of CBFlib.
>
> I know for sure that CBFlib is used directly by mosflm and adxv. While XDS
> uses code that was prototyped in the Fortran part of CBFlib, they work with
> their own versions. However, Kay Diederichs has also used the CBFlib C code
> for work on simulations. Paul Ellis started HKL2000 off with CBFlib, but I
> don't know if they stayed with it.
>
> As a practical matter, whether or not someone uses CBFlib itself, it is an
> essential part of the documentation that people use to understand how the
> various compression schemes work, and they use the utility cif2cbf from the
> package as an external converter, as a validator and as a debugger
> when they don't want to put all the functionality in their own code. If you
> have a funny CBF in any of the semi-infinite number of representations,
> cif2cbf allows you to check it, get a hex dump of it or convert it to a
> specific compression scheme or format that some other program needs to
> process that file.
>
> In other words, CBFlib on its own _is_ useful.
>
> Sorry about not giving you a list re imgCIF use, I thought you were asking
> me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M
> produces imgCIF as the default. This had been a byte-offset compressed
> binary with a mini-header. Dectris has now moved up to writing a full
> header. There were some beamlines with some of the older smaller Dectris
> detectors that were producing TIFF, but all currently delivered Dectris
> detectors of all sizes produce imgCIF as the default.
>
> All the major detector manufacturers now offer CBF as an option except for
> Bruker which is debugging an optional CBF output. When I checked at the ACA
> meeting in July they all also said that their processing packages can accept
> CBF as an input.
>
> On the XML use, I would suggest a more broad-minded attitude. Judging from
> the workshop I was at in January at ESRF, it has much broader support than
> just from Diamond, especially for spectra which have smaller data volume
> than images. HDF5 is the most widely accepted scientific binary data format
> for the physics community, and XML is the easiest and most reliable way to
> port smaller HDF5 datasets from site to site. The problem with XML is that
> for large files such as crystallographic images ordinary straight-text XML
> produces huge, impractical files. binutf allows for a compromise in which
> you have a true XML UCS-2 file but with the binary having only a 7%
> overhead.
>
> I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2
> binary sections. If COMCIFS repeats the unfortunate decision of 1997 of
> saying that what the synchrotron community needs can't be called CIF, we'll
> just go back to calling it imgNCIF (which is an acronym for image-not-CIF),
> but we will still have to produce it for the community. In 1998 after we had
> a face-to-face discussion at a BNL workshop, that decision was reversed and
> what the synchrotron community needed was folded under the CIF umbrella, and
> imgNCIF became imgCIF. I hope we can have discussions now to avoid the need
> for a pointless schism.
>
> Your proposal on the relationship between CIF2 and imgCIF sounds like a
> replay of the discussions we had in 1997, with CIF headers following one
> standard and binary sections following another. You can make that work, but
> it is clumsy and hard for users to work with. It is better if we have one
> simple, comprehensible standard for the files they work with as a whole.
>
> Let me be clear -- imgCIF is produced worldwide and used for thousands of
> images daily. These older "legacy" imgCIF images will be around for a long
> time to come, and whatever new imgCIF (or if you force us to it, imgNCIF)
> images we produce will need to be, and will be, supported by software that
> handles both the legacy and the new images and has a clean interface to HDF5
> and XML as well. I would greatly prefer that this be coordinated with
> COMCIFS and done in a way that helps the community to understand the
> relationship between CIF and imgCIF, but if COMCIFS feels a need to return
> to its 1997 position and exclude the data we work with from its charge, then
> imgCIF can return to being imgNCIF.
>
> If we are to resolve this, then, as in 1998, we need a meeting or e-meeting.
> Once you have a web-cam, I would suggest you and I have a Skype meeting to
> frame the issues in dispute and organize a wider meeting.
>
> -- Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya at dowling.edu
> =====================================================
>
> On Fri, 3 Sep 2010, James Hester wrote:
>
>>> On Fri, 3 Sep 2010, James Hester wrote:
>>>
>>>> Thanks Herbert for providing the imgCIF perspective.
>>>>
>>>> I am unfortunately severely restricted in my ability to attend
>>>> overseas meetings at present, for family and work reasons. I am also
>>>> keen to have our discussions written down and available for perusal by
>>>> those that will come later.
>>>
>>> How about an e-meeting?
>>
>> OK, I think we need to try online as my carefully crafted arguments
>> seem to be misunderstood more often than not.
>> Let me buy a web cam first!
>>
>>>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if
>>>> imgCIF is going to influence our decision-making. Some questions for
>>>> Herbert to answer for the record:
>>>>
>>>> 1. How widely used are non-CBF forms of imgCIF at present? By "widely
>>>> used" I mean both
>>>> (a) supported by software packages that allow one to do "useful
>>>> work", most obviously to extract diffraction spots
>>>
>>> I assume by "non-CBF" you mean the forms that do the binary sections
>>> in something that is not pure binary -- all software that uses CBFlib
>>> supports them automatically for reading. For writing, most software
>>> chooses one representation for writing, usually byte-offset or
>>> packed binary, except when we have to debug -- then the ascii
>>> forms, esp. the hexdump form, are very useful.
>>
>> You are correct in interpreting what I mean by "non-CBF".
>>
>> I understand that CBFlib supports everything, but CBFlib on its own is
>> not useful. Do you know approximately what programs use CBFlib? I
>> know only of rasmol, but you presumably know of many more.
>>
>>>> (b) provided as an output format (even optionally) by beamlines or
>>>> detector manufacturers
>>
>>> See above
>>
>> I see nothing in your reply on the availability of imgCIF files from
>> detectors or instruments.
>>
>>>> 2. What is the advantage of having "pure text" image files? Why isn't
>>>> a format like CBF more appropriate?
>>>
>>> While I agree, when we deal with people who like XML e.g. the NeXus
>>> form of imgCIF, then we have no choice -- no binary is allowed, so
>>> UCS-2 becomes important. Don't ask me to defend XML. It is simply a
>>> fact of life.
>>
>> I am guessing that this NeXus-XML requirement is coming from Diamond,
>> and if this is what they want I can see why you are keen to integrate
>> imgCIF into HDF5, so that HDF5-XML conversion can be carried out the
>> standard HDF5 way, rather than encapsulating the entire imgCIF file as
>> a NeXus-XML dataset. OK: so apart from this relatively recent and
>> frankly crazy-weird use case, is there any other use-case for
>> pure-text imgCIF? Can we regard the "Diamond" case as a
>> bureaucratically-driven kluge that will be resolved via your HDF5
>> work, leaving no other reason to create a space-efficient CIF2 version
>> of imgCIF?
>>
>>>> 3. What is the problem with a scenario where "pure text" imgCIF
>>>> remains in its current CIF1 form, and CIF2 advances are incorporated
>>>> into the CIF sections of CBF?
>>>
>>> I don't understand this question, nor the assumptions behind it.
>>
>> Let me be less obtuse:
>> I envision a CBF2 format, which is a CBF file with CIF2 instead of
>> CIF1 syntax. A corresponding imgCIF2 format exists. We *do not care*
>> about the space-efficiency of these imgCIF2 files. We recommend that
>> all new crystallographic image-handling applications should target
>> CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant.
>> Legacy applications, of which there are very few, will be restricted
>> to the original imgCIF, which is very rarely produced in any case
>> (anticipating your answers to my above questions).
>>
>> What are your (Herbert's, anybody else's) thoughts on such a plan?
>>
>>>> Herbert: your work merging a DDL2-based version with DDLm-like
>>>> features in HDF5 format sounds interesting. Are you planning to
>>>> present a motivation and/or discussion of this work at some stage?
>>>
>>> This is the subject of some grant applications, so not appropriate for
>>> detailed open discussion in this forum at this time. The motivations
>>> are simple -- to satisfy the demands of several major facilities for
>>> easy integration of crystallographic synchrotron images into HDF5-based
>>> data management systems while preserving access to metadata, and to
>>> extend HDF5 with relational meta-data access. This second aspect is an
>>> increasingly critical need and will go forward in any case. If we have
>>> a meeting or e-meeting, I can explain better.
>>
>> OK, I think reading between the lines I see where this is coming from
>> (read your CACM article as well, BTW). It'd be good to discuss some
>> of these plans at some stage.
>>
>>>>
>>>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein
>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>
>>>>> Dear James,
>>>>>
>>>>> I have not been at all reticent -- imgCIF will be very poorly
>>>>> supported by CIF2 as currently proposed. Of necessity, imgCIF changes
>>>>> encodings internally -- that is why it uses MIME -- same problem as
>>>>> email with images, same solution.
>>>>>
>>>>> Any purely text version has at least a 7% overhead as compared to
>>>>> pure binary. Restricting to UTF-8 increases the overhead to at least
>>>>> 50%. We may get away with the 7% (UTF-16). The 50% version (UTF-8)
>>>>> will be ignored by the community as unworkable. The most likely to be
>>>>> used version will be the current DDL2-based version with embedded
>>>>> compressed binaries that I am augmenting with DDLm-like features
>>>>> and merging in with HDF5.
>>>>>
>>>>> As I noted many months ago, the unfortunate reality is that the
>>>>> current CIF2 effort will not merge well with imgCIF. If avoiding
>>>>> a split is important -- we need a meeting. I would suggest
>>>>> involving Bob Sweet and holding it at BNL in conjunction with
>>>>> something relevant to NSLS-II.
>>>>>
>>>>> Regards,
>>>>> Herbert
>>>>>
>>>>> On Tue, 24 Aug 2010, James Hester wrote:
>>>>>
>>>>>> Hi Herbert: regarding imgCIF, I agree that splitting it off is not a
>>>>>> desirable outcome. I would like to get an idea of how well imgCIF can
>>>>>> be accommodated under the various encoding proposals currently
>>>>>> floating around, as you have been rather reticent to bring it up. My
>>>>>> naive take on things is that a UTF8-only encoding scheme for CIF2
>>>>>> would not pose significant issues for imgCIF, and a decorated UTF16
>>>>>> encoding in the style of Scheme B would be even better, and quite
>>>>>> adequate, so imgCIF is not actually presenting any problems and so was
>>>>>> a red herring.
>>>>>>
>>>>>> I'm not sure that face-to-face or Skype discussions are necessarily
>>>>>> going to be more productive. Writing things down, while slower,
>>>>>> allows me at least to collect my thoughts and those of other
>>>>>> participants, and hopefully make a reasoned contribution (my apologies
>>>>>> if I am too long-winded) and as an added bonus those thoughts are
>>>>>> recorded for later reference. For example, where would I now find the
>>>>>> background on why a container format for imgCIF is such a bad idea?
>>>>>> Presumably that was all thrashed out in face to face discussions, and
>>>>>> no record now remains.
>>>>>>
>>>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein
>>>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>>>
>>>>>>> Dear Colleagues,
>>>>>>>
>>>>>>> James' and John's last interchange is so voluminous, I doubt any of
>>>>>>> us has been able to fully appreciate the rich complexity of ideas
>>>>>>> contained therein. For example, one of the suggestions far down in
>>>>>>> the text is:
>>>>>>>
>>>>>>> (James now) Indeed. My intent with this specification was to ensure
>>>>>>> that third parties would be able to recover the encoding. If imgCIF
>>>>>>> is going to cause us to make such an open-ended specification, it is
>>>>>>> probably a sign that imgCIF needs to be addressed separately. For
>>>>>>> example, should we think about redefining it as a container format,
>>>>>>> with a CIF header and UTF16 body (but still part of the
>>>>>>> "Crystallographic Information Framework")?
>>>>>>>
>>>>>>> The idea of an imgCIF "header" in CIF format and an image in another
>>>>>>> is an old, well-established, thoroughly discussed, and mistaken idea,
>>>>>>> rejected in 1998. The handling of multiple images in a single file
>>>>>>> (e.g. a jpeg thumbnail and crystal image and a full-size diffraction
>>>>>>> image) requires the ability to switch among encodings within the
>>>>>>> file -- something handled by the current DDL2 and MIME-based imgCIF
>>>>>>> format and which would be a serious problem in CIF2 as currently
>>>>>>> proposed, increasing the chances that we will have to move imgCIF
>>>>>>> entirely into HDF5 and abandon the CIF representation entirely,
>>>>>>> sharing only the dictionary and not the framework.
>>>>>>>
>>>>>>> If you look carefully, you will see a similar trend with mmCIF, in
>>>>>>> which an XML representation sharing the dictionary plays a much more
>>>>>>> important role than the CIF format.
>>>>>>>
>>>>>>> Is it really desirable to make the new CIF format so rigid and
>>>>>>> unadaptable that major portions of macromolecular crystallography
>>>>>>> end up migrating to very different formats, as they already are
>>>>>>> doing? Yes, there is great value in having a common dictionary,
>>>>>>> but would there not be additional value in having a sufficiently
>>>>>>> flexible common format to allow for more software sharing than
>>>>>>> we now have? Is it really desirable for us to continue in the
>>>>>>> direction of a single macromolecular experiment having to
>>>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data
>>>>>>> during collection, CCP4-style CIF representations during processing
>>>>>>> and deposition and legacy PDB and PDBML representations in subsequent
>>>>>>> community use? If we could be a little bit more flexible, we might
>>>>>>> be able to reduce the data interchange software burdens a little.
>>>>>>> Right now, this discussion seems headed in the direction of simply
>>>>>>> adding yet another data representation (DDLm/CIF2) to the mix,
>>>>>>> increasing the chances of mistranslation and confusion, rather
>>>>>>> than reducing them.
>>>>>>>
>>>>>>> Please, step back a bit from the detailed discussion of UTF8 and
>>>>>>> look at the work-flow of doing and publishing crystallographic
>>>>>>> experiments and let us try to make a contribution that simplifies
>>>>>>> it, not one that makes it more complex than it needs to be.
>>>>>>>
>>>>>>> I suggest we need to meet and talk, either face-to-face, or by Skype.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Herbert
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> cif2-encoding mailing list
>>>>>>> cif2-encoding at iucr.org
>>>>>>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148