[Cif2-encoding] Splitting of imgCIF and other sub-topics
Herbert J. Bernstein
yaya at bernstein-plus-sons.com
Fri Sep 10 12:25:21 BST 2010
Dear James,
> "Note that a CIF2-conformant character stream that forms part of a
> larger stream is not constrained to be in UTF8 encoding if the
> encoding of the CIF2 stream is specified in a standards-conformant
> manner within the enclosing stream. For example, CIF2 content within
> an XML file is not constrained to be UTF8-encoded as standard XML
> attributes can be used to manage encoding."
is almost reasonable, but basically says that it will be easier to
handle CIF2 in almost any external container, rather than as itself.
I would suggest saying:
The description of a conformant CIF2 in terms of a UTF8 encoding
is intended to provide clarity in the description of a CIF2, not
to prevent use of CIF2 in terms of other encodings, such as UCS-2
Unicode or code-page-based encodings needed for editors on
particular systems, nor to prevent use of transformed CIF2 in other
containers such as HDF5 and XML or imgCIF/CBF, as long as the
decodings/encodings or other transformations that would be necessary to go
to and from a UTF8 CIF2 representation are clearly and unambiguously
defined.
This would bring us back essentially to where we have been for more than a
decade with imgCIF/CBF and for nearly 2 decades with CIF1 itself.
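To make the "clearly and unambiguously defined" condition concrete, here is a
minimal sketch in Python of one way such a check might look. It is only an
illustration under stated assumptions: the sample CIF2 text and the helper
names (to_encoding, to_utf8, round_trips) are hypothetical and come from
neither the CIF2 draft nor CBFlib. The idea is that an alternative encoding
is acceptable for a given file only if converting the UTF8 reference form to
that encoding and back reproduces the original bytes exactly.

```python
# Hypothetical sample content; only the "#\#CIF_2.0" magic code is meant
# to resemble a real CIF2 file, the data items are made up.
CIF2_TEXT = "#\\#CIF_2.0\ndata_example\n_cell.length_a 5.432\n_note 'α-quartz'\n"


def to_encoding(utf8_bytes: bytes, encoding: str) -> bytes:
    """Transcode the UTF8 reference form into another encoding."""
    return utf8_bytes.decode("utf-8").encode(encoding)


def to_utf8(data: bytes, encoding: str) -> bytes:
    """Transcode from another encoding back to the UTF8 reference form."""
    return data.decode(encoding).encode("utf-8")


def round_trips(utf8_bytes: bytes, encoding: str) -> bool:
    """Return True if UTF8 -> encoding -> UTF8 reproduces the bytes exactly."""
    try:
        return to_utf8(to_encoding(utf8_bytes, encoding), encoding) == utf8_bytes
    except UnicodeError:
        # The target encoding cannot represent some character in the file.
        return False


reference = CIF2_TEXT.encode("utf-8")

if __name__ == "__main__":
    for name in ("utf-16-le", "cp1253", "latin-1"):
        print(name, round_trips(reference, name))
```

With this sample text, a UTF-16 or Greek code-page representation survives the
round trip, while Latin-1 does not because it has no code point for the Greek
letter. A real check would also have to cover binary sections and any
container-specific transformations, which this sketch ignores.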
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya at dowling.edu
=====================================================
On Fri, 10 Sep 2010, James Hester wrote:
> Thanks Herbert for this detailed information, which is a great help to
> me in forming an opinion. Please understand that we are not even
> close to considering excluding imgCIF from CIF. Rather, I am
> collecting information in order to form an opinion and work with
> everybody to find a solution which then goes back to the DDLm group
> and then on to COMCIFS regarding CIF2. Speculation about potential
> consequences for imgCIF are just part of the information-gathering
> process. In general terms, CIF is now a 'framework', which I think
> will make bringing XML and HDF5 developments under the CIF umbrella
> relatively simple.
>
> Please also understand that my comments about the usefulness of CBFlib
> were in the context of a typical beamline user wishing to handle their
> data, rather than from a programmer's point of view. I was not
> casting aspersions on CBFlib, rather seeking more information (which
> you have provided).
>
> I am afraid that terminology here may be confusing me: I would like to
> talk about imgCIF as a pure ASCII format (eg IT Vol G p 40 para 15)
> and CBF as the binary equivalent. However, your previous statements
> indicate that imgCIF could also be written in UTF16 encoding. So:
> when you speak of the Dectris detector output as 'imgCIF', what
> encoding is used?
>
> The point you make about embedding imgCIF into a text-only format (in
> this case XML) is, I agree, a use-case that we have to consider. I
> see merit in the position that 'CIF2 content' inside a container is
> not constrained by encoding, in those cases where the container is
> able to specify the encoding itself. This is *pedantically* true
> already in that the 'header' of the container file as a whole is *not*
> the CIF2 magic header. So: what does everyone think of the following
> statement being included in the standard?
>
> "Note that a CIF2-conformant character stream that forms part of a
> larger stream is not constrained to be in UTF8 encoding if the
> encoding of the CIF2 stream is specified in a standards-conformant
> manner within the enclosing stream. For example, CIF2 content within
> an XML file is not constrained to be UTF8-encoded as standard XML
> attributes can be used to manage encoding."
>
> (Perhaps John B, who has shown superior wordsmithing capabilities,
> could polish this up a bit?)
>
> On Fri, Sep 3, 2010 at 11:10 PM, Herbert J. Bernstein
> <yaya at bernstein-plus-sons.com> wrote:
>> Here is more detail on the use of CBFlib.
>>
>> I know for sure that CBFlib is used directly by mosflm and adxv. While XDS
>> uses code that was prototyped in the Fortran part of CBFlib, they work with
>> their own versions. However, Kay Diederichs has also used the CBFlib C code
>> for work on simulations. Paul Ellis started HKL2000 off with CBFlib, but I
>> don't know if they stayed with it.
>>
>> As a practical matter, whether or not someone uses CBFlib itself, it is an
>> essential part of the documentation that people use to understand how the
>> various compression schemes work, and they use the utility cif2cbf from the
>> package both as an external converter and as a validator and as a debugger
>> when they don't want to put all the functionality in their own code. If you
>> have a funny CBF in any of the semi-infinite number of representations,
>> cif2cbf allows you to check it, get a hex dump of it or convert it to a
>> specific compression scheme or format that some other program needs to
>> process that file.
>>
>> In other words, CBFlib on its own _is_ useful.
>>
>> Sorry about not giving you a list re imgCIF use, I thought you were asking
>> me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M
>> produces imgCIF as the default. This had been a byte-offset compressed
>> binary with a mini-header. Dectris has now moved up to writing a full
>> header. There were some beamlines with some of the older smaller Dectris
>> detectors that were producing TIFF, but all currently delivered Dectris
>> detectors of all sizes produce imgCIF as the default.
>>
>> All the major detector manufacturers now offer CBF as an option except for
>> Bruker which is debugging an optional CBF output. When I checked at the ACA
>> meeting in July they all also said that their processing packages can accept
>> CBF as an input.
>>
>> On the XML use, I would suggest a more broad-minded attitude. Judging from
>> the workshop I was at in January at ESRF, it has much broader support than
>> just from Diamond, especially for spectra which have smaller data volume
>> than images. HDF5 is the most widely accepted scientific binary data format
>> for the physics community, and XML is the easiest and most reliable way to
>> port smaller HDF5 datasets from site to site. The problem with XML is that
>> for large files such as crystallographic images ordinary straight-text XML
>> produces huge, impractical files. binutf allows for a compromise in which
>> you have a true XML UCS-2 file but with the binary having only a 7%
>> overhead.
>>
>> I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2
>> binary sections. If COMCIFS repeats the unfortunate decision of 1997 of
>> saying that what the synchrotron community needs can't be called CIF, we'll
>> just go back to calling it imgNCIF (which is an acronym for image-not-CIF),
>> but we will still have to produce it for the community. In 1998 after we had
>> a face-to-face discussion at a BNL workshop, that decision was reversed and
>> what the synchrotron community needed was folded under the CIF umbrella, and
>> imgNCIF became imgCIF. I hope we can have discussions now to avoid the need
>> for a pointless schism.
>>
>> Your proposal on the relationship between CIF2 and imgCIF sounds like a
>> replay of the discussions we had in 1997, with CIF headers following one
>> standard and binary sections following another. You can make that work, but
>> it is clumsy and hard for users to work with. It is better if we have one
>> simple, comprehensible standard for the files they work with as a whole.
>>
>> Let me be clear -- imgCIF is produced worldwide and used for thousands of
>> images daily. These older "legacy" imgCIF images will be around for a long
>> time to come, and whatever new imgCIF (or if you force us to it, imgNCIF)
>> images we produce will need to be, and will be, supported by software that
>> handles both the legacy and the new images and has a clean interface to HDF5
>> and XML as well. I would greatly prefer that this be coordinated with
>> COMCIFS and done in a way that helps the community to understand the
>> relationship between CIF and imgCIF, but if COMCIFS feels a need to return
>> to its 1997 position and exclude the data we work with from its charge, then
>> imgCIF can return to being imgNCIF.
>>
>> If we are to resolve this, then, as in 1998, we need a meeting or e-meeting.
>> Once you have a web-cam, I would suggest you and I have a skype meeting to
>> frame the issues in dispute and organize a wider meeting.
>>
>> -- Herbert
>>
>> On Fri, 3 Sep 2010, James Hester wrote:
>>
>>>> On Fri, 3 Sep 2010, James Hester wrote:
>>>>
>>>>> Thanks Herbert for providing the imgCIF perspective.
>>>>>
>>>>> I am unfortunately severely restricted in my ability to attend
>>>>> overseas meetings at present, for family and work reasons. I am also
>>>>> keen to have our discussions written down and available for perusal by
>>>>> those that will come later.
>>>>
>>>> How about an e-meeting?
>>>
>>> OK, I think we need to try online as my carefully crafted arguments
>>> seem to be misunderstood more often than not.
>>> Let me buy a web cam first!
>>>
>>>>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if
>>>>> imgCIF is going to influence our decisionmaking. Some questions for
>>>>> Herbert to answer for the record:
>>>>>
>>>>> 1. How widely used are non-CBF forms of imgCIF at present? By "widely
>>>>> used" I mean both
>>>>> (a) supported by software packages that allow one to do "useful
>>>>> work", most obviously to extract diffraction spots
>>>>
>>>> I assume by "non-CBF" you mean the forms that do the binary sections
>>>> in something that is not pure binary -- all software that uses CBFlib
>>>> supports them automatically for reading. For writing, most software
>>>> chooses one representation for writing, usually byte-offset or
>>>> packed binary, except when we have to debug -- then the ASCII
>>>> forms, esp. the hexdump form, are very useful.
>>>
>>> You are correct in interpreting what I mean by "non-CBF".
>>>
>>> I understand that CBFlib supports everything, but CBFlib on its own is
>>> not useful. Do you know approximately what programs use CBFlib? I
>>> know only of rasmol, but you presumably know of many more.
>>>
>>>>> (b) provided as an output format (even optionally) by beamlines or
>>>>> detector manufacturers
>>>
>>>> See above
>>>
>>> I see nothing in your reply on the availability of imgCIF files from
>>> detectors or instruments.
>>>
>>>>> 2. What is the advantage of having "pure text" image files? Why isn't
>>>>> a format like CBF more appropriate?
>>>>
>>>> While I agree, when we deal with people who like XML e.g. the NeXus
>>>> form of imgCIF, then we have no choice -- no binary is allowed, so
>>>> UCS-2 becomes important. Don't ask me to defend XML. It is simply a
>>>> fact of life.
>>>
>>> I am guessing that this NeXus-XML requirement is coming from Diamond,
>>> and if this is what they want I can see why you are keen to integrate
>>> imgCIF into HDF5, so that HDF5-XML conversion can be carried out the
>>> standard HDF5 way, rather than encapsulating the entire imgCIF file as
>>> a NeXus-XML dataset. OK: so apart from this relatively recent and
>>> frankly crazy-weird use case, is there any other use-case for
>>> pure-text imgCIF? Can we regard the "Diamond" case as a
>>> bureaucratically-driven kluge that will be resolved via your HDF5
>>> work, leaving no other reason to create a space-efficient CIF2 version
>>> of imgCIF?
>>>
>>>>> 3. What is the problem with a scenario where "pure text" imgCIF
>>>>> remains in its current CIF1 form, and CIF2 advances are incorporated
>>>>> into the CIF sections of CBF?
>>>>
>>>> I don't understand this question, nor the assumptions behind it.
>>>
>>> Let me be less obtuse:
>>> I envision a CBF2 format, which is a CBF file with CIF2 instead of
>>> CIF1 syntax. A corresponding imgCIF2 format exists. We *do not care*
>>> about the space-efficiency of these imgCIF2 files. We recommend that
>>> all new crystallographic image-handling applications should target
>>> CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant.
>>> Legacy applications, of which there are very few, will be restricted
>>> to the original imgCIF, which is very rarely produced in any case
>>> (anticipating your answers to my above questions).
>>>
>>> What are your (Herbert's, anybody else's) thoughts on such a plan?
>>>
>>>>> Herbert: your work merging a DDL2-based version with DDLm-like
>>>>> features in HDF5 format sounds interesting. Are you planning to
>>>>> present a motivation and/or discussion of this work at some stage?
>>>>
>>>> This is the subject of some grant applications, so not appropriate for
>>>> detailed open discussion in this forum at this time. The motivations
>>>> are simple -- to satisfy the demands of several major facilities for
>>>> easy integration of crystallographic synchrotron images into HDF5-based
>>>> data management systems while preserving access to metadata, and to
>>>> extend HDF5 with relational metadata access. This second aspect is an
>>>> increasingly critical need and will go forward in any case. If we have
>>>> a meeting or e-meeting, I can explain better.
>>>
>>> OK, I think reading between the lines I see where this is coming from
>>> (read your CACM article as well, BTW). It'd be good to discuss some
>>> of these plans at some stage.
>>>
>>>>>
>>>>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein
>>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>>
>>>>>> Dear James,
>>>>>>
>>>>>> I have not been at all reticent -- imgCIF will be very poorly
>>>>>> supported by CIF2 as currently proposed. Of necessity, imgCIF changes
>>>>>> encodings internally -- that is why it uses MIME -- same problem as
>>>>>> email with images, same solution.
>>>>>>
>>>>>> Any purely text version has at least a 7% overhead as compared to
>>>>>> pure binary. Restricting to UTF-8 increases the overhead to at least
>>>>>> 50%. We may get away with the 7% (UTF-16). The 50% version (UTF-8)
>>>>>> will be ignored by the community as unworkable. The version most
>>>>>> likely to be used will be the current DDL2-based version with embedded
>>>>>> compressed binaries that I am augmenting with DDLm-like features and
>>>>>> merging in with HDF5.
>>>>>>
>>>>>> As I noted many months ago, the unfortunate reality is that the
>>>>>> current CIF2 effort will not merge well with imgCIF. If avoiding
>>>>>> a split is important -- we need a meeting. I would suggest
>>>>>> involving Bob Sweet and holding it at BNL in conjunction with
>>>>>> something relevant to NSLS-II.
>>>>>>
>>>>>> Regards,
>>>>>> Herbert
>>>>>>
>>>>>> On Tue, 24 Aug 2010, James Hester wrote:
>>>>>>
>>>>>>> Hi Herbert: regarding imgCIF, I agree that splitting it off is not a
>>>>>>> desirable outcome. I would like to get an idea of how well imgCIF can
>>>>>>> be accommodated under the various encoding proposals currently
>>>>>>> floating around, as you have been rather reticent to bring it up. My
>>>>>>> naive take on things is that a UTF8-only encoding scheme for CIF2
>>>>>>> would not pose significant issues for imgCIF, and a decorated UTF16
>>>>>>> encoding in the style of Scheme B would be even better, and quite
>>>>>>> adequate, so imgCIF is not actually presenting any problems and so was
>>>>>>> a red herring.
>>>>>>>
>>>>>>> I'm not sure that face-to-face or Skype discussions are necessarily
>>>>>>> going to be more productive. Writing things down, while slower,
>>>>>>> allows me at least to collect my thoughts and those of other
>>>>>>> participants, and hopefully make a reasoned contribution (my apologies
>>>>>>> if I am too long-winded) and as an added bonus those thoughts are
>>>>>>> recorded for later reference. For example, where would I now find the
>>>>>>> background on why a container format for imgCIF is such a bad idea?
>>>>>>> Presumably that was all thrashed out in face to face discussions, and
>>>>>>> no record now remains.
>>>>>>>
>>>>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein
>>>>>>> <yaya at bernstein-plus-sons.com> wrote:
>>>>>>>>
>>>>>>>> Dear Colleagues,
>>>>>>>>
>>>>>>>> James' and John's last interchange is so voluminous, I doubt any of
>>>>>>>> us has been able to fully appreciate the rich complexity of ideas
>>>>>>>> contained therein. For example, one of the suggestions far down in
>>>>>>>> the text is:
>>>>>>>>
>>>>>>>> (James now) Indeed. My intent with this specification was to ensure
>>>>>>>> that third parties would be able to recover the encoding. If imgCIF
>>>>>>>> is going to cause us to make such an open-ended specification, it is
>>>>>>>> probably a sign that imgCIF needs to be addressed separately. For
>>>>>>>> example, should we think about redefining it as a container format,
>>>>>>>> with a CIF header and UTF16 body (but still part of the
>>>>>>>> "Crystallographic Information Framework")?
>>>>>>>>
>>>>>>>> The idea of an imgCIF "header" in CIF format and an image in another
>>>>>>>> is an old, well-established, thoroughly discussed, and mistaken idea,
>>>>>>>> rejected in 1998. The handling of multiple images in a single file
>>>>>>>> (e.g. a jpeg thumbnail and crystal image and a full-size diffraction
>>>>>>>> image) requires the ability to switch among encodings within the
>>>>>>>> file -- something handled by the current DDL2 and MIME-based imgCIF
>>>>>>>> format and which would be a serious problem in CIF2 as currently
>>>>>>>> proposed, increasing the chances that we will have to move imgCIF
>>>>>>>> entirely into HDF5 and abandon the CIF representation entirely,
>>>>>>>> sharing only the dictionary and not the framework.
>>>>>>>>
>>>>>>>> If you look carefully, you will see a similar trend with mmCIF, in
>>>>>>>> which an XML representation sharing the dictionary plays a much more
>>>>>>>> important role than the CIF format.
>>>>>>>>
>>>>>>>> Is it really desirable to make the new CIF format so rigid and
>>>>>>>> unadaptable that major portions of macromolecular crystallography
>>>>>>>> end up migrating to very different formats, as they already are
>>>>>>>> doing? Yes, there is great value in having a common dictionary,
>>>>>>>> but would there not be additional value in having a sufficiently
>>>>>>>> flexible common format to allow for more software sharing than
>>>>>>>> we now have? Is it really desirable for us to continue in the
>>>>>>>> direction of a single macromolecular experiment having to
>>>>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data
>>>>>>>> during collection, CCP4-style CIF representations during processing
>>>>>>>> and deposition and legacy PDB and PDBML representations in subsequent
>>>>>>>> community use? If we could be a little bit more flexible, we might
>>>>>>>> be able to reduce the data interchange software burdens a little.
>>>>>>>> Right now, this discussion seems headed in the direction of simply
>>>>>>>> adding yet another data representation (DDLm/CIF2) to the mix,
>>>>>>>> increasing the chances of mistranslation and confusion, rather
>>>>>>>> than reducing them.
>>>>>>>>
>>>>>>>> Please, step back a bit from the detailed discussion of UTF8 and
>>>>>>>> look at the work-flow of doing and publishing crystallographic
>>>>>>>> experiments and let us try to make a contribution that simplifies
>>>>>>>> it, not one that makes it more complex than it needs to be.
>>>>>>>>
>>>>>>>> I suggest we need to meet and talk, either face-to-face, or by skype.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Herbert
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> cif2-encoding mailing list
>>>>>>>> cif2-encoding at iucr.org
>>>>>>>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> T +61 (02) 9717 9907
>>>>>>> F +61 (02) 9717 3145
>>>>>>> M +61 (04) 0249 4148