From jamesrhester at gmail.com Fri Sep 3 03:54:36 2010
From: jamesrhester at gmail.com (James Hester)
Date: Fri, 3 Sep 2010 12:54:36 +1000
Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To:
References: <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com>
Message-ID:

Herbert, you will note that I carefully wrote "de-facto" ASCII, by which I mean that virtually all, if not all, software for doing "useful work" with CIF, such as structural display programs, syntax checkers, refinement programs etc., reads and writes ASCII only. So while you can produce an EBCDIC or UTF16 encoded CIF1 file and proudly proclaim that it is CIF1 conformant, good luck in your quest to do useful work with it: you won't be able to input it as a starting model in any crystallographic packages, CheckCIF will complain, you won't be able to display the structure in all those nice programs...so in practice you are restricted to ASCII. As an additional and far more significant restriction, regardless of your CIF1 encoding, you must use only characters appearing in the ASCII character set in your CIF file.

My point is that UTF8-only CIF2 is *less* restrictive than the successful CIF1 standard, because more code points are available, with the same range of encoding schemes (i.e. effectively *one* encoding only).

If the only non-UTF8 use case will be imgCIF (that would appear to be the only non-ASCII use case for CIF1), we need to discuss this explicitly.

On Thu, Aug 26, 2010 at 9:24 PM, Herbert J.
Bernstein wrote:
> Um, but CIF1 is _not_ ascii-only. It is text in any acceptable local
> encoding.
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya at dowling.edu
> =====================================================
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From jamesrhester at gmail.com Fri Sep 3 04:21:11 2010
From: jamesrhester at gmail.com (James Hester)
Date: Fri, 3 Sep 2010 13:21:11 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics
In-Reply-To:
References: <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

Thanks Herbert for providing the imgCIF perspective.

I am unfortunately severely restricted in my ability to attend overseas meetings at present, for family and work reasons. I am also keen to have our discussions written down and available for perusal by those that will come later.

We need to discuss the relationship of imgCIF to CIF2 explicitly, if imgCIF is going to influence our decision-making. Some questions for Herbert to answer for the record:

1. How widely used are non-CBF forms of imgCIF at present? By "widely used" I mean both
 (a) supported by software packages that allow one to do "useful work", most obviously to extract diffraction spots
 (b) provided as an output format (even optionally) by beamlines or detector manufacturers

2.
What is the advantage of having "pure text" image files? Why isn't a format like CBF more appropriate?

3. What is the problem with a scenario where "pure text" imgCIF remains in its current CIF1 form, and CIF2 advances are incorporated into the CIF sections of CBF?

Herbert: your work merging a DDL2-based version with DDLm-like features in HDF5 format sounds interesting. Are you planning to present a motivation and/or discussion of this work at some stage?

On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein wrote:
> Dear James,
>
>  I have not been at all reticent -- imgCIF will be very poorly supported
> by CIF2 as currently proposed. Of necessity, imgCIF changes encodings
> internally -- that is why it uses MIME -- same problem as email with
> images, same solution.
>
>  Any purely text version has at least a 7% overhead as compared to
> pure binary. Restricting to UTF-8 increases the overhead to at least 50%.
> We may get away with the 7% (UTF-16). The 50% version (UTF-8) will be
> ignored by the community as unworkable. The most likely to be used version
> will be the current DDL2-based version with embedded compressed binaries
> that I am augmenting with DDLm-like features
> and merging in with HDF5.
>
>  As I noted many months ago, the unfortunate reality is that the
> current CIF2 effort will not merge well with imgCIF. If avoiding
> a split is important -- we need a meeting. I would suggest
> involving Bob Sweet and holding it at BNL in conjunction with
> something relevant to NSLS-II.
>
>  Regards,
>    Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya at dowling.edu
> =====================================================
>
> On Tue, 24 Aug 2010, James Hester wrote:
>
>> Hi Herbert: regarding imgCIF, I agree that splitting it off is not a
>> desirable outcome. I would like to get an idea of how well imgCIF can
>> be accommodated under the various encoding proposals currently
>> floating around, as you have been rather reticent to bring it up. My
>> naive take on things is that a UTF8-only encoding scheme for CIF2
>> would not pose significant issues for imgCIF, and a decorated UTF16
>> encoding in the style of Scheme B would be even better, and quite
>> adequate, so imgCIF is not actually presenting any problems and so was
>> a red herring.
>>
>> I'm not sure that face-to-face or Skype discussions are necessarily
>> going to be more productive. Writing things down, while slower,
>> allows me at least to collect my thoughts and those of other
>> participants, and hopefully make a reasoned contribution (my apologies
>> if I am too long-winded) and as an added bonus those thoughts are
>> recorded for later reference. For example, where would I now find the
>> background on why a container format for imgCIF is such a bad idea?
>> Presumably that was all thrashed out in face to face discussions, and
>> no record now remains.
>>
>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein
>> wrote:
>>>
>>> Dear Colleagues,
>>>
>>>   James' and John's last interchange is so voluminous, I doubt any of
>>> us has been able to fully appreciate the rich complexity of ideas
>>> contained therein. For example, one of the suggestions far down in
>>> the text is:
>>>
>>> (James now) Indeed. My intent with this specification was to ensure
>>> that third parties would be able to recover the encoding. If imgCIF is
>>> going to cause us to make such an open-ended specification, it is
>>> probably a sign that imgCIF needs to be addressed separately. For
>>> example, should we think about redefining it as a container format,
>>> with a CIF header and UTF16 body (but still part of the
>>> "Crystallographic Information Framework")?
>>>
>>> The idea of an imgCIF "header" in CIF format and an image in another is an
>>> old, well-established, thoroughly discussed, and mistaken idea, rejected
>>> in 1998. The handling of multiple images in a single file (e.g.
>>> a jpeg thumbnail and crystal image and a full-size diffraction image)
>>> requires the ability to switch among encodings within the file --
>>> something handled by the current DDL2 and MIME-based imgCIF format and
>>> which would be a serious problem in CIF2 as currently proposed,
>>> increasing the chances that we will have to move imgCIF entirely into
>>> HDF5 and abandon the CIF representation entirely, sharing only
>>> the dictionary and not the framework.
>>>
>>> If you look carefully, you will see a similar trend with mmCIF, in which
>>> an XML representation sharing the dictionary plays a much more
>>> important role than the CIF format.
>>>
>>> Is it really desirable to make the new CIF format so rigid and
>>> unadaptable that major portions of macromolecular crystallography
>>> end up migrating to very different formats, as they already are
>>> doing? Yes, there is great value in having a common dictionary,
>>> but would there not be additional value in having a sufficiently
>>> flexible common format to allow for more software sharing than
>>> we now have? Is it really desirable for us to continue in the
>>> direction of a single macromolecular experiment having to
>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data
>>> during collection, CCP4-style CIF representations during processing
>>> and deposition and legacy PDB and PDBML representations in subsequent
>>> community use? If we could be a little bit more flexible, we might be
>>> able to reduce the data interchange software burdens a little.
>>> Right now, this discussion seems headed in the direction of simply
>>> adding yet another data representation (DDLm/CIF2) to the mix,
>>> increasing the chances of mistranslation and confusion, rather
>>> than reducing them.
>>>
>>> Please, step back a bit from the detailed discussion of UTF8 and
>>> look at the work-flow of doing and publishing crystallographic
>>> experiments and let us try to make a contribution that simplifies
>>> it, not one that makes it more complex than it needs to be.
>>>
>>> I suggest we need to meet and talk, either face-to-face, or by skype.
>>>
>>> Regards,
>>>   Herbert
>>>
>>> =====================================================
>>>  Herbert J. Bernstein, Professor of Computer Science
>>>    Dowling College, Kramer Science Center, KSC 121
>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                  +1-631-244-3035
>>>                  yaya at dowling.edu
>>> =====================================================
>>>
>>> _______________________________________________
>>> cif2-encoding mailing list
>>> cif2-encoding at iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From yaya at bernstein-plus-sons.com Fri Sep 3 04:51:19 2010
From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein)
Date: Thu, 2 Sep 2010 23:51:19 -0400 (EDT)
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics
In-Reply-To:
References: <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

Comments interpolated below.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

On Fri, 3 Sep 2010, James Hester wrote:

> Thanks Herbert for providing the imgCIF perspective.
>
> I am unfortunately severely restricted in my ability to attend
> overseas meetings at present, for family and work reasons. I am also
> keen to have our discussions written down and available for perusal by
> those that will come later.

How about an e-meeting?

>
> We need to discuss the relationship of imgCIF to CIF2 explicitly, if
> imgCIF is going to influence our decisionmaking. Some questions for
> Herbert to answer for the record:
>
> 1. How widely used are non-CBF forms of imgCIF at present? By "widely
> used" I mean both
> (a) supported by software packages that allow one to do "useful
> work", most obviously to extract diffraction spots

I assume by "non-CBF" you mean the forms that do the binary sections in something that is not pure binary -- all software that uses CBFlib supports them automatically for reading. For writing, most software chooses one representation, usually byte-offset or packed binary, except when we have to debug -- then the ASCII forms, especially the hexdump form, are very useful.

> (b) provided as an output format (even optionally) by beamlines or
> detector manufacturers

See above

> 2. What is the advantage of having "pure text" image files? Why isn't
> a format like CBF more appropriate?

While I agree, when we deal with people who like XML, e.g. the NeXus form of imgCIF, then we have no choice -- no binary is allowed, so UCS-2 becomes important. Don't ask me to defend XML. It is simply a fact of life.

> 3. What is the problem with a scenario where "pure text" imgCIF
> remains in its current CIF1 form, and CIF2 advances are incorporated
> into the CIF sections of CBF?

I don't understand this question, nor the assumptions behind it.

>
> Herbert: your work merging a DDL2-based version with DDLm-like
> features in HDF5 format sounds interesting. Are you planning to
> present a motivation and/or discussion of this work at some stage?

This is the subject of some grant applications, so not appropriate for detailed open discussion in this forum at this time. The motivations are simple -- to satisfy the demands of several major facilities for easy integration of crystallographic synchrotron images into HDF5-based data management systems while preserving access to metadata, and to extend HDF5 with relational meta-data access. This second aspect is an increasingly critical need and will go forward in any case. If we have a meeting or e-meeting, I can explain better.

From yaya at bernstein-plus-sons.com Fri Sep 3 05:02:52 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Fri, 3 Sep 2010 00:02:52 -0400 (EDT)
Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com>
Message-ID:

This sounds like circular reasoning, using non-standard-conforming applications as the definition of CIF1 and encouraging the creation of more non-standard-conforming software.
If CIF1 is to be redefined, then the proposed redefinition should be clearly stated and proposed to the community or COMCIFS is failing in its primary responsibility. Until some sort of a new CIF1 ASCII-only-based definition is put forward, discussed and accepted, I don't think it is appropriate to call that CIF1.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

From jamesrhester at gmail.com Fri Sep 3 05:22:09 2010
From: jamesrhester at gmail.com (James Hester)
Date: Fri, 3 Sep 2010 14:22:09 +1000
Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com>
Message-ID:

I agree that CIF1 is not *defined* as ASCII-only, and I have no wish to push for any redefinition. I am stating that CIF1 is used by the community *as if* it were ASCII-only. When speculating about the community response to CIF2, the actual community response to the CIF1 standard is a perfectly reasonable starting point.

Are you suggesting that a CIF1 application that accepts only ASCII encoding is not standards conformant? Because all that I am asserting is that useful CIF1 programs that support non-ASCII encodings are either rare or non-existent, despite being allowed by the standard. I see no hint of non-standards-conforming programs in this description.
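
One narrow technical point underlying this exchange can be checked directly: text restricted to the ASCII character set produces the same bytes whether it is written as ASCII or as UTF-8, whereas a UTF-16 encoding of the same text is not readable as UTF-8. The following is a minimal sketch in Python 3, not taken from any CIF software; the sample CIF fragment and the helper names `ascii_utf8_identical` and `decodes_as_utf8` are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: SAMPLE_CIF is a made-up fragment, and both helper
# functions are hypothetical names, not part of any CIF library.
SAMPLE_CIF = "data_example\n_cell_length_a   5.431\n_chemical_name_common 'silicon'\n"


def ascii_utf8_identical(text: str) -> bool:
    """Return True if text is pure ASCII and encodes to the same bytes in UTF-8."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character lies outside the ASCII character set.
        return False
    return ascii_bytes == text.encode("utf-8")


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the byte string is a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # An ASCII-only file is already a valid UTF-8 file, byte for byte.
    print(ascii_utf8_identical(SAMPLE_CIF))
    # A non-ASCII character (here the Angstrom sign) breaks the ASCII equivalence.
    print(ascii_utf8_identical("_note 'Å'\n"))
    # The same text written as UTF-16 (with a byte-order mark) is not valid UTF-8.
    print(decodes_as_utf8(SAMPLE_CIF.encode("utf-16")))
```

This says nothing about what the CIF1 standard permits, which is the point in dispute above; it only illustrates why software that reads ASCII today would read ASCII-only files unchanged under a UTF-8 rule.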
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From jamesrhester at gmail.com Fri Sep 3 06:33:17 2010
From: jamesrhester at gmail.com (James Hester)
Date: Fri, 3 Sep 2010 15:33:17 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics
In-Reply-To:
References: <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

> On Fri, 3 Sep 2010, James Hester wrote:
>
>> Thanks Herbert for providing the imgCIF perspective.
>>
>> I am unfortunately severely restricted in my ability to attend
>> overseas meetings at present, for family and work reasons. I am also
>> keen to have our discussions written down and available for perusal by
>> those that will come later.
>
> How about an e-meeting?

OK, I think we need to try online as my carefully crafted arguments seem to be misunderstood more often than not. Let me buy a web cam first!

>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if
>> imgCIF is going to influence our decisionmaking. Some questions for
>> Herbert to answer for the record:
>>
>> 1. How widely used are non-CBF forms of imgCIF at present? By "widely
>> used" I mean both
>>  (a) supported by software packages that allow one to do "useful
>> work", most obviously to extract diffraction spots
>
> I assume by "non-CBF" you mean the forms that do the binary sections
> in something that is not pure binary -- all software that uses CBFlib
> supports them automatically for reading. For writing, most software
> chooses one representation, usually byte-offset or
> packed binary, except when we have to debug -- then the ASCII
> forms, especially the hexdump form, are very useful.

You are correct in interpreting what I mean by "non-CBF". I understand that CBFlib supports everything, but CBFlib on its own is not useful. Do you know approximately what programs use CBFlib? I know only of rasmol, but you presumably know of many more.

>>  (b) provided as an output format (even optionally) by beamlines or
>> detector manufacturers
> See above

I see nothing in your reply on the availability of imgCIF files from detectors or instruments.

>> 2. What is the advantage of having "pure text" image files? Why isn't
>> a format like CBF more appropriate?
>
> While I agree, when we deal with people who like XML, e.g. the NeXus
> form of imgCIF, then we have no choice -- no binary is allowed, so
> UCS-2 becomes important. Don't ask me to defend XML. It is simply a
> fact of life.

I am guessing that this NeXus-XML requirement is coming from Diamond, and if this is what they want I can see why you are keen to integrate imgCIF into HDF5, so that HDF5-XML conversion can be carried out the standard HDF5 way, rather than encapsulating the entire imgCIF file as a NeXus-XML dataset. OK: so apart from this relatively recent and frankly crazy-weird use case, is there any other use case for pure-text imgCIF? Can we regard the "Diamond" case as a bureaucratically-driven kluge that will be resolved via your HDF5 work, leaving no other reason to create a space-efficient CIF2 version of imgCIF?

>> 3. What is the problem with a scenario where "pure text" imgCIF
>> remains in its current CIF1 form, and CIF2 advances are incorporated
>> into the CIF sections of CBF?
>
> I don't understand this question, nor the assumptions behind it.

Let me be less obtuse: I envision a CBF2 format, which is a CBF file with CIF2 instead of CIF1 syntax. A corresponding imgCIF2 format exists. We *do not care* about the space-efficiency of these imgCIF2 files. We recommend that all new crystallographic image-handling applications should target CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant. Legacy applications, of which there are very few, will be restricted to the original imgCIF, which is very rarely produced in any case (anticipating your answers to my above questions). What are your (Herbert's, anybody else's) thoughts on such a plan?

>> Herbert: your work merging a DDL2-based version with DDLm-like
>> features in HDF5 format sounds interesting. Are you planning to
>> present a motivation and/or discussion of this work at some stage?
>
> This is the subject of some grant applications, so not appropriate for
> detailed open discussion in this forum at this time.
> The motivations
> are simple -- to satisfy the demands of several major facilities for
> easy integration of crystallographic synchrotron images into HDF5-based data
> management systems while preserving access to metadata, and to extend HDF5
> with relational meta-data access.  This second aspect is an increasingly
> critical need and will go forward in any case.  If we have
> a meeting or e-meeting, I can explain better.

OK, I think reading between the lines I see where this is coming from
(read your CACM article as well, BTW).  It'd be good to discuss some
of these plans at some stage.

>>
>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein
>> wrote:
>>>
>>> Dear James,
>>>
>>>  I have not been at all reticent -- imgCIF will be very poorly supported
>>> by CIF2 as currently proposed.  Of necessity, imgCIF changes encodings
>>> internally -- that is why it uses MIME -- same problem as email with
>>> images, same solution.
>>>
>>>  Any purely text version has at least a 7% overhead as compared to
>>> pure binary.  Restricting to UTF-8 increases the overhead to at least
>>> 50%.  We may get away with the 7% (UTF-16).  The 50% version (UTF-8)
>>> will be ignored by the community as unworkable.  The most likely to be
>>> used version will be the current DDL2-based version with embedded
>>> compressed binaries that I am augmenting with DDLm-like features
>>> and merging in with HDF5.
>>>
>>>  As I noted many months ago, the unfortunate reality is that the
>>> current CIF2 effort will not merge well with imgCIF.  If avoiding
>>> a split is important -- we need a meeting.  I would suggest
>>> involving Bob Sweet and holding it at BNL in conjunction with
>>> something relevant to NSLS-II.
>>>
>>>  Regards,
>>>    Herbert
>>>
>>> =====================================================
>>>  Herbert J. Bernstein, Professor of Computer Science
>>>    Dowling College, Kramer Science Center, KSC 121
>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                  +1-631-244-3035
>>>                  yaya at dowling.edu
>>> =====================================================
>>>
>>> On Tue, 24 Aug 2010, James Hester wrote:
>>>
>>>> Hi Herbert: regarding imgCIF, I agree that splitting it off is not a
>>>> desirable outcome.  I would like to get an idea of how well imgCIF can
>>>> be accommodated under the various encoding proposals currently
>>>> floating around, as you have been rather reticent to bring it up.  My
>>>> naive take on things is that a UTF8-only encoding scheme for CIF2
>>>> would not pose significant issues for imgCIF, and a decorated UTF16
>>>> encoding in the style of Scheme B would be even better, and quite
>>>> adequate, so imgCIF is not actually presenting any problems and so was
>>>> a red herring.
>>>>
>>>> I'm not sure that face-to-face or Skype discussions are necessarily
>>>> going to be more productive.  Writing things down, while slower,
>>>> allows me at least to collect my thoughts and those of other
>>>> participants, and hopefully make a reasoned contribution (my apologies
>>>> if I am too long-winded) and as an added bonus those thoughts are
>>>> recorded for later reference.  For example, where would I now find the
>>>> background on why a container format for imgCIF is such a bad idea?
>>>> Presumably that was all thrashed out in face to face discussions, and
>>>> no record now remains.
>>>>
>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein
>>>> wrote:
>>>>>
>>>>> Dear Colleagues,
>>>>>
>>>>>   James' and John's last interchange is so voluminous, I doubt any of
>>>>> us has been able to fully appreciate the rich complexity of ideas
>>>>> contained therein.  For example, one of the suggestions far down in
>>>>> the text is:
>>>>>
>>>>> (James now)  Indeed.  My intent with this specification was to ensure
>>>>> that third parties would be able to recover the encoding. If imgCIF is
>>>>> going to cause us to make such an open-ended specification, it is
>>>>> probably a sign that imgCIF needs to be addressed separately.  For
>>>>> example, should we think about redefining it as a container format,
>>>>> with a CIF header and UTF16 body (but still part of the
>>>>> "Crystallographic Information Framework")?
>>>>>
>>>>> The idea of an imgCIF "header" in CIF format and an image in another
>>>>> is an old, well-established, thoroughly discussed, and mistaken idea,
>>>>> rejected in 1998.  The handling of multiple images in a single file
>>>>> (e.g. a jpeg thumbnail and crystal image and a full-size diffraction
>>>>> image) requires the ability to switch among encodings within the
>>>>> file -- something handled by the current DDL2 and MIME-based imgCIF
>>>>> format and which would be a serious problem in CIF2 as currently
>>>>> proposed, increasing the chances that we will have to move imgCIF
>>>>> entirely into HDF5 and abandon the CIF representation entirely,
>>>>> sharing only the dictionary and not the framework.
>>>>>
>>>>> If you look carefully, you will see a similar trend with mmCIF, in
>>>>> which an XML representation sharing the dictionary plays a much more
>>>>> important role than the CIF format.
>>>>>
>>>>> Is it really desirable to make the new CIF format so rigid and
>>>>> unadaptable that major portions of macromolecular crystallography
>>>>> end up migrating to very different formats, as they already are
>>>>> doing?  Yes, there is great value in having a common dictionary,
>>>>> but would there not be additional value in having a sufficiently
>>>>> flexible common format to allow for more software sharing than
>>>>> we now have?  Is it really desirable for us to continue in the
>>>>> direction of a single macromolecular experiment having to
>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data
>>>>> during collection, CCP4-style CIF representations during processing
>>>>> and deposition and legacy PDB and PDBML representations in subsequent
>>>>> community use?  If we could be a little bit more flexible, we might be
>>>>> able to reduce the data interchange software burdens a little.
>>>>> Right now, this discussion seems headed in the direction of simply
>>>>> adding yet another data representation (DDLm/CIF2) to the mix,
>>>>> increasing the chances of mistranslation and confusion, rather
>>>>> than reducing them.
>>>>>
>>>>> Please, step back a bit from the detailed discussion of UTF8 and
>>>>> look at the work-flow of doing and publishing crystallographic
>>>>> experiments and let us try to make a contribution that simplifies
>>>>> it, not one that makes it more complex than it needs to be.
>>>>>
>>>>> I suggest we need to meet and talk, either face-to-face, or by skype.
>>>>>
>>>>> Regards,
>>>>>   Herbert
>>>>>
>>>>> =====================================================
>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>
>>>>>                  +1-631-244-3035
>>>>>                  yaya at dowling.edu
>>>>> =====================================================

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From yaya at bernstein-plus-sons.com Fri Sep 3 14:10:56 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Fri, 3 Sep 2010 09:10:56 -0400 (EDT)
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

Here is more detail on the use of CBFlib.  I know for sure that
CBFlib is used directly by mosflm and adxv.  While XDS uses code
that was prototyped in the Fortran part of CBFlib, they work with
their own versions.
However, Kay Diederichs has also used the CBFlib C code for work on simulations. Paul Ellis started HKL2000 off with CBFlib, but I don't know if they stayed with it. As a practical matter, whether someone uses CBFlib itself, it is an essential part of the documentation that people use to understand how the various compression schemes work, and they use the utility cif2cbf from the package both as an external converter and as a validator and as a debugger when they don't want to put all the functionality in their own code. If you have a funny CBF in any of the semi-infinite number of representations, cif2cbf allows you to check it, get a hex dump of it or convert it to a specific compression scheme or format that some other program needs to process that file. In other words, CBFlib on its own _is_ useful. Sorry about not giving you a list re imgCIF use, I thought you were asking me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M produces imgCIF as the default. This had been a byte-offset compressed binary with a mini-header. Dectris has now moved up to writing a full header. There were some beamlines with some of the older smaller Dectris detectors that were producing TIFF, but all currently delivered Dectris detectors of all sizes produce imgCIF as the default. All the major detector manufacturers now offer CBF as an option except for Bruker which is debugging an optional CBF output. When I checked at the ACA meeting in July they all also said that their processing packages can accept CBF as an input. On the XML use, I would suggest a more broad-minded attitude. Judging from the workshop I was at in January at ESRF, it has much broader support than just from Diamond, especially for spectra which have smaller data volume than images. HDF5 is the most widely accepted scientific binary data format for the physics community, and XML is the easiest and most reliable way to port smaller HDF5 datasets from site to site. 
The problem with XML is that for large files such as crystallographic images ordinary straight-text XML produces huge, impractical files. binutf allows for a compromise in which you have a true XML UCS-2 file but with the binary having only a 7% overhead. I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2 binary sections. If COMCIFS repeats the unfortunate decision of 1997 of saying that what the synchrotron community needs can't be called CIF, we'll just go back to calling it imgNCIF (which is an acronym for image-not-CIF), but we will still have to produce it for the community. In 1998 after we had a face-to-face discussion at a BNL workshop, that decision was reversed and what the synchrotron community needed was folded under the CIF umbrella, and imgNCIF became imgCIF. I hope we can have discussions now to avoid the need for a pointless schism. Your proposal on the relationship between CIF2 and imgCIF sounds like a replay of the discussions we had in 1997, with CIF headers following one standard and binary sections following another. You can make that work, but it is clumsy and hard for users to work with. It is better if we have one simple, comprehensible standard for the files they work with as a whole. Let me be clear -- imgCIF is produced worldwide and used for thousands of images daily. These older "legacy" imgCIF images will be around for a long time to come, and whatever new imgCIF (or if you force us to it, imgNCIF) images we produce will need to be, and will be, supported by software that handles both the legacy and the new images and has a clean interface to HDF5 and XML as well. I would greatly prefer that this be coordinated with COMCIFS and done in a way that helps the community to understand the relationship between CIF and imgCIF, but if COMCIFS feels a need to return to its 1997 position and exclude the data we work with from its charge, then imgCIF can return to being imgNCIF. 
If we are to resolve this, then, as in 1998, we need a meeting or
e-meeting.  Once you have a web-cam, I would suggest you and I have
a skype meeting to frame the issues in dispute and organize a wider
meeting.

  -- Herbert
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

From John.Bollinger at STJUDE.ORG Fri Sep 3 16:36:15 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Fri, 3 Sep 2010 10:36:15 -0500
Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDB0@SJMEMXMBS11.stjude.sjcrh.local>

James: I am not ignoring our ongoing blockbuster exchange, but I have
been unable to devote the time to it that it deserves.  In the mean
time, I have a shorter response to these comments:

On Thursday, September 02, 2010 11:22 PM, James Hester wrote:
>I agree that CIF1 is not *defined* as ASCII-only, and I have no wish
>to push for any redefinition.  I am stating that CIF1 is used by the
>community *as if* it were ASCII-only.

I think it's more accurate to say that CIF1 is used by the community
under the assumption that CIFs comply with the default text conventions
for the environment.  This is reasonable, because the CIF1 design
assumes that exchange of CIFs between dissimilar environments
involves conversion from one set of text conventions to another (sounds
familiar?).  For example, CIF1 processors are not required to recognize
non-native line termination semantics.

CIF1's limited character repertoire and the great prevalence of
ASCII-compatible character encodings make it tempting to describe that
situation as de facto ASCII-only.
That is a mischaracterization, however, ignoring CIF1's assumption of text conversion accompanying CIF exchange. That assumption makes a great difference if you want to design CIF software that is reasonably portable to systems that do not default to an ASCII-compatible encoding. > When speculating about the >community response to CIF2, the actual community response to the CIF1 >standard is a perfectly reasonable starting point. Indeed, hence the continuing line of argument that users would want to continue to use CIFs encoded according to local convention, just as they already do. The new and disruptive thing here is support for non-native encodings, which in most places include UTF-8. I want UTF-8, but it's not free. >Are you suggesting that a CIF1 application that accepts only ASCII >encoding is not standards conformant? I am amused to see you arguing the other side of the "CIF software must accept all compliant CIFs" argument now :) . I don't know about Herb, but I would find that program's behavior unacceptable if it were running on an EBCDIC-based computer. The standard says almost nothing about program behavior, so I could not call the *program* non-conformant, but it would reject conformant (EBCDIC-encoded) CIFs that I would expect it to accept. > Because all that I am asserting >is that useful CIF1 programs that support non-ASCII encodings are >either rare or non-existent, despite being allowed by the standard. I >see no hint of non-standards-conforming programs in this description. I suspect that many CIF1 programs would in fact support a non-ASCII encoding just fine when used on a system where that encoding is the default. In fact, I expect that many of them would fail on ASCII- (or UTF-8-)encoded CIFs in such an environment. In other words, I believe that there are many useful CIF1 programs that support non-ASCII encodings, simply as a result of assuming default text conventions. This is the difference between "ASCII-only" and "text". John -- John C. 
Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From jamesrhester at gmail.com Tue Sep 7 06:03:02 2010 From: jamesrhester at gmail.com (James Hester) Date: Tue, 7 Sep 2010 15:03:02 +1000 Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDB0@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDB0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: I will try not to make this exchange into another blockbuster! Comments inserted below. On Sat, Sep 4, 2010 at 1:36 AM, Bollinger, John C wrote: > James: I am not ignoring our ongoing blockbuster exchange, but I have > been unable to devote the time to it that it deserves. ?In the mean > time, I have a shorter response to these comments: > > On Thursday, September 02, 2010 11:22 PM, James Hester wrote: >>I agree that CIF1 is not *defined* as ASCII-only, and I have no wish >>to push for any redefinition. ?I am stating that CIF1 is used by the >>community *as if* it were ASCII-only. > > I think it's more accurate to say that CIF1 is used by the community > under the assumption that CIFs comply with the default text conventions > for the environment. ?This is reasonable, because the CIF1 design > assumes that exchange of CIFs between dissimilar environments > involves conversion from one set of text conventions to another (sounds > familiar?). 
For example, CIF1 processors are not required to recognize > non-native line termination semantics. > > CIF1's limited character repertoire and the great prevalence of > ASCII-compatible character encodings make it tempting to describe that > situation as de facto ASCII-only. That is a mischaracterization, > however, ignoring CIF1's assumption of text conversion accompanying CIF > exchange. That assumption makes a great difference if you want to > design CIF software that is reasonably portable to systems that do not > default to an ASCII-compatible encoding. Yes, I might have been overstretching to characterise the current situation as involving the community choosing to use ASCII, rather than simply having to use ASCII. Nevertheless, I believe it is still fair to say that substituting UTF8 for ASCII reduces the restrictions on CIF users, rather than increases them. >> When speculating about the >>community response to CIF2, the actual community response to the CIF1 >>standard is a perfectly reasonable starting point. > > Indeed, hence the continuing line of argument that users would want to > continue to use CIFs encoded according to local convention, just as they > already do. The new and disruptive thing here is support for non-native > encodings, which in most places include UTF-8. I want UTF-8, but it's > not free. > >>Are you suggesting that a CIF1 application that accepts only ASCII >>encoding is not standards conformant? > > I am amused to see you arguing the other side of the "CIF software must > accept all compliant CIFs" argument now :) . Unfortunately the horse has well and truly bolted on the CIF1 standard (with which I was not involved), so I have been attempting to use it as a real-life test for what happens when optional behaviour remains in the standard. As you rightly point out, it turns out not to be a particularly enlightening test, as the apparent open slather in CIF1 encodings essentially reduces to ASCII everywhere.
I do not yet resile from my "CIF software must accept all compliant CIFs" line, but it is too late (and pointless) to go back and fix CIF1 to make this possible. > I don't know about Herb, but I would find that program's behavior > unacceptable if it were running on an EBCDIC-based computer. The > standard says almost nothing about program behavior, so I could not call > the *program* non-conformant, but it would reject conformant > (EBCDIC-encoded) CIFs that I would expect it to accept. I take your point, and think I see in what sense my putative ASCII-only CIF program would have been non-conformant to the CIF1 standard in an EBCDIC environment, hence Herb's cryptic (to me) remark about non-conformance. >> Because all that I am asserting >>is that useful CIF1 programs that support non-ASCII encodings are >>either rare or non-existent, despite being allowed by the standard. I >>see no hint of non-standards-conforming programs in this description. > > I suspect that many CIF1 programs would in fact support a non-ASCII encoding > just fine when used on a system where that encoding is the default. In > fact, I expect that many of them would fail on ASCII- (or UTF-8-)encoded CIFs > in such an environment. In other words, I believe that there are many useful > CIF1 programs that support non-ASCII encodings, simply as a result of assuming > default text conventions. This is the difference between "ASCII-only" and > "text". I have strong doubts that such local defaults are sufficiently robust, well-defined, and consistently used to allow us to follow CIF1 in this respect. And, even if such local defaults were reliable, the issue would remain of writing portable CIF programs (portable between language environments as well as operating systems) and transfer of such files between each language environment/operating system. CIF1 put these text conversion issues into the "somebody else's problem" basket (e.g.
let Chester figure out what to do with an EBCDIC-encoded CIF), which perhaps was efficient and responsible when the only contenders were "ASCII-compatible" and "almost-extinct EBCDIC". all the best, James. > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > Email Disclaimer: ?www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From John.Bollinger at STJUDE.ORG Wed Sep 8 16:45:46 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Wed, 8 Sep 2010 10:45:46 -0500 Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDB0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDB8@SJMEMXMBS11.stjude.sjcrh.local> On Tuesday, September 07, 2010 12:03 AM, James Hester wrote: >On Sat, Sep 4, 2010 at 1:36 AM, Bollinger, John C > wrote: >> I suspect that many CIF1 programs would in fact support a non-ASCII encoding >> just fine when used on a system where that encoding is the default. In >> fact, I expect that many of them would fail on ASCII- (or UTF-8-)encoded CIFs >> in such an environment. 
In other words, I believe that there are many useful >> CIF1 programs that support non-ASCII encodings, simply as a result of assuming >> default text conventions. This is the difference between "ASCII-only" and >> "text". > >I have strong doubts that such local defaults are sufficiently robust, >well-defined, and consistently used to allow us to follow CIF1 in this >respect. One of the reasons this disagreement has been so difficult to resolve is that there is no agreed set of goals or priorities for us to rely upon. Our discussion has ascended to that level once or twice, where it revealed that our divide goes that high. In this particular case, I don't think we can agree about whether CIF2 can follow CIF1 in this respect without first agreeing on the objectives for CIF2. > And, even if such local defaults were reliable, the issue >would remain of writing portable CIF programs (portable between >language environments as well as operating systems) and transfer of >such files between each language environment/operating system. Those are appropriate considerations for the discussion we should be having, about the objectives for CIF2. I do want CIF2 to support portable programs (in both the senses you mention). I want it also to be easy to program for, to the extent that's possible. On the other hand, although I see the allure of binary uniformity for CIFs, I am not persuaded that that's where CIF2 should go. I would have additionally preferred full backwards compatibility with CIF1. That's a lost cause, but there seems to be an idea that CIF2 should nevertheless be backwards-compatible in some approximate sense -- for instance, the draft talks about CIF2 software handling CIF1 syntax. Requiring UTF-8 for CIF2 is in direct opposition to that objective. With that as the rule, CIF1's expectation that CIFs comply with local text conventions would in CIF2 become a requirement that CIFs NOT comply with local text conventions on most current systems, unless by accident. 
We all know that files conforming to CIF1 would accidentally comply with the proposed CIF2 convention in many important cases, but it would be a mistake to let that obscure the fact that a 180-degree design reversal is being considered. Regards, John -- John C. Bollinger, Ph.D. Computing and X-Ray Scientist Department of Structural Biology St. Jude Children's Research Hospital John.Bollinger at StJude.org (901) 595-3166 [office] www.stjude.org Email Disclaimer: www.stjude.org/emaildisclaimer From John.Bollinger at STJUDE.ORG Wed Sep 8 17:31:07 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Wed, 8 Sep 2010 11:31:07 -0500 Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDB0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDB9@SJMEMXMBS11.stjude.sjcrh.local> On Tuesday, September 07, 2010 12:03 AM, James Hester wrote: >CIF1 put these text conversion issues into the "somebody else's >problem" basket (e.g. let Chester figure out what to do with an >EBCDIC-encoded CIF), which perhaps was efficient and responsible when >the only contenders were "ASCII-compatible" and "almost-extinct >EBCDIC". For what it's worth, I see a decent possibility that UTF-16 will become a non-negligible, non-ASCII-compatible contender in that arena within the next ten years, if it isn't already. It offers considerable space savings relative to UTF-8 for many non-Latin scripts. 
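To make the size claim concrete, here is a minimal Python sketch. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not drawn from any CIF file or from the CIF specifications. It simply compares byte counts for the same text under UTF-8 and UTF-16 (counted without a byte-order mark).

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of `text` under UTF-8 and UTF-16.

    "utf-16-le" is used so that no byte-order mark is included in the count.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


# Hypothetical sample strings, chosen only for illustration.
ascii_sample = "_cell_length_a"  # ASCII-only, as in typical CIF1 content
cjk_sample = "結晶構造"           # four CJK characters

for label, sample in (("ASCII", ascii_sample), ("CJK", cjk_sample)):
    print(label, encoded_sizes(sample))
```

For ASCII-only content UTF-16 doubles the size, while for scripts whose characters need three bytes in UTF-8 it is smaller; which effect dominates depends on the mix of text in the file.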
In any case, I disagree that CIF1 relegated text conversion to "somebody else's problem." Correctly exchanging text between dissimilar systems is *always* a joint problem of sender and receiver. As a service to the community, Chester shouldered the bulk of that burden for CIF1-conformant submissions, but that was their strategic decision, not something imposed on them by the standard. Chester will need to make a similar decision regarding CIF2, quite independently of what the standard ultimately says about text encoding. Either way, my recommendation would be that they accept only UTF-8, and perhaps something like scheme B. No formulation of the standard that allows those could obligate Chester to accept more. That pushes the responsibility back to the CIF submitter, where any Unicode-based form of CIF2 I can imagine requires it to be. In no way do I see any inefficiency or irresponsibility on the part of the CIF2 standardization effort should we ultimately agree that CIF2 encodings will not be limited to UTF-8. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From jamesrhester at gmail.com Fri Sep 10 08:51:56 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 10 Sep 2010 17:51:56 +1000 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Thanks Herbert for this detailed information, which is a great help to me in forming an opinion. Please understand that we are not even close to considering excluding imgCIF from CIF. 
Rather, I am collecting information in order to form an opinion and work with everybody to find a solution which then goes back to the DDLm group and then on to COMCIFS regarding CIF2. Speculation about potential consequences for imgCIF are just part of the information-gathering process. In general terms, CIF is now a 'framework', which I think will make bringing XML and HDF5 developments under the CIF umbrella relatively simple. Please also understand that my comments about the usefulness of CBFlib were in the context of a typical beamline user wishing to handle their data, rather than from a programmer's point of view. I was not casting aspersions on CBFlib, rather seeking more information (which you have provided). I am afraid that terminology here may be confusing me: I would like to talk about imgCIF as a pure ASCII format (eg IT Vol G p 40 para 15) and CBF as the binary equivalent. However, your previous statements indicate that imgCIF could also be written in UTF16 encoding. So: when you speak of the Dectris detector output as 'imgCIF', what encoding is used? The point you make about embedding imgCIF into a text-only format (in this case XML) is, I agree, a use-case that we have to consider. I see merit in the position that 'CIF2 content' inside a container is not constrained by encoding, in those cases where the container is able to specify the encoding itself. This is *pedantically* true already in that the 'header' of the container file as a whole is *not* the CIF2 magic header. So: what does everyone think of the following statement being included in the standard? "Note that a CIF2-conformant character stream that forms part of a larger stream is not constrained to be in UTF8 encoding if the encoding of the CIF2 stream is specified in a standards-conformant manner within the enclosing stream. For example, CIF2 content within an XML file is not constrained to be UTF8-encoded as standard XML attributes can be used to manage encoding." 
(Perhaps John B, who has shown superior wordsmithing capabilities, could polish this up a bit?) On Fri, Sep 3, 2010 at 11:10 PM, Herbert J. Bernstein wrote: > Here is more detail on the use of CBFlib. > > I know for sure that CBFlib is used directly by mosflm and adxv. While XDS > uses code that was prototyped in the Fortran part of CBFlib, they work with > their own versions. However, Kay Diederichs has also used the CBFlib C code > for work on simulations. Paul Ellis started HKL2000 off with CBFlib, but I > don't know if they stayed with it. > > As a practical matter, whether someone uses CBFlib itself, it is an > essential part of the documentation that people use to understand how the > various compression schemes work, and they use the utility cif2cbf from the > package both as an external converter and as a validator and as a debugger > when they don't want to put all the functionality in their own code. If you > have a funny CBF in any of the semi-infinite number of representations, > cif2cbf allows you to check it, get a hex dump of it or convert it to a > specific compression scheme or format that some other program needs to > process that file. > > In other words, CBFlib on its own _is_ useful. > > Sorry about not giving you a list re imgCIF use, I thought you were asking > me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M > produces imgCIF as the default. This had been a byte-offset compressed > binary with a mini-header. Dectris has now moved up to writing a full > header. There were some beamlines with some of the older smaller Dectris > detectors that were producing TIFF, but all currently delivered Dectris > detectors of all sizes produce imgCIF as the default. > > All the major detector manufacturers now offer CBF as an option except for > Bruker which is debugging an optional CBF output. When I checked at the ACA > meeting in July they all also said that their processing packages can accept > CBF as an input.
> > On the XML use, I would suggest a more broad-minded attitude. Judging from > the workshop I was at in January at ESRF, it has much broader support than > just from Diamond, especially for spectra which have smaller data volume > than images. HDF5 is the most widely accepted scientific binary data format > for the physics community, and XML is the easiest and most reliable way to > port smaller HDF5 datasets from site to site. The problem with XML is that > for large files such as crystallographic images ordinary straight-text XML > produces huge, impractical files. binutf allows for a compromise in which > you have a true XML UCS-2 file but with the binary having only a 7% > overhead. > > I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2 > binary sections. If COMCIFS repeats the unfortunate decision of 1997 of > saying that what the synchrotron community needs can't be called CIF, we'll > just go back to calling it imgNCIF (which is an acronym for image-not-CIF), > but we will still have to produce it for the community. In 1998 after we had > a face-to-face discussion at a BNL workshop, that decision was reversed and > what the synchrotron community needed was folded under the CIF umbrella, and > imgNCIF became imgCIF. I hope we can have discussions now to avoid the need > for a pointless schism. > > Your proposal on the relationship between CIF2 and imgCIF sounds like a > replay of the discussions we had in 1997, with CIF headers following one > standard and binary sections following another. You can make that work, but > it is clumsy and hard for users to work with. It is better if we have one > simple, comprehensible standard for the files they work with as a whole. > > Let me be clear -- imgCIF is produced worldwide and used for thousands of > images daily.
These older "legacy" imgCIF images will be around for a long > time to come, and whatever new imgCIF (or if you force us to it, imgNCIF) > images we produce will need to be, and will be, supported by software that > handles both the legacy and the new images and has a clean interface to HDF5 > and XML as well. I would greatly prefer that this be coordinated with > COMCIFS and done in a way that helps the community to understand the > relationship between CIF and imgCIF, but if COMCIFS feels a need to return > to its 1997 position and exclude the data we work with from its charge, then > imgCIF can return to being imgNCIF. > > If we are to resolve this, then, as in 1998, we need a meeting or e-meeting. > Once you have a web-cam, I would suggest you and I have a skype meeting to > frame the issues in dispute and organize a wider meeting. > > -- Herbert > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Fri, 3 Sep 2010, James Hester wrote: > >>> On Fri, 3 Sep 2010, James Hester wrote: >>> >>>> Thanks Herbert for providing the imgCIF perspective. >>>> >>>> I am unfortunately severely restricted in my ability to attend >>>> overseas meetings at present, for family and work reasons. I am also >>>> keen to have our discussions written down and available for perusal by >>>> those that will come later. >>> >>> How about an e-meeting? >> >> OK, I think we need to try online as my carefully crafted arguments >> seem to be misunderstood more often than not. >> Let me buy a web cam first! >> >>>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if >>>> imgCIF is going to influence our decision-making.
Some questions for >>>> Herbert to answer for the record: >>>> >>>> 1. How widely used are non-CBF forms of imgCIF at present? By "widely >>>> used" I mean both >>>> (a) supported by software packages that allow one to do "useful >>>> work", most obviously to extract diffraction spots >>> >>> I assume by "non-CBF" you mean the forms that do the binary sections >>> in something that is not pure binary -- all software that uses CBFlib >>> supports them automatically for reading. For writing, most software >>> chooses one representation for writing, usually byte-offset or >>> packed binary, except when we have to debug -- then the ascii >>> forms, esp. the hexdump form are very useful. >> >> You are correct in interpreting what I mean by "non-CBF". >> >> I understand that CBFlib supports everything, but CBFlib on its own is >> not useful. Do you know approximately what programs use CBFlib? I >> know only of rasmol, but you presumably know of many more. >> >>>> (b) provided as an output format (even optionally) by beamlines or >>>> detector manufacturers >> >>> See above >> >> I see nothing in your reply on the availability of imgCIF files from >> detectors or instruments. >> >>>> 2. What is the advantage of having "pure text" image files? Why isn't >>>> a format like CBF more appropriate? >>> >>> While I agree, when we deal with people who like XML e.g. the NeXus >>> form of imgCIF, then we have no choice -- no binary is allowed, so >>> UCS-2 becomes important. Don't ask me to defend XML. It is simply a >>> fact of life. >> >> I am guessing that this NeXuS-XML requirement is coming from Diamond, >> and if this is what they want I can see why you are keen to integrate >> imgCIF into HDF5, so that HDF5-XML conversion can be carried out the >> standard HDF5 way, rather than encapsulating the entire imgCIF file as >> a NeXuS-XML dataset.
OK: so apart from this relatively recent and >> frankly crazy-weird use case, is there any other use-case for >> pure-text imgCIF? Can we regard the "Diamond" case as a >> bureaucratically-driven kluge that will be resolved via your HDF5 >> work, leaving no other reason to create a space-efficient CIF2 version >> of imgCIF? >> >>>> 3. What is the problem with a scenario where "pure text" imgCIF >>>> remains in its current CIF1 form, and CIF2 advances are incorporated >>>> into the CIF sections of CBF? >>> >>> I don't understand this question, nor the assumptions behind it. >> >> Let me be less obtuse: >> I envision a CBF2 format, which is a CBF file with CIF2 instead of >> CIF1 syntax. A corresponding imgCIF2 format exists. We *do not care* >> about the space-efficiency of these imgCIF2 files. We recommend that >> all new crystallographic image-handling applications should target >> CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant. >> Legacy applications, of which there are very few, will be restricted >> to the original imgCIF, which is very rarely produced in any case >> (anticipating your answers to my above questions). >> >> What are your (Herbert's, anybody else's) thoughts on such a plan? >> >>>> Herbert: your work merging a DDL2-based version with DDLm-like >>>> features in HDF5 format sounds interesting. Are you planning to >>>> present a motivation and/or discussion of this work at some stage? >>> >>> This is the subject of some grant applications, so not appropriate for >>> detailed open discussion in this forum at this time. The motivations >>> are simple -- to satisfy the demands of several major facilities for >>> easy integration of crystallographic synchrotron images into HDF5-based >>> data >>> management systems while preserving access to metadata, and to extend >>> HDF5 >>> with relational meta-data access. This second aspect is an increasingly >>> critical need and will go forward in any case.
If we have >>> a meeting or e-meeting, I can explain better. >> >> OK, I think reading between the lines I see where this is coming from >> (read your CACM article as well, BTW). It'd be good to discuss some >> of these plans at some stage. >> >>>> >>>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein >>>> wrote: >>>>> >>>>> Dear James, >>>>> >>>>> I have not been at all reticent -- imgCIF will be very poorly >>>>> supported >>>>> by CIF2 as currently proposed. Of necessity, imgCIF changes encodings >>>>> internally -- that is why it uses MIME -- same problem as email with >>>>> images, same solution. >>>>> >>>>> Any purely text version has at least a 7% overhead as compared to >>>>> pure binary. Restricting to UTF-8 increases the overhead to at least >>>>> 50%. >>>>> We may get away with the 7% (UTF-16). The 50% version (UTF-8) will be >>>>> ignored by the community as unworkable. The most likely to be used >>>>> version >>>>> will be the current DDL2-based version with embedded compressed >>>>> binaries >>>>> that I am augmenting with DDLm-like features >>>>> and merging in with HDF5. >>>>> >>>>> As I noted many months ago, the unfortunate reality is that the >>>>> current CIF2 effort will not merge well with imgCIF. If avoiding >>>>> a split is important -- we need a meeting. I would suggest >>>>> involving Bob Sweet and holding it at BNL in conjunction with >>>>> something relevant to NSLS-II. >>>>> >>>>> Regards, >>>>> Herbert >>>>> >>>>> ===================================================== >>>>> Herbert J. Bernstein, Professor of Computer Science >>>>> Dowling College, Kramer Science Center, KSC 121 >>>>> Idle Hour Blvd, Oakdale, NY, 11769 >>>>> >>>>> +1-631-244-3035 >>>>>
yaya at dowling.edu >>>>> ===================================================== >>>>> >>>>> On Tue, 24 Aug 2010, James Hester wrote: >>>>> >>>>>> Hi Herbert: regarding imgCIF, I agree that splitting it off is not a >>>>>> desirable outcome. I would like to get an idea of how well imgCIF can >>>>>> be accommodated under the various encoding proposals currently >>>>>> floating around, as you have been rather reticent to bring it up. My >>>>>> naive take on things is that a UTF8-only encoding scheme for CIF2 >>>>>> would not pose significant issues for imgCIF, and a decorated UTF16 >>>>>> encoding in the style of Scheme B would be even better, and quite >>>>>> adequate, so imgCIF is not actually presenting any problems and so was >>>>>> a red herring. >>>>>> >>>>>> I'm not sure that face-to-face or Skype discussions are necessarily >>>>>> going to be more productive. Writing things down, while slower, >>>>>> allows me at least to collect my thoughts and those of other >>>>>> participants, and hopefully make a reasoned contribution (my apologies >>>>>> if I am too long-winded) and as an added bonus those thoughts are >>>>>> recorded for later reference. For example, where would I now find the >>>>>> background on why a container format for imgCIF is such a bad idea? >>>>>> Presumably that was all thrashed out in face to face discussions, and >>>>>> no record now remains. >>>>>> >>>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein >>>>>> wrote: >>>>>>> >>>>>>> Dear Colleagues, >>>>>>> >>>>>>> James' and John's last interchange is so voluminous, I doubt any of >>>>>>> us has been able to fully appreciate the rich complexity of ideas >>>>>>> contained therein. For example, one of the suggestions far down in >>>>>>> the text is: >>>>>>> >>>>>>> (James now) Indeed. My intent with this specification was to ensure >>>>>>> that third parties would be able to recover the encoding.
If imgCIF >>>>>>> is >>>>>>> going to cause us to make such an open-ended specification, it is >>>>>>> probably a sign that imgCIF needs to be addressed separately. For >>>>>>> example, should we think about redefining it as a container format, >>>>>>> with a CIF header and UTF16 body (but still part of the >>>>>>> "Crystallographic Information Framework")? >>>>>>> >>>>>>> The idea of an imgCIF "header" in CIF format and an image in another >>>>>>> is >>>>>>> an >>>>>>> old, well-established, thoroughly discussed, and mistaken idea, >>>>>>> rejected >>>>>>> in 1998. The handling of multiple images in a single file (e.g. >>>>>>> a jpeg thumbnail and crystal image and a full-size diffraction image) >>>>>>> requires the ability to switch among encodings within the file -- >>>>>>> something handled by the current DDL2 and MIME-based imgCIF format >>>>>>> and >>>>>>> which would be a serious problem in CIF2 as currently proposed, >>>>>>> increasing the chances that we will have to move imgCIF entirely into >>>>>>> HDF5 and abandon the CIF representation entirely, sharing only >>>>>>> the dictionary and not the framework. >>>>>>> >>>>>>> If you look carefully, you will see a similar trend with mmCIF, in >>>>>>> which >>>>>>> an XML representation sharing the dictionary plays a much more >>>>>>> important role than the CIF format. >>>>>>> >>>>>>> Is it really desirable to make the new CIF format so rigid and >>>>>>> unadaptable that major portions of macromolecular crystallography >>>>>>> end up migrating to very different formats, as they already are >>>>>>> doing? Yes, there is great value in having a common dictionary, >>>>>>> but would there not be additional value in having a sufficiently >>>>>>> flexible common format to allow for more software sharing than >>>>>>> we now have?
Is it really desirable for us to continue in the >>>>>>> direction of a single macromolecular experiment having to >>>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data >>>>>>> during collection, CCP4-style CIF representations during processing >>>>>>> and deposition and legacy PDB and PDBML representations in subsequent >>>>>>> community use? If we could be a little bit more flexible, we might >>>>>>> be >>>>>>> able to reduce the data interchange software burdens a little. >>>>>>> Right now, this discussion seems headed in the direction of simply >>>>>>> adding yet another data representation (DDLm/CIF2) to the mix, >>>>>>> increasing the chances of mistranslation and confusion, rather >>>>>>> than reducing them. >>>>>>> >>>>>>> Please, step back a bit from the detailed discussion of UTF8 and >>>>>>> look at the work-flow of doing and publishing crystallographic >>>>>>> experiments and let us try to make a contribution that simplifies >>>>>>> it, not one that makes it more complex than it needs to be. >>>>>>> >>>>>>> I suggest we need to meet and talk, either face-to-face, or by skype. >>>>>>> >>>>>>> Regards, >>>>>>> Herbert >>>>>>> >>>>>>> ===================================================== >>>>>>> Herbert J. Bernstein, Professor of Computer Science >>>>>>> Dowling College, Kramer Science Center, KSC 121 >>>>>>> Idle Hour Blvd, Oakdale, NY, 11769 >>>>>>> >>>>>>> +1-631-244-3035 >>>>>>>
yaya at dowling.edu >>>>>>> ===================================================== -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Fri Sep 10 12:25:21 2010 From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Fri, 10 Sep 2010 07:25:21 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear James, > "Note that a CIF2-conformant character stream that forms part of a > larger stream is not constrained to be in UTF8 encoding if the > encoding of the CIF2 stream is specified in a standards-conformant > manner within the enclosing stream. For example, CIF2 content within > an XML file is not constrained to be UTF8-encoded as standard XML > attributes can be used to manage encoding." is almost reasonable, but basically says that it will be easier to handle CIF2 in almost any external container, rather than as itself. I would suggest saying: The description of a conformant CIF2 in terms of a UTF8 encoding is intended to provide clarity in the description of a CIF2, not to prevent use of CIF2 in terms of other encodings, such as UCS-2 Unicode or code-page-based encodings needed for editors on particular systems, nor to prevent use of transformed CIF2 in other containers such as HDF5 and XML or imgCIF/CBF, as long as the decodings/encodings or other transformations that would be necessary to go to and from a UTF8 CIF2 representation are clearly and unambiguously defined. This would bring us back essentially to where we have been for more than a decade with imgCIF/CBF and for nearly 2 decades with CIF1 itself. Regards, Herbert ===================================================== Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Fri, 10 Sep 2010, James Hester wrote: > Thanks Herbert for this detailed information, which is a great help to > me in forming an opinion. Please understand that we are not even > close to considering excluding imgCIF from CIF. Rather, I am > collecting information in order to form an opinion and work with > everybody to find a solution which then goes back to the DDLm group > and then on to COMCIFS regarding CIF2. Speculation about potential > consequences for imgCIF are just part of the information-gathering > process. In general terms, CIF is now a 'framework', which I think > will make bringing XML and HDF5 developments under the CIF umbrella > relatively simple. > > Please also understand that my comments about the usefulness of CBFlib > were in the context of a typical beamline user wishing to handle their > data, rather than from a programmer's point of view. I was not > casting aspersions on CBFlib, rather seeking more information (which > you have provided). > > I am afraid that terminology here may be confusing me: I would like to > talk about imgCIF as a pure ASCII format (eg IT Vol G p 40 para 15) > and CBF as the binary equivalent. However, your previous statements > indicate that imgCIF could also be written in UTF16 encoding. So: > when you speak of the Dectris detector output as 'imgCIF', what > encoding is used? > > The point you make about embedding imgCIF into a text-only format (in > this case XML) is, I agree, a use-case that we have to consider. I > see merit in the position that 'CIF2 content' inside a container is > not constrained by encoding, in those cases where the container is > able to specify the encoding itself. 
This is *pedantically* true > already in that the 'header' of the container file as a whole is *not* > the CIF2 magic header. So: what does everyone think of the following > statement being included in the standard? > > "Note that a CIF2-conformant character stream that forms part of a > larger stream is not constrained to be in UTF8 encoding if the > encoding of the CIF2 stream is specified in a standards-conformant > manner within the enclosing stream. For example, CIF2 content within > an XML file is not constrained to be UTF8-encoded as standard XML > attributes can be used to manage encoding." > > (Perhaps John B, who has shown superior wordsmithing capabilities, > could polish this up a bit?) > > On Fri, Sep 3, 2010 at 11:10 PM, Herbert J. Bernstein > wrote: >> Here is more detail on the use of CBFlib. >> >> I know for sure that CBFlib is used directly by mosflm and adxv. While XDS >> uses code that was prototyped in the Fortran part of CBFlib, they work with >> their own versions. However, Kay Diederichs has also used the CBFlib C code >> for work on simulations. Paul Ellis started HKL2000 off with CBFlib, but I >> don't know if they stayed with it. >> >> As a practical matter, whether someone uses CBFlib itself, it is an >> essential part of the documentation that people use to understand how the >> various compression schemes work, and they use the utility cif2cbf from the >> package both as an external converter and as a validator and as a debugger >> when they don't want to put all the functionality in their own code. If you >> have a funny CBF in any of the semi-infinite number of representations, >> cif2cbf allows you to check it, get a hex dump of it or convert it to a >> specific compression scheme or format that some other program needs to >> process that file. >> >> In other words, CBFlib on its own _is_ useful.
>> >> Sorry about not giving you a list re imgCIF use, I thought you were asking >> me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M >> produces imgCIF as the default. This had been a byte-offset compressed >> binary with a mini-header. Dectris has now moved up to writing a full >> header. There were some beamlines with some of the older smaller Dectris >> detectors that were producing TIFF, but all currently delivered Dectris >> detectors of all sizes produce imgCIF as the default. >> >> All the major detector manufacturers now offer CBF as an option except for >> Bruker which is debugging an optional CBF output. When I checked at the ACA >> meeting in July they all also said that their processing packages can accept >> CBF as an input. >> >> On the XML use, I would suggest a more broad-minded attitude. Judging from >> the workshop I was at in January at ESRF, it has much broader support than >> just from Diamond, especially for spectra which have smaller data volume >> than images. HDF5 is the most widely accepted scientific binary data format >> for the physics community, and XML is the easiest and most reliable way to >> port smaller HDF5 datasets from site to site. The problem with XML is that >> for large files such as crystallographic images ordinary straight-text XML >> produces huge, impractical files. binutf allows for a compromise in which >> you have a true XML UCS-2 file but with the binary having only a 7% >> overhead. >> >> I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2 >> binary sections. If COMCIFS repeats the unfortunate decision of 1997 of >> saying that what the synchrotron community needs can't be called CIF, we'll >> just go back to calling it imgNCIF (which is an acronym for image-not-CIF), >> but we will still have to produce it for the community.
In 1998 after we had >> a face-to-face discussion at a BNL workshop, that decision was reversed and >> what the synchrotron community needed was folded under the CIF umbrella, and >> imgNCIF became imgCIF. I hope we can have discussions now to avoid the need >> for a pointless schism. >> >> Your proposal on the relationship between CIF2 and imgCIF sounds like a >> replay of the discussions we had in 1997, with CIF headers following one >> standard and binary sections following another. You can make that work, but >> it is clumsy and hard for users to work with. It is better if we have one >> simple, comprehensible standard for the files they work with as a whole. >> >> Let me be clear -- imgCIF is produced worldwide and used for thousands of >> images daily. These older "legacy" imgCIF images will be around for a long >> time to come, and whatever new imgCIF (or if you force us to it, imgNCIF) >> images we produce will need to be, and will be, supported by software that >> handles both the legacy and the new images and has a clean interface to HDF5 >> and XML as well. I would greatly prefer that this be coordinated with >> COMCIFS and done in a way that helps the community to understand the >> relationship between CIF and imgCIF, but if COMCIFS feels a need to return >> to its 1997 position and exclude the data we work with from its charge, then >> imgCIF can return to being imgNCIF. >> >> If we are to resolve this, then, as in 1998, we need a meeting or e-meeting. >> Once you have a web-cam, I would suggest you and I have a skype meeting to >> frame the issues in dispute and organize a wider meeting. >> >> -- Herbert >> >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >>
yaya at dowling.edu >> ===================================================== >> >> On Fri, 3 Sep 2010, James Hester wrote: >> >>>> On Fri, 3 Sep 2010, James Hester wrote: >>>> >>>>> Thanks Herbert for providing the imgCIF perspective. >>>>> >>>>> I am unfortunately severely restricted in my ability to attend >>>>> overseas meetings at present, for family and work reasons. ?I am also >>>>> keen to have our discussions written down and available for perusal by >>>>> those that will come later. >>>> >>>> How about an e-meeting? >>> >>> OK, I think we need to try online as my carefully crafted arguments >>> seem to be misunderstood more often than not. >>> Let me buy a web cam first! >>> >>>>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if >>>>> imgCIF is going to influence our decisionmaking. ?Some questions for >>>>> Herbert to answer for the record: >>>>> >>>>> 1. How widely used are non-CBF forms of imgCIF at present? ?By "widely >>>>> used" I mean both >>>>> ?(a) supported by software packages that allow one to do "useful >>>>> work", most obviously to extract diffraction spots >>>> >>>> I assume by "non-CBF" you mean the forms that do the binary sections >>>> in something that is not pure binary -- all software that uses CBFlib >>>> supports them automatically for reading. ?For writing, most software >>>> chooses one representation for writing, usually byte-offset or >>>> packed binary, except when we have to debug -- then the ascii >>>> forms, esp. the hexdump form are very useful. >>> >>> You are correct in interpreting what I mean by "non-CBF". >>> >>> I understand that CBFlib supports everything, but CBFlib on its own is >>> not useful. Do you know approximately what programs use CBFlib? ?I >>> know only of rasmol, but you presumably know of many more. 
>>> >>>>> ?(b) provided as an output format (even optionally) by beamlines or >>>>> detector manufacturers >>> >>>> See above >>> >>> I see nothing in your reply on the availability of imgCIF files from >>> detectors or instruments. >>> >>>>> 2. What is the advantage of having "pure text" image files? ?Why isn't >>>>> a format like CBF more appropriate? >>>> >>>> While I agree, when we deal with people who like XML e.g. the NeXus >>>> form of imgCIF, then we have no choice -- no binary is allowed, so >>>> UCS-2 becomes important. ?Don't ask me to defend XML. ?It is simply a >>>> fact of life. >>> >>> I am guessing that this NeXuS-XML requirement is coming from Diamond, >>> and if this is what they want I can see why you are keen to integrate >>> imgCIF into HDF5, so that HDF5-XML conversion can be carried out the >>> standard HDF5 way, rather than encapsulating the entire imgCIF file as >>> a NeXuS-XML dataset. ?OK: so apart from this relatively recent and >>> frankly crazy-wierd use case, is there any other use-case for >>> pure-text imgCIF? ?Can we regard the "Diamond" case as a >>> beaurocratically-driven kluge that will be resolved via your HDF5 >>> work, leaving no other reason to create a space-efficient CIF2 version >>> of imgCIF? >>> >>>>> 3. What is the problem with a scenario where "pure text" imgCIF >>>>> remains in its current CIF1 form, and CIF2 advances are incorporated >>>>> into the CIF sections of CBF? >>>> >>>> I don't understand this question, nor the assumptions behind it. >>> >>> Let me be less obtuse: >>> I envision a CBF2 format, which is a CBF file with CIF2 instead of >>> CIF1 syntax. ?A corresponding imgCIF2 format exists. We *do not care* >>> about the space-efficiency of these imgCIF2 files. We recommend that >>> all new crystallographic image-handling applications should target >>> CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant. 
>>> Legacy applications, of which there are very few, will be restricted >>> to the original imgCIF, which is very rarely produced in any case >>> (anticipating your answers to my above questions). >>> >>> What are your (Herbert's, anybody else's) thoughts on such a plan? >>> >>>>> Herbert: your work merging a DDL2-based version with DDLm-like >>>>> features in HDF5 format sounds interesting. ?Are you planning to >>>>> present a motivation and/or discussion of this work at some stage? >>>> >>>> This is the subject of some grant applications, so not appropriate for >>>> detailed open discussion in this forum at this time. ?The motivations >>>> are simple -- to satisfy the demands of several major facilities for >>>> easy integration of crytallographic synchrotron images into HDF5-based >>>> data >>>> management systems while preserving access to metadata, and to extend >>>> HDF5 >>>> with relational meta-data access. ?This second aspect is an increasingly >>>> critical need and will go forward in any case. ?If we have >>>> a meeting or e-meeting, I can explain better. >>> >>> OK, I think reading between the lines I see where this is coming from >>> (read your CACM article as well, BTW). ?It'd be good to discuss some >>> of these plans at some stage. >>> >>>>> >>>>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein >>>>> wrote: >>>>>> >>>>>> Dear James, >>>>>> >>>>>> ?I have not been at all reticent -- imgCIF will be very poorly >>>>>> supported >>>>>> by CIF2 as currently proposed. ?Of necessity, imgCIF changes encodings >>>>>> internally -- that it why it uses MIME -- same problem as email with >>>>>> images, same solution. >>>>>> >>>>>> ?Any purely text version has at least a 7% overhead as compared to >>>>>> pure binary. ?Restricting to UTF-8 increases the overhead to at least >>>>>> 50%. >>>>>> We may get away with the 7% (UTF-16). ?The 50% version (UTF-8) will be >>>>>> ignored by the community as unworkable. 
?The most likely to be used >>>>>> version >>>>>> will be the current DDL2-based version with embedded compressed >>>>>> binaries >>>>>> that I am augmenting with DDLm-like features >>>>>> and merging in with HDF5. >>>>>> >>>>>> ?As I noted many months ago, the unfortunate reality is that the >>>>>> current CIF2 effort will not merge well with imgCIF. ?If avoiding >>>>>> a split is a important -- we need a meeting. ?I would suggest >>>>>> involving Bob Sweet and holding it at BNL in conjunction with >>>>>> something relevant to NSLS-II. >>>>>> >>>>>> ?Regards, >>>>>> ? ?Herbert >>>>>> >>>>>> ===================================================== >>>>>> ?Herbert J. Bernstein, Professor of Computer Science >>>>>> ? Dowling College, Kramer Science Center, KSC 121 >>>>>> ? ? ? ?Idle Hour Blvd, Oakdale, NY, 11769 >>>>>> >>>>>> ? ? ? ? ? ? ? ? +1-631-244-3035 >>>>>> ? ? ? ? ? ? ? ? yaya at dowling.edu >>>>>> ===================================================== >>>>>> >>>>>> On Tue, 24 Aug 2010, James Hester wrote: >>>>>> >>>>>>> Hi Herbert: regarding imgCIF, ?I agree that splitting it off is not a >>>>>>> desirable outcome. ?I would like to get an idea of how well imgCIF can >>>>>>> be accommodated under the various encoding proposals currently >>>>>>> floating around, as you have been rather reticent to bring it up. ?My >>>>>>> naive take on things is that a UTF8-only encoding scheme for CIF2 >>>>>>> would not pose significant issues for imgCIF, and a decorated UTF16 >>>>>>> encoding in the style of Scheme B would be even better, and quite >>>>>>> adequate, so imgCIF is not actually presenting any problems and so was >>>>>>> a red herring. >>>>>>> >>>>>>> I'm not sure that face-to-face or Skype discussions are necessarily >>>>>>> going to be more productive. 
?Writing things down, while slower, >>>>>>> allows me at least to collect my thoughts and those of other >>>>>>> participants, and hopefully make a reasoned contribution (my apologies >>>>>>> if I am too long-winded) and as an added bonus those thoughts are >>>>>>> recorded for later reference. ?For example, where would I now find the >>>>>>> background on why a container format for imgCIF is such a bad idea? >>>>>>> Presumably that was all thrashed out in face to face discussions, and >>>>>>> no record now remains. >>>>>>> >>>>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein >>>>>>> wrote: >>>>>>>> >>>>>>>> Dear Colleagues, >>>>>>>> >>>>>>>> ? James' and John's last interchange is so voluminous, I doubt any of >>>>>>>> us has been able to fully appreciate the rich complexity of ideas >>>>>>>> contained therein. ?For example, one of the suggestions far down in >>>>>>>> the text is: >>>>>>>> >>>>>>>> (James now) ?Indeed. ?My intent with this specification was to ensure >>>>>>>> that third parties would be able to recover the encoding. If imgCIF >>>>>>>> is >>>>>>>> going to cause us to make such an open-ended specification, it is >>>>>>>> probably a sign that imgCIF needs to be addressed separately. ?For >>>>>>>> example, should we think about redefining it as a container format, >>>>>>>> with a CIF header and UTF16 body (but still part of the >>>>>>>> "Crystallographic Information Framework")? >>>>>>>> >>>>>>>> The idea of an imgCIF "header" in CIF format and a image in another >>>>>>>> is >>>>>>>> an >>>>>>>> old, well-established, thoroughly discussed, and mistaken idea, >>>>>>>> rejected >>>>>>>> in 1998. ?The handling of multiple images in a single file (e.g. 
>>>>>>>> a jpeg thumbnail and crystal image and a full-size diffraction image) >>>>>>>> requires the ability to switch among encodings within the file -- >>>>>>>> something handled by the current DDL2 and MIME-based imgCIF format >>>>>>>> and >>>>>>>> which would be a serious problem in CIF2 has currently proposed, >>>>>>>> increasing the chances that we will have to move imgCIF entirely into >>>>>>>> HDF5 and abandon the CIF representation entirely, sharing only >>>>>>>> the dictionary and not the framework. >>>>>>>> >>>>>>>> If you look carefully, you will see a similar trend with mmCIF, in >>>>>>>> which >>>>>>>> and XML representation sharing the dictionary plays a much more >>>>>>>> important role than the CIF format. >>>>>>>> >>>>>>>> Is it really desirable to make the new CIF format so rigid and >>>>>>>> unadaptable that major portions of macromolecular crysallography >>>>>>>> end up migrating to very different formats, as they already are >>>>>>>> doing? ?Yes, there is great value in having a common dictionary, >>>>>>>> but would there not be additional value in having a sufficiently >>>>>>>> flexible common format to allow for more software sharing than >>>>>>>> we now have? ?It is really desirable for us to continue in the >>>>>>>> direction of a single macromolecular experiment having to >>>>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data >>>>>>>> during collection, CCP4-style CIF representations during processing >>>>>>>> and deposition and legacy PDB and PDBML representations in subsequent >>>>>>>> community use? ?If we could be a little bit more flexible, we might >>>>>>>> be >>>>>>>> able to reduce the data interchange software burdens a little. >>>>>>>> Right now, this discussion seems headed in the direction of simply >>>>>>>> adding yet another data representation (DDLm/CIF2) to the mix, >>>>>>>> increasing the chances of mistranslation and confusion, rather >>>>>>>> that reducing them. 
>>>>>>>> >>>>>>>> Please, step back a bit from the detailed discussion of UTF8 and >>>>>>>> look at the work-flow of doing and publishing crystallographic >>>>>>>> experiments and let us try to make a contribution that simplifies >>>>>>>> it, not one that makes it more complex than it needs to be. >>>>>>>> >>>>>>>> I suggest we need to meet and talk, either face-to-face, or by skype. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Herbert >>>>>>>> >>>>>>>> ===================================================== >>>>>>>> Herbert J. Bernstein, Professor of Computer Science >>>>>>>> Dowling College, Kramer Science Center, KSC 121 >>>>>>>> Idle Hour Blvd, Oakdale, NY, 11769 >>>>>>>> >>>>>>>> +1-631-244-3035 >>>>>>>> yaya at dowling.edu >>>>>>>> ===================================================== From John.Bollinger at STJUDE.ORG Fri Sep 10 15:47:28 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Fri, 10 Sep 2010 09:47:28 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> Hello All, On Friday, September 10, 2010 6:25 AM, Herbert J. Bernstein wrote: [James Hester wrote:] >> "Note that a CIF2-conformant character stream that forms part of a >> larger stream is not constrained to be in UTF8 encoding if the >> encoding of the CIF2 stream is specified in a standards-conformant >> manner within the enclosing stream. For example, CIF2 content within >> an XML file is not constrained to be UTF8-encoded as standard XML >> attributes can be used to manage encoding." > >is almost reasonable, but basically says that it will be easier to handle CIF2 in almost any external container, rather than as itself. >I would suggest saying:
> >The description of a conformant CIF2 in terms of a UTF8 encoding is intended to provide clarity in the description of a CIF2, not to prevent use of CIF2 in terms of other encodings, such as UCS-2 Unicode or code-page-based encodings needed for editors in particular systems, nor to prevent use of transformed CIF2 in other containers such as HDF5 and XML or imgCIF/CBF, as long as the decodings/encodings or other transformations that would be necessary to go to and from a UTF8 CIF2 representation are clearly and unambiguously defined. I think this matter would be best addressed by explicitly adopting an idea that we have discussed before: a formal separation between the definition of CIF text (i.e. James's "CIF2-conformant character stream") and the particular kind of packaging that we are accustomed to calling "a CIF" or "a CIF file". James's suggestion implies such a separation anyway, so let's not do it halfway. Given such a separation, the explanatory comment could be as simple as: "This specification's definition of the 'CIF File' serialization form for CIF2 text is not intended to preclude definition or use of other serialization forms, such as HDF5-based forms, XML-based forms, or imgCIF/CBF." I choose the term "serialization form" because it puts primary emphasis on the CIF text (which after all is the subject of the bulk of the specification). Every correct serialization of CIF text is, by definition, transformable into CIF text form. There remains a minor question of which CIF details are considered part of the serialization form, and which are integral to the CIF text. Character encoding and initial BOM (however we feel about that) are surely part of the serialization form. Including end-of-line conventions as a serialization detail might also be convenient. The biggest question for me is how to categorize the CIF version comment. I am inclined to make it part of the serialization form, but I can see arguments both ways.
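A minimal sketch of the separation between CIF text and its serialization form, assuming CIF text is modelled as a plain Unicode string and a serialization form is any reversible mapping from that string to bytes. The function names and the sample data block are hypothetical, chosen only for illustration; nothing here is taken from a CIF specification or library.

```python
# Hypothetical illustration: "CIF text" is a sequence of Unicode characters;
# a "serialization form" is any reversible mapping of that text to bytes.

# Sample content is made up for this sketch (note the non-ASCII alpha).
CIF_TEXT = "data_example\n_cell_length_a 5.431\n_chemical_name_common 'α-quartz'\n"


def serialize(text: str, encoding: str) -> bytes:
    """Map CIF text to bytes; the encoding (and any BOM) is a serialization detail."""
    return text.encode(encoding)


def deserialize(data: bytes, encoding: str) -> str:
    """Recover the CIF text; the encoding has to be known from a declaration,
    the enclosing container, or an agreed default."""
    return data.decode(encoding)


def round_trips(text: str, encoding: str) -> bool:
    """True if serializing and then deserializing returns the original text."""
    return deserialize(serialize(text, encoding), encoding) == text


if __name__ == "__main__":
    # Python's "utf-16" codec writes a BOM when encoding and consumes it when decoding.
    for enc in ("utf-8", "utf-16"):
        data = serialize(CIF_TEXT, enc)
        print(enc, len(data), "bytes, reversible:", round_trips(CIF_TEXT, enc))
```

The two byte streams differ, yet each decodes to the same character sequence. That is the sense in which encoding and BOM would belong to the serialization form rather than to the CIF text.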
Regards, John Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Fri Sep 10 17:02:29 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Fri, 10 Sep 2010 12:02:29 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: As I have said before, we went through this approach in 1997 and ended up going the other way -- treating the text-based CIF and the binary CBF as parts of the _same_ format, not two different formats, not one being a serialization of the other, but the same format. This may seem like a minor distinction, but it actually has strong implications for software design and implementation, ensuring that binaries in a CIF context are just a particular type of data handled with all the same mechanisms as ASCII data, allowing, for example, multiple diffraction images and thumbnails in one file in an order-independent way. You may be interested to know that the false dichotomy between binary and text-based representations is now starting to impact HDF5, requiring some significant effort to now work in database access, an aspect CIF1 supports -- why throw it away for CIF2? ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Fri, 10 Sep 2010, Bollinger, John C wrote: > Hello All, > > On Friday, September 10, 2010 6:25 AM, Herbert J.
Bernstein wrote: > [James Hester wrote:] >>> "Note that a CIF2-conformant character stream that forms part of a >>> larger stream is not constrained to be in UTF8 encoding if the >>> encoding of the CIF2 stream is specified in a standards-conformant >>> manner within the enclosing stream. For example, CIF2 content within >>> an XML file is not constrained to be UTF8-encoded as standard XML >>> attributes can be used to manage encoding." >> >> is almost reasonable, but basically says that it will be easier to handle CIF2 in almost any external container, rather than as itself. >> I would suggest saying: >> >> The description of a conformant CIF2 in terms of a UTF8 encoding is intended to provide clarity in the description of a CIF2, not to prevent use of CIF2 in terms of other encodings, such as UCS-2 Unicode or code-page-based encodings needed for editors in particular systems, nor to prevent use of transformed CIF2 in other containers such as HDF5 and XML or imgCIF/CBF, as long as the decodings/encodings or other transformations that would be necessary to go to and from a UTF8 CIF2 representation are clearly and unambiguously defined. > > I think this matter would be best addressed by explicitly adopting an idea that we have discussed before: a formal separation between the definition of CIF text (i.e. James's "CIF2-conformant character stream") and the particular kind of packaging that we are accustomed to calling "a CIF" or "a CIF file". James's suggestion implies such a separation anyway, so let's not do it halfway. Given such a separation, the explanatory comment could be as simple as: > > "This specification's definition of the 'CIF File' serialization form for CIF2 text is not intended to preclude definition or use of other serialization forms, such as HDF5-based forms, XML-based forms, or imgCIF/CBF."
> > I choose the term "serialization form" because it puts primary emphasis on the CIF text (which after all is the subject of the bulk of the specification). Every correct serialization of CIF text is, by definition, transformable into CIF text form. > > > There remains a minor question of which CIF details are considered part of the serialization form, and which are integral to the CIF text. Character encoding and initial BOM (however we feel about that) are surely part of the serialization form. Including end-of-line conventions as a serialization detail might also be convenient. The biggest question for me is how to categorize the CIF version comment. I am inclined to make it part of the serialization form, but I can see arguments both ways. > > > Regards, > > John > > > Email Disclaimer: www.stjude.org/emaildisclaimer From John.Bollinger at STJUDE.ORG Fri Sep 10 19:24:05 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Fri, 10 Sep 2010 13:24:05 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> On Friday, September 10, 2010 11:02 AM, Herbert J. Bernstein wrote: >As I have said before, we went through this approach >in 1997 and ended up going the other way -- treating the >text-based CIF and the binary CBF as parts of the _same_ >format, not two different formats, not one being a serialization >of the other, but the same format.
This may seem like a >minor distinction, but it actually has strong implications >for software design and implementation, ensuring that >binaries in a CIF context are just a particular type of data >handled with all the same mechanisms as ASCII data, allowing, >for example, multiple diffraction images and thumbnails in >one file in an order-independent way. > >You may be interested to know that the false dichotomy between >binary and text-based representations is now starting >to impact HDF5, requiring some significant effort to now >work in database access, an aspect CIF1 supports -- why >throw it away for CIF2? Herb, Perhaps you're reading more into my comments than I intended to put there. In particular, I did not aim to suggest one on-disk/wire format should be a serialization of another, but rather that *all* on-disk/wire formats be characterized in terms of serialization of the Unicode character sequences described by most of the spec. I meant "text" in that sense -- a sequence of Unicode characters -- not in the sense of a sequence of bytes conforming to some particular set of local conventions for text. I meant "serialization" in the general sense of any reversible transformation of CIF text into a byte sequence, including those that rely on interpreting the CIF syntax. That's aimed primarily at recognizing the use case in which CIF2 is embedded in or transformed into some other format, such as XML. I postulate, but do not specify, a serialization form defining the CIF2 version of what we have conventionally called "a CIF." The details of that form are exactly what this list was established to discuss, and I did not intend to imply a particular resolution of our ongoing debate. It was perhaps a mistake to include imgCIF/CBF on the list of possible alternative serialization forms, as it is far from settled whether it will fit under the umbrella of the 'CIF File' serialization form. I apologize if that caused confusion. [...
I wrote:] >> I think this matter would be best addressed by explicitly adopting an idea that we have discussed before: a formal separation between the definition of CIF text (i.e. James's "CIF2-conformant character stream") and the particular kind of packaging that we are accustomed to calling "a CIF" or "a CIF file". James's suggestion implies such a separation anyway, so let's not do it halfway. Given such a separation, the explanatory comment could be as simple as: >> >> "This specification's definition of the 'CIF File' serialization form for CIF2 text is not intended to preclude definition or use of other serialization forms, such as HDF5-based forms, XML-based forms, or imgCIF/CBF." >> >> I choose the term "serialization form" because it puts primary emphasis on the CIF text (which after all is the subject of the bulk of the specification). Every correct serialization of CIF text is, by definition, transformable into CIF text form. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From simonwestrip at btinternet.com Sat Sep 11 09:59:54 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Sat, 11 Sep 2010 08:59:54 +0000 (GMT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <856929.33676.qm@web87013.mail.ird.yahoo.com> Dear all I have found recent exchanges, especially Herbert's contributions regarding the real-world use of imgCIF, very enlightening. 
Primarily for reasons of flexibility, I now find myself inclined to support a CIF specification that allows a variety of encodings, provided that such are "clearly and unambiguously defined".

To me, the clear and unambiguous definition should encompass a clear and unambiguous *declaration* of the encoding; in the absence of such a declaration in the CIF or in its container, a default encoding should be assumed, either the default CIF encoding (which I think most agree should be UTF8) or inherited from the container?

Though CIF1 has been successful without such a declaration (largely because of the ASCII restriction), I believe it is essential in the case of CIF2.

Cheers

Simon

From yaya at bernstein-plus-sons.com Sat Sep 11 15:33:09 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Sat, 11 Sep 2010 10:33:09 -0400 (EDT)
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To: <856929.33676.qm@web87013.mail.ird.yahoo.com>
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com>
Message-ID:

Dear Colleagues,

Let me propose what I think would be a reasonable resolution:

1. We come to a final resolution on what _information_ is in CIF2, independent of the representation used. I think we have that in hand.

2. We present one UTF-8 based _representation_ of that information for two essential purposes:
  2.1. To have a concrete way in which to present examples of CIF2; and
  2.2. To have a default assumed representation in which a CIF2 the representation of which is not otherwise identified is most likely to have been presented.

3.
That we suggest some reasonable mechanisms for helping software developers and users to determine which of the very large number of possible representations has been used for a given file, including, but not limited to:
  BOM
  Magic number
  Extended identifying comments
  Encoding tags in the file itself
with sufficient detail to allow developers to get started, but with a final decision deferred on everything other than the BOM to allow for broad-based community discussion of what is clearly a contentious issue. The BOM is going to be in _any_ final list because it is well-supported by several existing text editors, and keeps getting forced into files without any user control.

Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

From jamesrhester at gmail.com Mon Sep 13 05:25:31 2010
From: jamesrhester at gmail.com (James Hester)
Date: Mon, 13 Sep 2010 14:25:31 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

I find John's approach in terms of 'serialisation forms' reasonable and acceptable insofar as it takes care of those situations where CIF content is contained within something else. This formulation, if adopted, would solve the particular real-life use case that Herbert put forward, of embedding an imgCIF in an XML file. Herbert: do you have any other real-life use cases of imgCIF that this solution does not address?

I would note that adoption of the 'serialisation form' approach would also immediately provide a somewhat hackish workaround for those requiring non UTF8 encoding: simply embed the CIF material in an XML file and use XML encoding switches. Perhaps these sorts of hacks are what Herbert means by "it will be easier to handle CIF2 in almost any external container".

John's formulation is also less restrictive than my original proposal in that it does not require the container to be able to handle encoding, but I'm prepared to leave that part out of the spec and simply add a note to remind readers that they should consider this issue.
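
A minimal sketch of the XML-embedding workaround mentioned above, offered only as an illustration and not as anything the list agreed on. It assumes a made-up wrapper element named `cif` and made-up data names; the only point is that the XML declaration, rather than the CIF text, carries the encoding, so a non-UTF8 byte stream still round-trips to the same Unicode character sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIF2 text; the data block and data name are illustrative only.
CIF_TEXT = "#\\#CIF_2.0\ndata_example\n_example.note 'd = 1.54 Å'\n"


def wrap_in_xml(cif_text: str, encoding: str) -> bytes:
    """Embed CIF text in a hypothetical <cif> element, serialised in `encoding`.

    For encodings other than UTF-8/US-ASCII, ElementTree writes an XML
    declaration naming the encoding, which is the "encoding switch" here.
    """
    root = ET.Element("cif")
    root.text = cif_text
    return ET.tostring(root, encoding=encoding)


def unwrap_from_xml(payload: bytes) -> str:
    """Recover the CIF text; the XML parser reads the declared encoding."""
    return ET.fromstring(payload).text or ""


if __name__ == "__main__":
    payload = wrap_in_xml(CIF_TEXT, "iso-8859-1")
    print(payload)
    print(unwrap_from_xml(payload) == CIF_TEXT)
```

Note that the recovered text is a character sequence, not bytes: whatever consumes it still has to decide how to serialise it again, which is the open issue for plain CIF files.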
To my mind, the encoding of plain CIF files remains an open issue. I do not view the mechanisms for managing file encoding that are provided by current OSs to be sufficiently robust, widespread or consistent that we can rely on developers or text editors respecting them, so we require something like Scheme B for all files (not only at the point of transfer to another OS).

On Sat, Sep 11, 2010 at 12:47 AM, Bollinger, John C wrote:
> Hello All,
>
> On Friday, September 10, 2010 6:25 AM, Herbert J. Bernstein wrote:
> [James Hester wrote:]
>>> "Note that a CIF2-conformant character stream that forms part of a
>>> larger stream is not constrained to be in UTF8 encoding if the
>>> encoding of the CIF2 stream is specified in a standards-conformant
>>> manner within the enclosing stream. For example, CIF2 content within
>>> an XML file is not constrained to be UTF8-encoded as standard XML
>>> attributes can be used to manage encoding."
>>
>> is almost reasonable, but basically says that it will be easier to handle CIF2 in almost any external container, rather than as itself.
>> I would suggest saying.
>>
>> The description of a conformant CIF2 in terms of a UTF8 encoding is intended to provide clarity in the description of a CIF2, not to prevent use of CIF2 in terms of other encodings, such as UCS-2 unicode or code-page-based encodings needed for editors in particular systems, nor to prevent use of transformed CIF2 in other containers such as HDF5 and XML or imgCIF/CBF, as long as the decodings/encoding or other transformations that would be necessary to go to and from a UTF8 CIF2 representation are clearly and unambiguously defined.
>
> I think this matter would be best addressed by explicitly adopting an idea that we have discussed before: a formal separation between the definition of CIF text (i.e. James's "CIF2-conformant character stream") and the particular kind of packaging that we are accustomed to calling "a CIF" or "a CIF file". James's suggestion implies such a separation anyway, so let's not do it halfway. Given such a separation, the explanatory comment could be as simple as:
>
> "This specification's definition of the 'CIF File' serialization form for CIF2 text is not intended to preclude definition or use of other serialization forms, such as HDF5-based forms, XML-based forms, or imgCIF/CBF."
>
> I choose the term "serialization form" because it puts primary emphasis on the CIF text (which after all is the subject of the bulk of the specification). Every correct serialization of CIF text is, by definition, transformable into CIF text form.
>
> There remains a minor question of which CIF details are considered part of the serialization form, and which are integral to the CIF text. Character encoding and initial BOM (however we feel about that) are surely part of the serialization form. Including end-of-line conventions as a serialization detail might also be convenient. The biggest question for me is how to categorize the CIF version comment. I am inclined to make it part of the serialization form, but I can see arguments both ways.
>
> Regards,
>
> John

From jamesrhester at gmail.com Mon Sep 13 05:32:41 2010
From: jamesrhester at gmail.com (James Hester)
Date: Mon, 13 Sep 2010 14:32:41 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com>
Message-ID:

Your step 3 is what we are discussing here.
It is not sufficient to simply "suggest" reasonable mechanisms, as this leaves developers bewildered as to which of these potentially vague suggestions they can and should support, leading to confusion and inability of programs to communicate with one another. Far better to *specify* mechanisms, which is what we are groping towards doing here. Herbert: you have not reacted to a suggestion that we simply reserve the first line of a CIF2 file for future expansion, and state that non UTF8 encodings for CIF2 would be considered by COMCIFS as the need arose. On Sun, Sep 12, 2010 at 12:33 AM, Herbert J. Bernstein wrote: > Dear Colleagues, > > ?Let me propose what I think would be a reasonable resolution: > > 1. ?We come to a final resolution on what _information_ is in CIF2, > independent of the representation used. ?I think we have that in hand. > > 2. ?We present one UTF-8 based _representation_ of that information > for two essential purposes: > ?2.1. ?To have a concrete way in which to present examples of CIF2; and > ?2.2. ?To have a default assumed representation in which a CIF2 the > representation of which is not otherwise identified is most likely > to have been presented. > > 3. ?That we suggest some reasonable mechanisms for helping software > developers and users to determine which of the very large number of > possible reprentations has been used for a given file, including, > but not limited to: > ?BOM > ?Magic number > ?Extended idenfifying comments > ?Encoding tags in the file itself with sufficient detail to allow developers > to get started, but > with a final decision deferred on everything other than the BOM > to allow for broad-based community discussion of what is clearly > a contentious issue. ?The BOM is going to be in _any_ final list > because is is well-supported by several existing text > editors, and keeps getting forced into files without any user > control. 
> > Regards, > ?Herbert > > ===================================================== > ?Herbert J. Bernstein, Professor of Computer Science > ? Dowling College, Kramer Science Center, KSC 121 > ? ? ? ?Idle Hour Blvd, Oakdale, NY, 11769 > > ? ? ? ? ? ? ? ? +1-631-244-3035 > ? ? ? ? ? ? ? ? yaya at dowling.edu > ===================================================== > > On Sat, 11 Sep 2010, SIMON WESTRIP wrote: > >> Dear all >> >> I have found recent exchanges, especially Herbert's contributions >> regarding >> the real-world use of imgCIF, very >> enlightening. Primarily for reasons of flexibility, I now find myself >> inclined to support a CIF specification >> that allows a variety of encodings, provided that such are "clearly and >> unambiguously defined". >> >> To me, the clear and unambiguous definition should encompass a clear and >> unambiguous *declaration* >> ?of the encoding; in the absence of such a declaration in the CIF or in >> its >> container, a default encoding >> should be assummed, either the default CIF encoding (which I think most >> agree should be UTF8) or inherited >> from the container? >> >> Though CIF1 has been successful without such a declaration (largely >> because >> of the ASCII restriction), >> I beleive it is essential in the case of CIF2. >> >> Cheers >> >> Simon >> >> >> >> >> >> >> >> >> ____________________________________________________________________________ >> From: "Bollinger, John C" >> To: Group for discussing encoding and content validation schemes for CIF2 >> >> Sent: Friday, 10 September, 2010 19:24:05 >> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. >> . >> >> >> On Friday, September 10, 2010 11:02 AM, Herbert J. 
Bernstein wrote: >> >As I have said before, we went through this approach >> >in 1997 and ended up going the other way -- treating the >> >text-based CIF and the binary CBF as parts of the _same_ >> >format, not two different formats, not one being a serialization >> >of the other, but the same format.? This may seem like a >> >minor distinction, but it actually has strong implications >> >for software design and implementation, ensuring that >> >binaries in a CIF context are just a particular type of data >> >handled with all the same mecnahisms as ASCII data, allowing, >> >for example, multiple diffraction images and thumbnails in >> >one file in an order-independent way. >> > >> >You may be interested to know that the false dichotomy between >> >binary and text-based representations is not starting >> >to imapct HDF5, requiring some significant effort to now >> >work in database access, an aspect CIF1 supports -- why >> >throw it away for CIF2? >> >> Herb, >> >> Perhaps you're reading more into my comments than I intended to put >> there. >> In particular, I did not aim to suggest one on-disk/wire format should be >> a >> serialization of another, but rather that *all* on-disk/wire formats be >> characterized in terms of serialization of the Unicode character sequences >> described by most of the spec.? I meant "text" in that sense -- a sequence >> of Unicode characters -- not in the sense of a sequence of bytes >> conforming >> to some particular set of local conventions for text.? I meant >> "serialization" in the general sense of any reversible transformation of >> CIF >> text into a byte sequence, including those that rely on interpreting the >> CIF >> syntax.? That's aimed primarily at recognizing the use case in which CIF2 >> is >> embedded in or transformed into some other format, such as XML. >> >> I postulate, but do not specify, a serialization form defining the CIF2 >> version of what we have conventionally called "a CIF."? 
The details of >> that >> form are exactly what this list was established to discuss, and I did not >> intend to imply a particular resolution of our ongoing debate.? It was >> perhaps a mistake to include imgCIF/CBF on the list of possible >> alternative >> serialization forms, as it is far from settled whether it will fit under >> the >> umbrella of the 'CIF File' serialization form.? I apologize if that caused >> confusion. >> >> [... I wrote:] >> >> I think this matter would be best addressed by explicitly adopting an >> idea that we have discussed before: a formal separation between the >> definition of CIF text (i.e. James's "CIF2-conformant character stream") >> and >> the particular kind of packaging that we are accustomed to calling "a CIF" >> or "a CIF file".? James's suggestion implies such a separation anyway, so >> let's not do it halfway.? Given such a separation, the explanatory comment >> could be as simple as: >> >> >> >> "This specification's definition of the 'CIF File' serialization form >> >> for >> CIF2 text is not intended to preclude definition or use of other >> serialization forms, such as HDF5-based forms, XML-based forms, or >> imgCIF/CBF." >> >> >> >> I choose the term "serialization form" because it puts primary emphasis >> on the CIF text (which after all is the subject of the bulk of the >> specification).? Every correct serialization of CIF text is, by >> definition, >> transformable into CIF text form. >> >> >> Regards, >> >> John >> -- >> John C. Bollinger, Ph.D. >> Department of Structural Biology >> St. Jude Children's Research Hospital >> >> >> >> Email Disclaimer:? 
www.stjude.org/emaildisclaimer >> >> _______________________________________________ >> cif2-encoding mailing list >> cif2-encoding at iucr.org >> http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From jamesrhester at gmail.com Mon Sep 13 06:24:42 2010 From: jamesrhester at gmail.com (James Hester) Date: Mon, 13 Sep 2010 15:24:42 +1000 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: <856929.33676.qm@web87013.mail.ird.yahoo.com> References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com> Message-ID: Hi Simon: the issue with such an encoding declaration is that it is not supported by generic text tools, and so would not be automatically inserted, updated or respected when creating, editing (ie open in one encoding, save in another) or transcoding a CIF2 file. This means it has no status beyond a hint that could cause as many problems as it solves. Such a declaration becomes more robust if accompanied by the checksum that John B suggested. The checksum gives some guarantee that the encoding has been checked by a CIF-aware program. If you are proposing that such a declaration and checksum be mandatory for all non-UTF8 CIF2 files (not only during transfer), I agree with you that this would be acceptable. On Sat, Sep 11, 2010 at 6:59 PM, SIMON WESTRIP wrote: > Dear all > > I have found recent exchanges, especially Herbert's contributions regarding > the real-world use of imgCIF, very > enlightening. 
Primarily for reasons of flexibility, I now find myself > inclined to support a CIF specification > that allows a variety of encodings, provided that such are "clearly and > unambiguously defined". > > To me, the clear and unambiguous definition should encompass a clear and > unambiguous *declaration* > ?of the encoding; in the absence of such a declaration in the CIF or in its > container, a default encoding > should be assummed, either the default CIF encoding (which I think most > agree should be UTF8) or inherited > from the container? > > Though CIF1 has been successful without such a declaration (largely because > of the ASCII restriction), > I beleive it is essential in the case of CIF2. > > Cheers > > Simon > > > > > > > > ________________________________ > From: "Bollinger, John C" > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Friday, 10 September, 2010 19:24:05 > Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . > > > On Friday, September 10, 2010 11:02 AM, Herbert J. Bernstein wrote: >>As I have said before, we went through this approach >>in 1997 and ended up going the other way -- treating the >>text-based CIF and the binary CBF as parts of the _same_ >>format, not two different formats, not one being a serialization >>of the other, but the same format.? This may seem like a >>minor distinction, but it actually has strong implications >>for software design and implementation, ensuring that >>binaries in a CIF context are just a particular type of data >>handled with all the same mecnahisms as ASCII data, allowing, >>for example, multiple diffraction images and thumbnails in >>one file in an order-independent way. >> >>You may be interested to know that the false dichotomy between >>binary and text-based representations is not starting >>to imapct HDF5, requiring some significant effort to now >>work in database access, an aspect CIF1 supports -- why >>throw it away for CIF2? 
> > Herb, > > Perhaps you're reading more into my comments than I intended to put there. > In particular, I did not aim to suggest one on-disk/wire format should be a > serialization of another, but rather that *all* on-disk/wire formats be > characterized in terms of serialization of the Unicode character sequences > described by most of the spec.? I meant "text" in that sense -- a sequence > of Unicode characters -- not in the sense of a sequence of bytes conforming > to some particular set of local conventions for text.? I meant > "serialization" in the general sense of any reversible transformation of CIF > text into a byte sequence, including those that rely on interpreting the CIF > syntax.? That's aimed primarily at recognizing the use case in which CIF2 is > embedded in or transformed into some other format, such as XML. > > I postulate, but do not specify, a serialization form defining the CIF2 > version of what we have conventionally called "a CIF."? The details of that > form are exactly what this list was established to discuss, and I did not > intend to imply a particular resolution of our ongoing debate.? It was > perhaps a mistake to include imgCIF/CBF on the list of possible alternative > serialization forms, as it is far from settled whether it will fit under the > umbrella of the 'CIF File' serialization form.? I apologize if that caused > confusion. > > [... I wrote:] >>> I think this matter would be best addressed by explicitly adopting an >>> idea that we have discussed before: a formal separation between the >>> definition of CIF text (i.e. James's "CIF2-conformant character stream") and >>> the particular kind of packaging that we are accustomed to calling "a CIF" >>> or "a CIF file".? James's suggestion implies such a separation anyway, so >>> let's not do it halfway.? 
Given such a separation, the explanatory comment >>> could be as simple as: >>> >>> "This specification's definition of the 'CIF File' serialization form for >>> CIF2 text is not intended to preclude definition or use of other >>> serialization forms, such as HDF5-based forms, XML-based forms, or >>> imgCIF/CBF." >>> >>> I choose the term "serialization form" because it puts primary emphasis >>> on the CIF text (which after all is the subject of the bulk of the >>> specification). Every correct serialization of CIF text is, by definition, >>> transformable into CIF text form. > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148
From simonwestrip at btinternet.com Mon Sep 13 11:05:12 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Mon, 13 Sep 2010 10:05:12 +0000 (GMT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com> Message-ID: <427596.65604.qm@web87014.mail.ird.yahoo.com> Yes - I believe that such a declaration should be mandatory for all non-UTF8 CIF2 files, and agree that a supporting checksum mechanism would be very useful to CIF2-aware programs. Until I've revisited the checksum scheme, I cannot say that the checksum should be mandatory too. For example, if mandatory, does that mean it becomes impossible to create a non-UTF8 CIF without using CIF2-aware software? I need to review the discussions on checksums and indeed the various forms that such a declaration might take, but I do believe in the principle that it should be mandatory for all 'stand-alone' non-UTF8 CIF2 files. If a CIF is packaged in a container, then it will be the job of non-CIF software to retrieve it from the container and deliver it in its original form. So a non-UTF8 CIF packaged in a non-UTF8 container (or even a UTF8 container) should still carry its non-UTF8 declaration. Cheers Simon ________________________________ From: James Hester To: Group for discussing encoding and content validation schemes for CIF2 Sent: Monday, 13 September, 2010 6:24:42 Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . Hi Simon: the issue with such an encoding declaration is that it is not supported by generic text tools, and so would not be automatically inserted, updated or respected when creating, editing (i.e. open in one encoding, save in another) or transcoding a CIF2 file. This means it has no status beyond a hint that could cause as many problems as it solves.
Such a declaration becomes more robust if accompanied by the checksum that John B suggested. The checksum gives some guarantee that the encoding has been checked by a CIF-aware program. If you are proposing that such a declaration and checksum be mandatory for all non-UTF8 CIF2 files (not only during transfer), I agree with you that this would be acceptable.
From simonwestrip at btinternet.com Mon Sep 13 11:22:30 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Mon, 13 Sep 2010 03:22:30 -0700 (PDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: <427596.65604.qm@web87014.mail.ird.yahoo.com> References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com> <427596.65604.qm@web87014.mail.ird.yahoo.com> Message-ID: <823366.90417.qm@web87010.mail.ird.yahoo.com> I questioned: "For example, if mandatory, does that mean it becomes impossible to create a non-UTF8 CIF without using CIF2-aware software?" In some respects this might not be a bad idea - i.e.restricting the use of non-UTF8 to CIF2-aware systems... Simon (thinking aloud)
From simonwestrip at btinternet.com Mon Sep 13 11:31:00 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Mon, 13 Sep 2010 03:31:00 -0700 (PDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com> Message-ID: <709653.62097.qm@web87003.mail.ird.yahoo.com> I questioned: "For example, if mandatory, does that mean it becomes impossible to create a non-UTF8 CIF without using CIF2-aware software?" In some respects this might not be a bad idea - i.e.restricting the use of non-UTF8 to CIF2-aware systems... Simon (thinking aloud)
From yaya at bernstein-plus-sons.com Mon Sep 13 12:19:05 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Mon, 13 Sep 2010 07:19:05 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com> Message-ID: Dear Colleagues, I guess I do not understand the role of COMCIFS. It appears that some of us think that COMCIFS has the power to control what people do. I disagree. We don't. We cannot _require_ anything. We can make reasonable suggestions, and, if they are indeed reasonable, many people will follow those suggestions. If, as in the case of requiring people not to use ordinary text editors to edit a CIF, what is being suggested is an inconvenient nuisance, we will simply be ignored, and we will have a large supply of non-compliant, unidentified pseudo-CIF2 files. >> Let me propose what I think would be a reasonable resolution: >> >> 1.
We come to a final resolution on what _information_ is in CIF2, >> independent of the representation used. I think we have that in hand. >> >> 2. We present one UTF-8 based _representation_ of that information >> for two essential purposes: >> 2.1. To have a concrete way in which to present examples of CIF2; and >> 2.2. To have a default assumed representation in which a CIF2 the >> representation of which is not otherwise identified is most likely >> to have been presented. >> >> 3. That we suggest some reasonable mechanisms for helping software >> developers and users to determine which of the very large number of >> possible representations has been used for a given file, including, >> but not limited to: >> BOM >> Magic number >> Extended identifying comments >> Encoding tags in the file itself with sufficient detail to allow developers >> to get started, but >> with a final decision deferred on everything other than the BOM >> to allow for broad-based community discussion of what is clearly >> a contentious issue. The BOM is going to be in _any_ final list >> because it is well-supported by several existing text >> editors, and keeps getting forced into files without any user >> control. is about as far as we can go and have any hope of any degree of compliance. To be blunt and specific -- if we _mandate_ a checksum in the file, we will just end up with lots of files with checksums that don't agree. Our only hope is to recommend a checksum, and, most importantly, provide widely supported software to calculate it both for automatic insertions and for validation. as for > Herbert: you have not reacted to a suggestion that we simply reserve > the first line of a CIF2 file for future expansion, and state that non > UTF8 encodings for CIF2 would be considered by COMCIFS as the need > arose. this brings us back to being unreasonably rigid and fussy and certain to be ignored.
I cannot figure out what "reserve the first line of a CIF2 file" means in practice, and "non UTF8 encodings ... be considered by COMCIFS" has no practical meaning for what a poor user or software developer in, say, a code-page or UCS2 environment is supposed to do now. At this rate, CIF2 will _never_ be adopted (and first we have to propose something), and that is a great shame. I believe that what I suggested above is as far as we can go if we want to stop debating how many angels can dance on the head of a pin and get on with having a CIF2 for real people to use with real data and real software. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Mon, 13 Sep 2010, James Hester wrote: > Your step 3 is what we are discussing here. It is not sufficient to > simply "suggest" reasonable mechanisms, as this leaves developers > bewildered as to which of these potentially vague suggestions they can > and should support, leading to confusion and inability of programs to > communicate with one another. Far better to *specify* mechanisms, > which is what we are groping towards doing here. > > Herbert: you have not reacted to a suggestion that we simply reserve > the first line of a CIF2 file for future expansion, and state that non > UTF8 encodings for CIF2 would be considered by COMCIFS as the need > arose. > > On Sun, Sep 12, 2010 at 12:33 AM, Herbert J. Bernstein > wrote: >> Dear Colleagues, >> >> Let me propose what I think would be a reasonable resolution: >> >> 1. We come to a final resolution on what _information_ is in CIF2, >> independent of the representation used. I think we have that in hand. >> >> 2. We present one UTF-8 based _representation_ of that information >> for two essential purposes: >> 2.1.
?To have a concrete way in which to present examples of CIF2; and >> ?2.2. ?To have a default assumed representation in which a CIF2 the >> representation of which is not otherwise identified is most likely >> to have been presented. >> >> 3. ?That we suggest some reasonable mechanisms for helping software >> developers and users to determine which of the very large number of >> possible reprentations has been used for a given file, including, >> but not limited to: >> ?BOM >> ?Magic number >> ?Extended idenfifying comments >> ?Encoding tags in the file itself with sufficient detail to allow developers >> to get started, but >> with a final decision deferred on everything other than the BOM >> to allow for broad-based community discussion of what is clearly >> a contentious issue. ?The BOM is going to be in _any_ final list >> because is is well-supported by several existing text >> editors, and keeps getting forced into files without any user >> control. >> >> Regards, >> ?Herbert >> >> ===================================================== >> ?Herbert J. Bernstein, Professor of Computer Science >> ? Dowling College, Kramer Science Center, KSC 121 >> ? ? ? ?Idle Hour Blvd, Oakdale, NY, 11769 >> >> ? ? ? ? ? ? ? ? +1-631-244-3035 >> ? ? ? ? ? ? ? ? yaya at dowling.edu >> ===================================================== >> >> On Sat, 11 Sep 2010, SIMON WESTRIP wrote: >> >>> Dear all >>> >>> I have found recent exchanges, especially Herbert's contributions >>> regarding >>> the real-world use of imgCIF, very >>> enlightening. Primarily for reasons of flexibility, I now find myself >>> inclined to support a CIF specification >>> that allows a variety of encodings, provided that such are "clearly and >>> unambiguously defined". 
>>>
>>> To me, the clear and unambiguous definition should encompass a clear and
>>> unambiguous *declaration* of the encoding; in the absence of such a
>>> declaration in the CIF or in its container, a default encoding
>>> should be assumed, either the default CIF encoding (which I think most
>>> agree should be UTF8) or inherited from the container?
>>>
>>> Though CIF1 has been successful without such a declaration (largely
>>> because of the ASCII restriction),
>>> I believe it is essential in the case of CIF2.
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ____________________________________________________________________________
>>> From: "Bollinger, John C"
>>> To: Group for discussing encoding and content validation schemes for CIF2
>>> Sent: Friday, 10 September, 2010 19:24:05
>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>>>
>>> On Friday, September 10, 2010 11:02 AM, Herbert J. Bernstein wrote:
>>>> As I have said before, we went through this approach
>>>> in 1997 and ended up going the other way -- treating the
>>>> text-based CIF and the binary CBF as parts of the _same_
>>>> format, not two different formats, not one being a serialization
>>>> of the other, but the same format. This may seem like a
>>>> minor distinction, but it actually has strong implications
>>>> for software design and implementation, ensuring that
>>>> binaries in a CIF context are just a particular type of data
>>>> handled with all the same mechanisms as ASCII data, allowing,
>>>> for example, multiple diffraction images and thumbnails in
>>>> one file in an order-independent way.
>>>>
>>>> You may be interested to know that the false dichotomy between
>>>> binary and text-based representations is now starting
>>>> to impact HDF5, requiring some significant effort to now
>>>> work in database access, an aspect CIF1 supports -- why
>>>> throw it away for CIF2?
>>>
>>> Herb,
>>>
>>> Perhaps you're reading more into my comments than I intended to put there.
>>> In particular, I did not aim to suggest one on-disk/wire format should be a
>>> serialization of another, but rather that *all* on-disk/wire formats be
>>> characterized in terms of serialization of the Unicode character sequences
>>> described by most of the spec. I meant "text" in that sense -- a sequence
>>> of Unicode characters -- not in the sense of a sequence of bytes conforming
>>> to some particular set of local conventions for text. I meant
>>> "serialization" in the general sense of any reversible transformation of CIF
>>> text into a byte sequence, including those that rely on interpreting the CIF
>>> syntax. That's aimed primarily at recognizing the use case in which CIF2 is
>>> embedded in or transformed into some other format, such as XML.
>>>
>>> I postulate, but do not specify, a serialization form defining the CIF2
>>> version of what we have conventionally called "a CIF." The details of that
>>> form are exactly what this list was established to discuss, and I did not
>>> intend to imply a particular resolution of our ongoing debate. It was
>>> perhaps a mistake to include imgCIF/CBF on the list of possible alternative
>>> serialization forms, as it is far from settled whether it will fit under the
>>> umbrella of the 'CIF File' serialization form. I apologize if that caused
>>> confusion.
>>>
>>> [... I wrote:]
>>>>> I think this matter would be best addressed by explicitly adopting an
>>>>> idea that we have discussed before: a formal separation between the
>>>>> definition of CIF text (i.e. James's "CIF2-conformant character stream") and
>>>>> the particular kind of packaging that we are accustomed to calling "a CIF"
>>>>> or "a CIF file". James's suggestion implies such a separation anyway, so
>>>>> let's not do it halfway.
>>>>> Given such a separation, the explanatory comment
>>>>> could be as simple as:
>>>>>
>>>>> "This specification's definition of the 'CIF File' serialization form for
>>>>> CIF2 text is not intended to preclude definition or use of other
>>>>> serialization forms, such as HDF5-based forms, XML-based forms, or
>>>>> imgCIF/CBF."
>>>>>
>>>>> I choose the term "serialization form" because it puts primary emphasis
>>>>> on the CIF text (which after all is the subject of the bulk of the
>>>>> specification). Every correct serialization of CIF text is, by definition,
>>>>> transformable into CIF text form.
>>>
>>>
>>> Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology
>>> St. Jude Children's Research Hospital
>>>
>>> Email Disclaimer: www.stjude.org/emaildisclaimer
>>>
>>> _______________________________________________
>>> cif2-encoding mailing list
>>> cif2-encoding at iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148

From jamesrhester at gmail.com Mon Sep 13 13:32:52 2010
From: jamesrhester at gmail.com (James Hester)
Date: Mon, 13 Sep 2010 22:32:52 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To: <823366.90417.qm@web87010.mail.ird.yahoo.com>
References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local>
 <856929.33676.qm@web87013.mail.ird.yahoo.com>
 <427596.65604.qm@web87014.mail.ird.yahoo.com>
 <823366.90417.qm@web87010.mail.ird.yahoo.com>
Message-ID:

The original concept was to edit the non UTF8 files in the text editor of choice, then run a simple checksumming application (that understands CIF2 syntax) to update the checksum. This application would also pick out sections of text that would be displayed incorrectly in the wrong encoding, and ask the user to confirm that the text was displayed correctly. Such an application could be made freely available by the IUCr.

On Mon, Sep 13, 2010 at 8:22 PM, SIMON WESTRIP wrote:
> I questioned:
>
> "For example, if mandatory, does that mean it becomes impossible to create a
> non-UTF8 CIF without using CIF2-aware software?"
>
> In some respects this might not be a bad idea - i.e. restricting the use of
> non-UTF8 to CIF2-aware systems...
>
> Simon (thinking aloud)
>
> ________________________________
> From: SIMON WESTRIP
> To: Group for discussing encoding and content validation schemes for CIF2
> Sent: Monday, 13 September, 2010 11:05:12
> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>
> Yes - I believe that such a declaration should be mandatory for all non-UTF8
> CIF2 files, and agree that a supporting checksum mechanism would be very
> useful to CIF2-aware programs. Until I've revisited the checksum scheme,
> I cannot say that the checksum should be mandatory too.
> For example, if mandatory, does that mean it becomes impossible to create a
> non-UTF8 CIF without using
> CIF2-aware software?
>
> I need to review the discussions on checksums and indeed the various forms
> that such a declaration might take,
> but I do believe in the principle that it should be mandatory for all
> 'stand-alone' non-UTF8 CIF2 files.
> If a CIF is packaged in a container, then it will be the job of non-CIF
> software to retrieve it from the container
> and deliver it in its original form. So a non-UTF8 CIF packaged in a
> non-UTF8 container (or even a UTF8 container)
> should still carry its non-UTF8 declaration.
>
> Cheers
>
> Simon
>
> ________________________________
> From: James Hester
> To: Group for discussing encoding and content validation schemes for CIF2
> Sent: Monday, 13 September, 2010 6:24:42
> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>
> Hi Simon: the issue with such an encoding declaration is that it is
> not supported by generic text tools, and so would not be automatically
> inserted, updated or respected when creating, editing (ie open in one
> encoding, save in another) or transcoding a CIF2 file. This means it
> has no status beyond a hint that could cause as many problems as it
> solves. Such a declaration becomes more robust if accompanied by the
> checksum that John B suggested. The checksum gives some guarantee
> that the encoding has been checked by a CIF-aware program.
>
> If you are proposing that such a declaration and checksum be mandatory
> for all non-UTF8 CIF2 files (not only during transfer), I agree with
> you that this would be acceptable.
>

From yaya at bernstein-plus-sons.com Mon Sep 13 13:52:04 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Mon, 13 Sep 2010 08:52:04 -0400
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local>
 <856929.33676.qm@web87013.mail.ird.yahoo.com>
 <427596.65604.qm@web87014.mail.ird.yahoo.com>
 <823366.90417.qm@web87010.mail.ird.yahoo.com>
Message-ID:

I would suggest actually writing the utility you have in mind.

In practice, inasmuch as a CIF file looks like a text file, people are very likely to just pick one up in any convenient text editor, change what they want to change and write an unidentified pseudo-cif file back out. Anything else needs to be provided to them in a complete, platform portable, well-documented package they can use easily in place of an editor that they use all the time for everything else.

Please be practical -- CIF is a working tool, embedded in the IUCr journal process flows, in many crystallographic applications, in the PDB workflows, in Dectris detector software, etc., etc.

The more disruptive you make the transition from CIF1 to CIF2, the more software and documentation you need to create to allow people to make the transition actually happen. We are essentially in the same place we were in Osaka. How do we break out of this loop and move forward?

We need a realistic plan to get our job done and have a complete specification with the necessary supporting software for CIF2 in place and ready to demonstrate for Madrid, or I would suggest we accept the failure of this effort, and start over.

  -- Herbert

--
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

From jamesrhester at gmail.com Mon Sep 13 14:31:30 2010
From: jamesrhester at gmail.com (James Hester)
Date: Mon, 13 Sep 2010 23:31:30 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local>
 <856929.33676.qm@web87013.mail.ird.yahoo.com>
Message-ID:

See comments below

On Mon, Sep 13, 2010 at 9:19 PM, Herbert J. Bernstein wrote:
> Dear Colleagues,
>
>  I guess I do not understand the role of COMCIFS. It appears that some of
> us think that COMCIFS has the power to control what people do. I disagree.
> We don't. We cannot _require_ anything. We can make reasonable
> suggestions, and, if they are indeed reasonable, many people will follow
> those suggestions. If, as in the case of requiring people not to use
> ordinary text editors to edit a CIF, what is being suggested is an
> inconvenient nuisance, we will simply be ignored, and we will have a large
> supply of non-compliant, unidentified pseudo-CIF2 files.

Of course the IUCr do not possess shock troops in black helicopters that will descend on your laboratory the moment you use your homebrew program to display a structure from a non-compliant CIF (Simon, please confirm this...)

What we do is make standards. Our syntax standards are aimed primarily at programmers. What these programmers want is assurance that their programs will write files that other compliant programs can read, and will read files that other compliant programs have written. Suggestions are very polite, but don't provide any certainty as to what other programmers will do. Will they accept the suggestion? It is, after all, only a suggestion. By *mandating* we do not mean "do this or we'll send the black helicopters around". We simply mean that this is what compliant files always look like.

Simon was mistaken in thinking that people couldn't use normal text editors to edit a CIF under the checksum type proposals, so please disregard that idea.

>>>  Let me propose what I think would be a reasonable resolution:
>>>
>>> 1. We come to a final resolution on what _information_ is in CIF2,
>>> independent of the representation used. I think we have that in hand.
>>>
>>> 2. We present one UTF-8 based _representation_ of that information
>>> for two essential purposes:
>>>  2.1. To have a concrete way in which to present examples of CIF2; and
>>>  2.2. To have a default assumed representation in which a CIF2 the
>>> representation of which is not otherwise identified is most likely
>>> to have been presented.
>>>
>>> 3. That we suggest some reasonable mechanisms for helping software
>>> developers and users to determine which of the very large number of
>>> possible representations has been used for a given file, including,
>>> but not limited to:
>>>  BOM
>>>  Magic number
>>>  Extended identifying comments
>>>  Encoding tags in the file itself
>>> with sufficient detail to allow developers to get started, but
>>> with a final decision deferred on everything other than the BOM
>>> to allow for broad-based community discussion of what is clearly
>>> a contentious issue. The BOM is going to be in _any_ final list
>>> because it is well-supported by several existing text
>>> editors, and keeps getting forced into files without any user
>>> control.
>
> is about as far as we can go and have any hope of any degree of compliance.

By being so vague about how to deal with encodings, you are simply building in the potential for ambiguity and misunderstanding, thereby creating the inconvenient nuisance you intend to remove.

> To be blunt and specific -- if we _mandate_ a checksum in the file, we will
> just end up with lots of files with checksums that don't agree. Our only
> hope is to recommend a checksum, and, most importantly, provide widely
> supported software to calculate it both for automatic insertions and for
> validation.
>
> as for
>
>> Herbert: you have not reacted to a suggestion that we simply reserve
>> the first line of a CIF2 file for future expansion, and state that non
>> UTF8 encodings for CIF2 would be considered by COMCIFS as the need
>> arose.
>
> this brings us back to being unreasonably rigid and fussy and certain to be
> ignored. I cannot figure out what "reserve the first line of a CIF2 file"
> means in practice, and "non UTF8 encodings ... be considered by COMCIFS" has
> no practical meaning for what a poor user or software developer in, say, a
> code-page or UCS2 environment is supposed to do now.

Please enlarge on this important use case. Are you suggesting that there are systems out there that enforce UCS2 for all text files or use a code-page that imposes an encoding on all text files in the system? Your imgCIF use case (embedded in an XML file) was very helpful in resolving one issue, so perhaps you could describe these code-page systems and how they manage file encodings.

> At this rate, CIF2 will _never_ be adopted (and first we have to propose
> something), and that is a great shame. I believe that what I suggested
> above is as far as we can go if we want to stop debating how many angels can
> dance on the head of a pin and get on with having a CIF2 for real people to
> use with real data and real software.

Don't forget that we already have a complete proposal on the table (UTF8 only) which has provoked precisely zero objections outside of this group. A draft core DDLm dictionary based on Syd and Nick's previous efforts is almost ready, and the moment we resolve the encoding issue we can present the syntax and DDLm drafts to COMCIFS. Moreover, I think we all anticipate that the UTF8-only proposal will be adequate for the IUCr's purposes, as they are likely to only accept UTF8 encoded files, so whenever we tire of angel-counting we can simply choose to defer the development of a multiple-encoding standard for another time.
The intent of reserving the first line of the CIF2 file was to allow development of a Scheme-B type proposal at a later date, should we decide to give up for now. There is thus plenty of scope in the current UTF8 standard for creation of real programs in the real world.

> Regards,
>  Herbert

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From jamesrhester at gmail.com Mon Sep 13 14:47:53 2010
From: jamesrhester at gmail.com (James Hester)
Date: Mon, 13 Sep 2010 23:47:53 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To:
References: <856929.33676.qm@web87013.mail.ird.yahoo.com>
 <427596.65604.qm@web87014.mail.ird.yahoo.com>
 <823366.90417.qm@web87010.mail.ird.yahoo.com>
Message-ID:

See comments below.

On Mon, Sep 13, 2010 at 10:52 PM, Herbert J. Bernstein wrote:
> I would suggest actually writing the utility you have in mind.

Why? It is simply a CIF2 syntax parser with a checksum of all the contents thus parsed. It is not worth spending the time on something so obviously possible until we agree that we want such a system.

> In practice, inasmuch as a CIF file looks like a text file, people
> are very likely to just pick one up in any convenient text editor
> change what they want to change and write an unidentified pseudo-cif
> file back out. Anything else needs to be provided to them in
> a complete, platform portable, well-documented package they can
> use easily in place of an editor that they use all the time for
> everything else.

Note that we are not suggesting replacing editors, far from it; if we could do that we wouldn't have a problem in the first place.

> Please be practical -- CIF is a working tool, embedded in the IUCr
> journal process flows, in many crystallographic applications, in
> the PDB workflows, in Dectris detector software, etc., etc.
>
> The more disruptive you make the transition from CIF1 to CIF2, the
> more software and documentation you need to create to allow people
> to make the transition actually happen. We are essentially in
> the same place we were in Osaka. How do we break out of this
> loop and move forward?

We are a lot further forward than Osaka. We have a *complete* syntax specification on the table, which has received zero objections outside of this group. No further DDLm problems have been identified. The only issue left unresolved is that not enough encodings are allowed, although the one encoding that is allowed is actually sufficient for all of the useful work that the IUCr expect to do.
We could take what we have to Madrid, with a single caveat that a system for dealing with non UTF8 encodings is under consideration, and (if the response on the mailing lists is any indication) everybody outside of this list would be happy. As for demonstrations, Nick and Syd have been demonstrating this system for over a decade (with cosmetic differences in syntax).

> We need a realistic plan to get our job done and have a complete
> specification with the necessary supporting software for CIF2 in place
> and ready to demonstrate for Madrid, or I would suggest we
> accept the failure of this effort, and start over.

Somewhat overwrought, don't you think? Because we can't agree on a scheme for additional encodings we should chuck CIF2 syntax, DDLm and dREL overboard?? When the IUCr will function perfectly well with UTF8 only? If you would like to start coding, please structure your code so that the decoding step may take other encodings beyond UTF8. The rest is in the draft standard (you will be pleased to see the lack of ambiguity in that standard; it will make your task easier).

>   -- Herbert
>
>
>
> At 10:32 PM +1000 9/13/10, James Hester wrote:
>>The original concept was to edit the non UTF8 files in the text editor
>>of choice, then run a simple checksumming application (that
>>understands CIF2 syntax) to update the checksum.  This application
>>would also pick out sections of text that would be displayed
>>incorrectly in the wrong encoding, and ask the user to confirm that
>>the text was displayed correctly.  Such an application could be made
>>freely available by the IUCr.
>>
>>On Mon, Sep 13, 2010 at 8:22 PM, SIMON WESTRIP
>> wrote:
>>> I questioned:
>>>
>>> "For example, if mandatory, does that mean it becomes impossible to create a
>>> non-UTF8 CIF without using
>>> CIF2-aware software?"
>>>
>>> In some respects this might not be a bad idea - i.e. restricting the use of
>>> non-UTF8 to CIF2-aware systems...
>>>
>>> Simon (thinking aloud)
>>>
>>> ________________________________
>>> From: SIMON WESTRIP
>>> To: Group for discussing encoding and content validation schemes for CIF2
>>>
>>> Sent: Monday, 13 September, 2010 11:05:12
>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>>>
>>> Yes - I believe that such a declaration should be mandatory for all non-UTF8
>>> CIF2 files,
>>> and agree that a supporting checksum mechanism would be very useful to
>>> CIF2-aware
>>> programs. Until I've revisited the checksum scheme, I cannot say that the
>>> checksum should be mandatory too.
>>> For example, if mandatory, does that mean it becomes impossible to create a
>>> non-UTF8 CIF without using
>>> CIF2-aware software?
>>>
>>> I need to review the discussions on checksums and indeed the various forms
>>> that such a declaration might take,
>>> but I do believe in the principle that it should be mandatory for all
>>> 'stand-alone' non-UTF8 CIF2 files.
>>> If a CIF is packaged in a container, then it will be the job of non-CIF
>>> software to retrieve it from the container
>>> and deliver it in its original form. So a non-UTF8 CIF packaged in a
>>> non-UTF8 container (or even a UTF8 container)
>>> should still carry its non-UTF8 declaration.
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: James Hester
>>> To: Group for discussing encoding and content validation schemes for CIF2
>>>
>>> Sent: Monday, 13 September, 2010 6:24:42
>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>>>
>>> Hi Simon: the issue with such an encoding declaration is that it is
>>> not supported by generic text tools, and so would not be automatically
>>> inserted, updated or respected when creating, editing (ie open in one
>>> encoding, save in another) or transcoding a CIF2 file.
>>> This means it
>>> has no status beyond a hint that could cause as many problems as it
>>> solves. Such a declaration becomes more robust if accompanied by the
>>> checksum that John B suggested.  The checksum gives some guarantee
>>> that the encoding has been checked by a CIF-aware program.
>>>
>>> If you are proposing that such a declaration and checksum be mandatory
>>> for all non-UTF8 CIF2 files (not only during transfer), I agree with
>>> you that this would be acceptable.
>>>
>>>
>>_______________________________________________
>>cif2-encoding mailing list
>>cif2-encoding at iucr.org
>>http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>
>
> --
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya at dowling.edu
> =====================================================
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From simonwestrip at btinternet.com Mon Sep 13 15:12:40 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Mon, 13 Sep 2010 07:12:40 -0700 (PDT)
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBF@SJMEMXMBS11.stjude.sjcrh.local> <856929.33676.qm@web87013.mail.ird.yahoo.com>
Message-ID: <569253.75369.qm@web87007.mail.ird.yahoo.com>

"Simon was mistaken in thinking that people couldn't use normal text editors to edit a CIF under the checksum type proposals, so please disregard that idea."

It wasn't that you couldn't use a normal text editor - rather you would need CIF2 software to generate the checksum?

Anyway, this is perhaps a side issue ... I need to consider other points and subsequent messages before trying to offer anything useful...

Cheers

Simon

________________________________
From: James Hester
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Monday, 13 September, 2010 14:31:30
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

See comments below

On Mon, Sep 13, 2010 at 9:19 PM, Herbert J. Bernstein wrote:
> Dear Colleagues,
>
> I guess I do not understand the role of COMCIFS. It appears that some of
> us think that COMCIFS has the power to control what people do. I disagree.
> We don't. We cannot _require_ anything. We can make reasonable
> suggestions, and, if they are indeed reasonable, many people will follow
> those suggestions. If, as in the case of requiring people not to use
> ordinary text editors to edit a CIF, what is being suggested is an
> inconvenient nuisance, we will simply be ignored, and we will have a large
> supply of non-compliant, unidentified pseudo-CIF2 files.

Of course the IUCr do not possess shock troops in black helicopters that will descend on your laboratory the moment you use your homebrew program to display a structure from a non-compliant CIF (Simon, please confirm this...)

What we do is make standards. Our syntax standards are aimed primarily at programmers. What these programmers want is assurance that their programs will produce files that can read and write files written or read by other compliant programs. Suggestions are very polite, but don't provide any certainty as to what other programmers will do. Will they accept the suggestion? It is, after all, only a suggestion. By *mandating* we do not mean "do this or we'll send the black helicopters around". We simply mean that this is what compliant files always look like.
Simon was mistaken in thinking that people couldn't use normal text editors to edit a CIF under the checksum type proposals, so please disregard that idea.

>>> Let me propose what I think would be a reasonable resolution:
>>>
>>> 1. We come to a final resolution on what _information_ is in CIF2,
>>> independent of the representation used. I think we have that in hand.
>>>
>>> 2. We present one UTF-8 based _representation_ of that information
>>> for two essential purposes:
>>> 2.1. To have a concrete way in which to present examples of CIF2; and
>>> 2.2. To have a default assumed representation in which a CIF2 the
>>> representation of which is not otherwise identified is most likely
>>> to have been presented.
>>>
>>> 3. That we suggest some reasonable mechanisms for helping software
>>> developers and users to determine which of the very large number of
>>> possible representations has been used for a given file, including,
>>> but not limited to:
>>>      BOM
>>>      Magic number
>>>      Extended identifying comments
>>>      Encoding tags in the file itself
>>> with sufficient detail to allow developers to get started, but
>>> with a final decision deferred on everything other than the BOM
>>> to allow for broad-based community discussion of what is clearly
>>> a contentious issue. The BOM is going to be in _any_ final list
>>> because it is well-supported by several existing text
>>> editors, and keeps getting forced into files without any user
>>> control.
>
> is about as far as we can go and have any hope of any degree of compliance.

By being so vague about how to deal with encodings, you are simply building in the potential for ambiguity and misunderstanding, thereby creating the inconvenient nuisance you intend to remove.

> To be blunt and specific -- if we _mandate_ a checksum in the file, we will
> just end up with lots of files with checksums that don't agree. Our only
> hope is to recommend a checksum, and, most importantly, provide widely
> supported software to calculate it both for automatic insertions and for
> validation.
>
> as for
>
>> Herbert: you have not reacted to a suggestion that we simply reserve
>> the first line of a CIF2 file for future expansion, and state that non
>> UTF8 encodings for CIF2 would be considered by COMCIFS as the need
>> arose.
>
> this brings us back to being unreasonably rigid and fussy and certain to be
> ignored. I cannot figure out what "reserve the first line of a CIF2 file"
> means in practice, and "non UTF8 encodings ... be considered by COMCIFS" has
> no practical meaning for what a poor user or software developer in, say, a
> code-page or UCS2 environment is supposed to do now.

Please enlarge on this important use case. Are you suggesting that there are systems out there that enforce UCS2 for all text files or use a code-page that imposes an encoding on all text files in the system? Your imgCIF use case (embedded in an XML file) was very helpful in resolving one issue, so perhaps you could describe these code-page systems and how they manage file encodings.

> At this rate, CIF2 will _never_ be adopted (and first we have to propose
> something), and that is a great shame. I believe that what I suggested
> above is as far as we can go if we want to stop debating how many angels can
> dance on the head of a pin and get on with having a CIF2 for real people to
> use with real data and real software.

Don't forget that we already have a complete proposal on the table (UTF8 only) which has provoked precisely zero objections outside of this group. A draft core DDLm dictionary based on Syd and Nick's previous efforts is almost ready, and the moment we resolve the encoding issue we can present the syntax and DDLm drafts to COMCIFS.
Moreover, I think we all anticipate that the UTF8-only proposal will be adequate for the IUCr's purposes, as they are likely to only accept UTF8 encoded files, so whenever we tire of angel-counting we can simply choose to defer the development of a multiple-encoding standard for another time. The intent of reserving the first line of the CIF2 file was to allow development of a Scheme-B type proposal at a later date, should we decide to give up for now. There is thus plenty of scope in the current UTF8 standard for creation of real programs in the real world. > Regards, > Herbert > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Mon, 13 Sep 2010, James Hester wrote: > >> Your step 3 is what we are discussing here. It is not sufficient to >> simply "suggest" reasonable mechanisms, as this leaves developers >> bewildered as to which of these potentially vague suggestions they can >> and should support, leading to confusion and inability of programs to >> communicate with one another. Far better to *specify* mechanisms, >> which is what we are groping towards doing here. >> >> Herbert: you have not reacted to a suggestion that we simply reserve >> the first line of a CIF2 file for future expansion, and state that non >> UTF8 encodings for CIF2 would be considered by COMCIFS as the need >> arose. >> >> On Sun, Sep 12, 2010 at 12:33 AM, Herbert J. Bernstein >> wrote: >>> >>> Dear Colleagues, >>> >>> Let me propose what I think would be a reasonable resolution: >>> >>> 1. We come to a final resolution on what _information_ is in CIF2, >>> independent of the representation used. I think we have that in hand. >>> >>> 2. We present one UTF-8 based _representation_ of that information >>> for two essential purposes: >>> 2.1. 
To have a concrete way in which to present examples of CIF2; and >>> 2.2. To have a default assumed representation in which a CIF2 the >>> representation of which is not otherwise identified is most likely >>> to have been presented. >>> >>> 3. That we suggest some reasonable mechanisms for helping software >>> developers and users to determine which of the very large number of >>> possible reprentations has been used for a given file, including, >>> but not limited to: >>> BOM >>> Magic number >>> Extended idenfifying comments >>> Encoding tags in the file itself with sufficient detail to allow >>> developers >>> to get started, but >>> with a final decision deferred on everything other than the BOM >>> to allow for broad-based community discussion of what is clearly >>> a contentious issue. The BOM is going to be in _any_ final list >>> because is is well-supported by several existing text >>> editors, and keeps getting forced into files without any user >>> control. >>> >>> Regards, >>> Herbert >>> >>> ===================================================== >>> Herbert J. Bernstein, Professor of Computer Science >>> Dowling College, Kramer Science Center, KSC 121 >>> Idle Hour Blvd, Oakdale, NY, 11769 >>> >>> +1-631-244-3035 >>> yaya at dowling.edu >>> ===================================================== >>> >>> On Sat, 11 Sep 2010, SIMON WESTRIP wrote: >>> >>>> Dear all >>>> >>>> I have found recent exchanges, especially Herbert's contributions >>>> regarding >>>> the real-world use of imgCIF, very >>>> enlightening. Primarily for reasons of flexibility, I now find myself >>>> inclined to support a CIF specification >>>> that allows a variety of encodings, provided that such are "clearly and >>>> unambiguously defined". 
>>>> >>>> To me, the clear and unambiguous definition should encompass a clear and >>>> unambiguous *declaration* >>>> of the encoding; in the absence of such a declaration in the CIF or in >>>> its >>>> container, a default encoding >>>> should be assummed, either the default CIF encoding (which I think most >>>> agree should be UTF8) or inherited >>>> from the container? >>>> >>>> Though CIF1 has been successful without such a declaration (largely >>>> because >>>> of the ASCII restriction), >>>> I beleive it is essential in the case of CIF2. >>>> >>>> Cheers >>>> >>>> Simon >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> ____________________________________________________________________________ >>>> From: "Bollinger, John C" >>>> To: Group for discussing encoding and content validation schemes for >>>> CIF2 >>>> >>>> Sent: Friday, 10 September, 2010 19:24:05 >>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. >>>> .. >>>> . >>>> >>>> >>>> On Friday, September 10, 2010 11:02 AM, Herbert J. Bernstein wrote: >>>>> >>>>> As I have said before, we went through this approach >>>>> in 1997 and ended up going the other way -- treating the >>>>> text-based CIF and the binary CBF as parts of the _same_ >>>>> format, not two different formats, not one being a serialization >>>>> of the other, but the same format. This may seem like a >>>>> minor distinction, but it actually has strong implications >>>>> for software design and implementation, ensuring that >>>>> binaries in a CIF context are just a particular type of data >>>>> handled with all the same mecnahisms as ASCII data, allowing, >>>>> for example, multiple diffraction images and thumbnails in >>>>> one file in an order-independent way. 
>>>>> >>>>> You may be interested to know that the false dichotomy between >>>>> binary and text-based representations is not starting >>>>> to imapct HDF5, requiring some significant effort to now >>>>> work in database access, an aspect CIF1 supports -- why >>>>> throw it away for CIF2? >>>> >>>> Herb, >>>> >>>> Perhaps you're reading more into my comments than I intended to put >>>> there. >>>> In particular, I did not aim to suggest one on-disk/wire format should >>>> be >>>> a >>>> serialization of another, but rather that *all* on-disk/wire formats be >>>> characterized in terms of serialization of the Unicode character >>>> sequences >>>> described by most of the spec. I meant "text" in that sense -- a >>>> sequence >>>> of Unicode characters -- not in the sense of a sequence of bytes >>>> conforming >>>> to some particular set of local conventions for text. I meant >>>> "serialization" in the general sense of any reversible transformation of >>>> CIF >>>> text into a byte sequence, including those that rely on interpreting the >>>> CIF >>>> syntax. That's aimed primarily at recognizing the use case in which >>>> CIF2 >>>> is >>>> embedded in or transformed into some other format, such as XML. >>>> >>>> I postulate, but do not specify, a serialization form defining the CIF2 >>>> version of what we have conventionally called "a CIF." The details of >>>> that >>>> form are exactly what this list was established to discuss, and I did >>>> not >>>> intend to imply a particular resolution of our ongoing debate. It was >>>> perhaps a mistake to include imgCIF/CBF on the list of possible >>>> alternative >>>> serialization forms, as it is far from settled whether it will fit under >>>> the >>>> umbrella of the 'CIF File' serialization form. I apologize if that >>>> caused >>>> confusion. >>>> >>>> [... 
I wrote:] >>>>>> >>>>>> I think this matter would be best addressed by explicitly adopting an >>>> >>>> idea that we have discussed before: a formal separation between the >>>> definition of CIF text (i.e. James's "CIF2-conformant character stream") >>>> and >>>> the particular kind of packaging that we are accustomed to calling "a >>>> CIF" >>>> or "a CIF file". James's suggestion implies such a separation anyway, >>>> so >>>> let's not do it halfway. Given such a separation, the explanatory >>>> comment >>>> could be as simple as: >>>>>> >>>>>> "This specification's definition of the 'CIF File' serialization form >>>>>> for >>>> >>>> CIF2 text is not intended to preclude definition or use of other >>>> serialization forms, such as HDF5-based forms, XML-based forms, or >>>> imgCIF/CBF." >>>>>> >>>>>> I choose the term "serialization form" because it puts primary >>>>>> emphasis >>>> >>>> on the CIF text (which after all is the subject of the bulk of the >>>> specification). Every correct serialization of CIF text is, by >>>> definition, >>>> transformable into CIF text form. >>>> >>>> >>>> Regards, >>>> >>>> John >>>> -- >>>> John C. Bollinger, Ph.D. >>>> Department of Structural Biology >>>> St. 
Jude Children's Research Hospital >>>> >>>> >>>> >>>> Email Disclaimer: www.stjude.org/emaildisclaimer
From yaya at bernstein-plus-sons.com Mon Sep 13 15:47:46 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Mon, 13 Sep 2010 10:47:46 -0400 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <856929.33676.qm@web87013.mail.ird.yahoo.com> <427596.65604.qm@web87014.mail.ird.yahoo.com> <823366.90417.qm@web87010.mail.ird.yahoo.com> Message-ID: Dear James, >Somewhat overwrought, don't you think? Because we can't agree on a >scheme for additional encodings we should chuck CIF2 syntax, DDLm and >dREL overboard?? When the IUCr will function perfectly well with UTF8 >only?
If you would like to start coding, please structure your code so
>that the decoding step may take other encodings beyond UTF8. The rest
>is in the draft standard (you will be pleased to see the lack of
>ambiguity in that standard, it will make your task easier).

No, actually, I am being very, very restrained in my public comments. Right now the CIF2 effort really seems to be headed nowhere. I would like it to be used. If the IUCr will "function perfectly well with the UTF8 only" version, then let's get the IUCr workflows converted and get this thing in use. Please point to the current best URLs, let's see if we have agreement on those specific documents by putting them to a formal COMCIFS vote, and then let's ask the IUCr journals operation to give it a try.

>Why? It is simply a CIF2 syntax parser with checksum of all the
>contents thus parsed. It is not worth spending the time on something
>so obviously possible until we agree that we want such a system.

The concept may be obvious, but the details of implementation most certainly are not, and without a reference implementation we will end up with multiple incompatible interpretations, e.g. do lines get trailing blanks stripped? Does embedded whitespace in bracketed constructs get compressed? God and the devil are in the details.

>What we do is make standards. Our syntax standards are aimed
>primarily at programmers. What these programmers want is assurance
>that their programs will produce files that can read and write files
>written or read by other compliant programs. Suggestions are very
>polite, but don't provide any certainty as to what other programmers
>will do. Will they accept the suggestion? It is, after all, only a
>suggestion. By *mandating* we do not mean "do this or we'll send the
>black helicopters around". We simply mean that this is what compliant
>files always look like.

Sorry, but we must deal with very different programmers. The ones I deal with rebel at anything stronger than a polite suggestion grounded in common interest. You mistake the silence on the DDLm proposal for agreement. I would suggest asking some of the major developers if they have even read it yet. The burden is on us to justify _any_ effort we will require of them. We need to have something finished, complete and well supported with necessary software before most of them will even think about learning what we have been up to.

>By being so vague about how to deal with encodings, you are simply
>building in the potential for ambiguity and misunderstanding, thereby
>creating the inconvenient nuisance you intend to remove.

No, I am simply respecting the difference between a text representation and a binary representation. Multiple encodings are a fact of life when working with text.

>> this brings us back to being unreasonably rigid and fussy and certain to be
>> ignored. I cannot figure out what "reserve the first line of a CIF2 file"
>> means in practice, and "non UTF8 encodings ... be considered by COMCIFS" has
>> no practical meaning for what a poor user or software developer in, say, a
>> code-page or UCS2 environment is supposed to do now.
>
>Please enlarge on this important use case. Are you suggesting that
>there are systems out there that enforce UCS2 for all text files or
>use a code-page that imposes an encoding on all text files in the
>system? Your imgCIF use case (embedded in an XML file) was very
>helpful in resolving one issue, so perhaps you could describe these
>code-page systems and how they manage file encodings.

Here is a simple example use case: Assume we have specified UTF-8 with no BOM. Assume a user on a system with an editor that writes a BOM on output of all UTF-8 files edits a CIF2 file. What is he supposed to do now (not in theory, but with the tools he has available on normal OSes plus what we are providing), especially because he probably has no indication of the change?
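
[Illustration, not part of the original message.] The BOM use case above can be made concrete with a minimal sketch, assuming Python and assuming that reading software is allowed to tolerate a leading BOM. The function names `read_cif2_text` and `write_cif2_text` and the sample data block are hypothetical; nothing here is taken from the draft specification.

```python
import codecs


def read_cif2_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping a leading BOM if one is present."""
    # The "utf-8-sig" codec skips an initial EF BB BF and otherwise
    # behaves like plain UTF-8.
    return raw.decode("utf-8-sig")


def write_cif2_text(text: str) -> bytes:
    """Encode text as UTF-8 without a BOM."""
    return text.encode("utf-8")


# A made-up CIF fragment, as written by BOM-free software ...
original = "data_example\n_cell_length_a 5.43\n".encode("utf-8")
# ... and the same bytes after a round trip through a BOM-writing editor.
edited_with_bom = codecs.BOM_UTF8 + original

# A tolerant reader sees identical text either way, and re-saving
# restores the BOM-free form.
assert read_cif2_text(edited_with_bom) == read_cif2_text(original)
assert write_cif2_text(read_cif2_text(edited_with_bom)) == original
```

Whether a compliant reader should accept the BOM, or a separate tool should strip it before submission, is exactly the policy question under discussion; the sketch only shows that either choice is cheap to implement.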
Here is another simple example use case: We have a user working with an EUC-CN code page based editor with a UTF-8 based CIF. What should he do to edit that CIF and return it to the IUCr?

We are dealing with both software developers and end-users. We need to consider both.

Regards,
Herbert

-- ===================================================== Herbert J.
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya at dowling.edu
=====================================================

From simonwestrip at btinternet.com Mon Sep 13 17:10:57 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Mon, 13 Sep 2010 16:10:57 +0000 (GMT)
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To:
References: <856929.33676.qm@web87013.mail.ird.yahoo.com> <427596.65604.qm@web87014.mail.ird.yahoo.com> <823366.90417.qm@web87010.mail.ird.yahoo.com>
Message-ID: <706034.11045.qm@web87005.mail.ird.yahoo.com>

A few notes on IUCr workflow and the impact of the encoding issue:

Submission/author services (i.e. involving CIF upload):
I envisage that where the encoding could not be determined by e.g. BOM, upon submission of a CIF it may be necessary to prompt the user to confirm the encoding (maybe using an interactive tool that allows the uploaded CIF to be viewed in a variety of encodings). This is probably not unreasonable as most of the IUCr's author services are interactive.

Processing:
Changes to subsequent processing would probably not involve much more than converting the CIF to e.g. UTF-8, or whatever encoding is required by the processing software.

Archive:
Though changes to the CIF archive would be negligible, the way in which CIFs are retrieved from the archive may require some changes (e.g. offering the recipient a choice of encoding if permitted by the spec), though the content that is made publicly available is unlikely to contain non-ASCII characters.

So work would be required, but not nearly as much as involved in working with the new dictionaries and developing a system to handle both CIF1 and CIF2 in the transition period.

As far as the user is concerned, there may be slight inconvenience of having to confirm the encoding of their CIF every time it is uploaded to an IUCr service.
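
[Illustration, not part of the original message.] The submission and processing steps described above (use a BOM when there is one, otherwise fall back to an encoding confirmed by the user, then convert to UTF-8) might be sketched as follows. This assumes Python; `sniff_bom` and `to_utf8` are hypothetical names, and this is not a description of any existing IUCr service.

```python
import codecs
from typing import Optional, Tuple

# Longer BOMs are listed first, because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes) -> Optional[Tuple[str, int]]:
    """Return (codec name, BOM length) if the data starts with a known BOM."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None


def to_utf8(raw: bytes, confirmed_encoding: Optional[str] = None) -> bytes:
    """Convert an uploaded file to BOM-free UTF-8.

    If no BOM is found, the caller must pass the encoding that the user
    confirmed interactively; without it the function refuses to guess.
    """
    hit = sniff_bom(raw)
    if hit is not None:
        name, skip = hit
        return raw[skip:].decode(name).encode("utf-8")
    if confirmed_encoding is None:
        raise ValueError("encoding undetermined; ask the user to confirm it")
    return raw.decode(confirmed_encoding).encode("utf-8")
```

Refusing to guess when there is neither a BOM nor a confirmed encoding mirrors the interactive prompt described above; a batch service would need some other fallback.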
This perceived impact would probably hold regardless of what is decided upon: if CIF2 were to be UTF8 only, in recognition of the variety of encodings available and user practice with standard (non-CIF) text editors, I expect the IUCr would still attempt to accommodate non-UTF8 CIFs.

So you may ask why I've bothered to support or otherwise some of the proposals discussed in this thread. Basically, given that CIF is a 'text' format, the specification should address the issues arising from that format, so I do not agree that text-encoding should play no part in, or be treated separately from the standard. Equally, the standard should not mandate anything that markedly affects the treatment of CIF as 'text' (i.e. complete reliance on CIF-only software).

I was in favour of UTF8-only as in the draft spec, but after Herbert's description of imgCIF in particular, I now find myself thinking we ought to be more flexible (at the very least by leaving the door open to other encodings). To this end, I think the specification should allow a 'declaration' of the encoding, however unreliable given current practice of using any old text editor. Furthermore, I do not think it unreasonable for the specification to define a default encoding. After all, CIF is a data-exchange format and surely that requires strict definitions if it is to work as such.

Overall, I suspect there will be problems relating to encoding, but I am of the view that with good software support and a specification that addresses the issue, they will be minimal.

Cheers

Simon

________________________________
From: Herbert J. Bernstein
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Monday, 13 September, 2010 15:47:46
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Dear James,

>Somewhat overwrought, don't you think? Because we can't agree on a
>scheme for additional encodings we should chuck CIF2 syntax, DDLm and
>dREL overboard??
When the IUCr will function perfectly well with UTF8
>only? If you would like to start coding, please structure your code so
>that the decoding step may take other encodings beyond UTF8. The rest
>is in the draft standard (you will be pleased to see the lack of
>ambiguity in that standard, it will make your task easier).

No, actually, I am being very, very restrained in my public comments. Right now the CIF2 efforts really seem to be headed nowhere. I would like it to be used. If the IUCr will "function perfectly well with the UTF8 only" version, then let's get the IUCr workflows converted and get this thing in use. Please point to the current best URLs, let's see if we have agreement on those specific documents by putting them to a formal COMCIFS vote and then let's ask the IUCr journals operation to give it a try.

>Why? It is simply a CIF2 syntax parser with checksum of all the
>contents thus parsed. It is not worth spending the time on something
>so obviously possible until we agree that we want such a system.

The concept may be obvious, the details of implementation most certainly are not, and without a reference implementation, we will end up with multiple incompatible interpretations, e.g. do lines get trailing blanks stripped? Does embedded whitespace in bracketed constructs get compressed? God and the devil are in the details.

>What we do is make standards. Our syntax standards are aimed
>primarily at programmers. What these programmers want is assurance
>that their programs will produce files that can read and write files
>written or read by other compliant programs. Suggestions are very
>polite, but don't provide any certainty as to what other programmers
>will do. Will they accept the suggestion? It is, after all, only a
>suggestion. By *mandating* we do not mean "do this or we'll send the
>black helicopters around". We simply mean that this is what compliant
>files always look like.

Sorry, but we must deal with very different programmers.
The ones I deal with rebel at anything stronger than a polite suggestion grounded in common interest. You mistake the silence on the DDLm proposal for agreement. I would suggest asking some of the major developers if they have even read it yet. The burden is on us to justify _any_ effort we will require of them. We need to have something finished, complete and well supported with necessary software before most of them will even think about learning what we have been up to.

>By being so vague about how to deal with encodings, you are simply
>building in the potential for ambiguity and misunderstanding, thereby
>creating the inconvenient nuisance you intend to remove.

No, I am simply respecting the difference between a text representation and a binary representation. Multiple encodings are a fact of life when working with text.

>> this brings us back to being unreasonably rigid and fussy and certain to be
>> ignored. I cannot figure out what "reserve the first line of a CIF2 file"
>> means in practice, and "non UTF8 encodings ... be considered by COMCIFS" has
>> no practical meaning for what a poor user or software developer in, say, a
>> code-page or UCS2 environment is supposed to do now.
>
>Please enlarge on this important use case. Are you suggesting that
>there are systems out there that enforce UCS2 for all text files or
>use a code-page that imposes an encoding on all text files in the
>system? Your imgCIF use case (embedded in an XML file) was very
>helpful in resolving one issue, so perhaps you could describe these
>code-page systems and how they manage file encodings.

Here is a simple example use case: Assume we have specified UTF-8 with no BOM. Assume a user on a system with an editor that writes a BOM on output of all UTF-8 files edits a CIF2 file. What is he supposed to do now (not in theory, but with the tools he has available on normal OS's plus what we are providing), especially because he probably has no indication of the change.
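The BOM use case above can be made concrete with a small sketch. It assumes Python and a hypothetical helper named `strip_utf8_bom`; nothing here is taken from the draft CIF2 specification, and it shows only one way a tool could detect and undo the silent change.

```python
import codecs


def strip_utf8_bom(raw):
    """Return (payload, had_bom) for a byte string read from a file.

    If an editor silently prepended the three-byte UTF-8 byte-order mark,
    remove it and report that fact; otherwise return the bytes unchanged.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):], True
    return raw, False


if __name__ == "__main__":
    original = b"data_example\n_cell.length_a 5.0\n"
    # Simulate an editor that writes a BOM on every UTF-8 save.
    after_edit = codecs.BOM_UTF8 + original
    payload, had_bom = strip_utf8_bom(after_edit)
    print("BOM found:", had_bom)
    print("content restored:", payload == original)
```

Whether a conforming reader should accept, reject or silently strip the BOM is exactly the policy question raised in the message; the sketch takes no position on that.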
Here is another simple example use case: We have a user working with an EUC-CN code page based editor with a UTF-8 based CIF. What should he do to edit that CIF and return it to the IUCr?

We are dealing with both software developers and end-users. We need to consider both.

Regards,
Herbert

At 11:47 PM +1000 9/13/10, James Hester wrote:
>See comments below.
>
>On Mon, Sep 13, 2010 at 10:52 PM, Herbert J. Bernstein
> wrote:
>> I would suggest actually writing the utility you have in mind.
>
>Why? It is simply a CIF2 syntax parser with checksum of all the
>contents thus parsed. It is not worth spending the time on something
>so obviously possible until we agree that we want such a system.
>
>> In practice, inasmuch as a CIF file looks like a text file, people
>> are very likely to just pick one up in any convenient text editor
>> change what they want to change and write an unidentified pseudo-cif
>> file back out. Anything else needs to be provided to them in
>> a complete, platform portable, well-documented package they can
>> use easily in place of an editor that they use all the time for
>> everything else.
>
>Note that we are not suggesting replacing editors, far from it, if we
>could do that we wouldn't have a problem in the first place.
>
>> Please be practical -- CIF is a working tool, embedded in the IUCr
>> journal process flows, in many crystallographic applications, in
>> the PDB workflows, in Dectris detector software, etc., etc.
>>
>> The more disruptive you make the transition from CIF1 to CIF2, the
>> more software and documentation you need to create to allow people
>> to make the transition actually happen. We are essentially in
>> the same place we were in Osaka. How do we break out of this
>> loop and move forward?
>
>We are a lot further forward than Osaka. We have a *complete* syntax
>specification on the table, which has received zero objections outside
>of this group. No further DDLm problems have been identified.
The >only issue left unresolved is that not enough encodings are allowed, >although the one encoding that is allowed is actually sufficient for >all of the useful work that the IUCr expect to do. We could take what >we have to Madrid, with a single caveat that a system for dealing with >non UTF8 encodings is under consideration, and (if the response on the >mailing lists is any indication) everybody would be happy outside of >this list. As for demonstrations, Nick and Syd have been >demonstrating this system for over a decade (with cosmetic differences >in syntax). > >> We need a realistic plan to get our job done and have a complete >> specification with the necessary supporting software for CIF2 in place >> and ready to demonstrate for Madrid, or I would suggest we >> accept the failure of this effort, and start over. > >Somewhat overwrought, don't you think? Because we can't agree on a >scheme for additional encodings we should chuck CIF2 syntax, DDLm and >dREL overboard?? When the IUCr will function perfectly well with UTF8 >only? If you would like to start coding, please structure your code so >that the decoding step may take other encodings beyond UTF8. The rest >is in the draft standard (you will be pleased to see the lack of >ambiguity in that standard, it will make your task easier). > >> -- Herbert >> >> >> >> At 10:32 PM +1000 9/13/10, James Hester wrote: >>>The original concept was to edit the non UTF8 files in the text editor >>>of choice, then run a simple checksumming application (that >>>understands CIF2 syntax) to update the checksum. This application >>>would also pick out sections of text that would be displayed >>>incorrectly in the wrong encoding, and ask the user to confirm that >>>the text was displayed correctly. Such an application could be made > >>freely available by the IUCr. 
>>> >>>On Mon, Sep 13, 2010 at 8:22 PM, SIMON WESTRIP >>> wrote: >>>> I questioned: >>>> >>>> "For example, if mandatory, does that mean it becomes >>>>impossible to create a >>>> non-UTF8 CIF without using >>>> CIF2-aware software?" >>>> >>>> In some respects this might not be a bad idea - i.e.restricting >>>>the use of >>>> non-UTF8 to CIF2-aware systems... >>>> >>>> Simon (thinking aloud) >>>> >>>> ________________________________ >>>> From: SIMON WESTRIP >>>> To: Group for discussing encoding and content validation schemes for CIF2 >>>> >>>> Sent: Monday, 13 September, 2010 11:05:12 >>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other >>>>sub-topics. .. . >>>> >>>> Yes - I beleive that such a declaration should be mandatory for >>>>all non-UTF8 >>>> CIF2 files, >>>> and agree that a supporting checksum mechanism would be very useful to >>>> CIF2-aware >>>> programs. Until I've revisited the checksum scheme, I can not >>>>say that the >>>> checksum should be mandatory too. >>>> For example, if mandatory, does that mean it becomes impossible >>>>to create a >>>> non-UTF8 CIF without using >>>> CIF2-aware software? >>>> >>>> I need to review the discussions on checksums and indeed the >>>>various forms >>>> that such a declaration might take, >>>> but I do beleive in the principle that it should be mandatory for all >>>> 'stand-alone' non-UTF8 CIF2 files. >>>> If a CIF is packaged in a container, then it will be the job of non-CIF >>>> software to retreive it from the container >>>> and deliver it in its original form. So a non-UTF8 CIF packaged in a >>>> non-UTF8 container (or even a UTF8 container) >>>> should still carry its non-UTF8 declaration. >>>> >>>> Cheers >>>> >>>> Simon >>>> >>>> ________________________________ >>>> From: James Hester >>>> To: Group for discussing encoding and content validation schemes for CIF2 >>>> >>>> Sent: Monday, 13 September, 2010 6:24:42 >>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other >>>>sub-topics. 
.. . >>>> >>>> Hi Simon: the issue with such an encoding declaration is that it is >>>> not supported by generic text tools, and so would not be automatically >>> > inserted, updated or respected when creating, editing (ie open in one >>>> encoding, save in another) or transcoding a CIF2 file. This means it >>>> has no status beyond a hint that could cause as many problems as it >>>> solves. Such a declaration becomes more robust if accompanied by the >>>> checksum that John B suggested. The checksum gives some guarantee >>>> that the encoding has been checked by a CIF-aware program. >>>> >>>> If you are proposing that such a declaration and checksum be mandatory >>>> for all non-UTF8 CIF2 files (not only during transfer), I agree with >>>> you that this would be acceptable. >>>> >>>> >>>_______________________________________________ >>>cif2-encoding mailing list >>>cif2-encoding at iucr.org >>>http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> >> -- >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 > > >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> _______________________________________________ >> cif2-encoding mailing list >> cif2-encoding at iucr.org >> http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> > > > >-- >T +61 (02) 9717 9907 >F +61 (02) 9717 3145 >M +61 (04) 0249 4148 >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. 
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== _______________________________________________ cif2-encoding mailing list cif2-encoding at iucr.org http://scripts.iucr.org/mailman/listinfo/cif2-encoding -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100913/f16d8e5d/attachment-0001.html From John.Bollinger at STJUDE.ORG Mon Sep 13 19:52:26 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Mon, 13 Sep 2010 13:52:26 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> On Sunday, September 12, 2010 11:26 PM, James Hester wrote: [...] >To my mind, the encoding of plain CIF files remains an open issue. I >do not view the mechanisms for managing file encoding that are >provided by current OSs to be sufficiently robust, widespread or >consistent that we can rely on developers or text editors respecting >them [...]. I agree that the encoding of plain CIF files remains an open issue. I confess I find your concerns there somewhat vague, especially to the extent that they apply within the confines of a single machine. Do your concerns extend to that level? If so, can you provide an example or two of what you fear might go wrong in that context? As Herb recently wrote, "Multiple encodings are a fact of life when working with text." 
CIF2 looks like text, it feels like text, and despite some exotic spice, it tastes like text -- even in UTF-8 only form. We cannot pretend that we're dealing with anything other than text. We need to accept, therefore, that no matter what we do, authors and programmers will need to account for multiple encodings, one way or another. The format specification cannot relieve either group of that responsibility.

That doesn't necessarily mean, however, that CIF must follow the XML model of being self-defining with regard to text encoding. Given CIF's various uses, we gain little of practical value in this area by defining CIF2 as UTF-8 only, and perhaps equally little by defining required decorations for expressing random encodings. Moreover, the best reading of CIF1 is that it relies on the *local* text conventions, whatever they may be, which is quite a different thing than handling all text conventions that might conceivably be employed.

With that being the case, I don't think it needful for CIF2 in any given environment to endorse foreign encoding conventions other than UTF-8. CIF2 reasonably could endorse UTF-16 as well, though, as that cannot be confused with any ASCII-compatible encoding. Allowing UTF-16 would open up useful possibilities both for imgCIF and for future uses not yet conceived. Additionally, since CIF is text I still think it important for CIF2 to endorse the default text conventions of its operating environment. Could we agree on those three as allowed encodings?

Consider, given that combination of supported alternatives and no extra support from the spec, how might various parties deal with the unavoidable encoding issue. Here are some of the more reasonable alternatives I see:

1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:

   Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The responsibility to perform any needed transcoding is on the other party. This is just as it might be with UTF-8-only.

   Option b) in addition to supporting UTF-8 and/or UTF-16, support other encodings by allowing users to explicitly specify them as part of the submission/retrieval process. The processor / repository would either ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16]. This also is just as it might be with UTF-8 only.

2. Programs and Libraries:

   Option a) On input, detect encoding by checking first for UTF-16, assuming UTF-8 if not UTF-16, and falling back to default text conventions if a UTF-8 decoding error is encountered. On output, encode as directed by the user (among the two/three options), defaulting to the input encoding when that is available and feasible. These would be desirable behaviors even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, but they do exceed UTF-8-only requirements.

   Option b) Require input and produce output according to a fixed set of conventions (whether local text conventions or UTF-8/16). The program user is responsible for any needed transcoding. This would be sufficient for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those differ, however, in which text conventions would be assumed.

3. Users/Authors:

   3.1. Creating / editing CIFs
   No change from current practice is needed, but users might choose to store CIFs in UTF-8[/16] form. This is just as it would likely be under UTF-8 only.

   3.2. Transferring CIFs
   Unless an alternative agreement on encoding can be reached by some means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.

   3.3. Receiving CIFs
   The receiver may reasonably demand that the CIF be provided in UTF-8[/16] form. He should *expect* that form unless some alternative agreement is established. Any desired transcoding from UTF-8[/16] to an alternative encoding is the user's responsibility. Again, this is not significantly different from the UTF-8 only case.
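The input side of Option 2a above (check for UTF-16, otherwise assume UTF-8, fall back to the local default on a decoding error) can be sketched as follows. This is a hedged illustration, not a reference implementation: the function name `decode_cif` is hypothetical, UTF-16 is recognised here only by its byte-order mark (the message does not say how the check should be done), and a leading UTF-8 BOM is tolerated as an additional assumption.

```python
import codecs
import locale


def decode_cif(raw, local_encoding=None):
    """Decode CIF bytes following the detection order of Option 2a.

    Returns (text, encoding_used). 'local_encoding' stands in for the
    environment's default text convention; when omitted, Python's view of
    the locale default is used.
    """
    if local_encoding is None:
        local_encoding = locale.getpreferredencoding(False)

    # Step 1: UTF-16, identified by a byte-order mark (assumption of this sketch).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16"), "utf-16"

    # Step 2: assume UTF-8; "utf-8-sig" also accepts a leading UTF-8 BOM.
    try:
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Step 3: fall back to the local default text convention.
        return raw.decode(local_encoding), local_encoding


if __name__ == "__main__":
    samples = [
        "data_a\n_note 'caf\u00e9'\n".encode("utf-8"),
        codecs.BOM_UTF16_LE + "data_b\n".encode("utf-16-le"),
        "data_c\n_note 'caf\u00e9'\n".encode("latin-1"),
    ]
    for sample in samples:
        text, used = decode_cif(sample, local_encoding="latin-1")
        print(used, repr(text))
```

Note that the fallback in step 3 can still succeed with the wrong characters if the bytes happen to be valid in the local encoding but were written in a different one; the sketch detects only outright decoding failures.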
A driving force in many of those cases is the well-understood (especially here!) fact that different systems cannot be relied upon to share text conventions, thus leaving UTF-8[/16] as the only available general-purpose medium of exchange. At the same time, local conventions are not forbidden from use where they can be relied upon -- most notably, within the same computer. Even if end-users, as a group, do not appreciate those details, we can ensure via the spec that CIF2 implementers do. That's sufficient.

So, if pretty much all my expected behavior under UTF-8[/16]+local is the same as it would be under UTF-8-only, then why prefer the former? Because under UTF-8[/16]+local, all the behavior described is conformant to the spec, whereas under UTF-8 only, a significant proportion is not. If the standard adequately covers these behaviors then we can expect more uniform support. Moreover, this bears directly on community acceptance of the spec. If flouting the spec with respect to encoding becomes common, then the spec will have failed, at least in that area. Having failed in one area, it is more likely to fail in others.

Regards,

John
--
John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer

From simonwestrip at btinternet.com Tue Sep 14 13:19:45 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Tue, 14 Sep 2010 12:19:45 +0000 (GMT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <930138.36485.qm@web87008.mail.ird.yahoo.com> I sense some common ground here with my previous post. The UTF8/16 pair could possibly be extended to any unicode encoding that is unambiguously/inherently identifiable? The 'local' encodings then encompass everything else? However, I think we've yet to agree that anything but UTF8 is to be allowed at all. We have a draft spec that stipulates UTF8, but I infer from this thread that there is scope to relax that restriction. The views seem to range from at least 'leaving the door open' in recognition of the variety of encodings available, to advocating that the encoding should not be part of the specification at all, and it will be down to developers to accommodate/influence user practice. I'm in favour of a default encoding or maybe any encoding that is inherently identifiable, and providing a means to declare other encodings (however untrustworthy the declaration may be, it would at least be available to conscientious users/developers), all documented in the spec. Please forgive me if this summary is off the mark; my conclusion is that there's a willingness to accommodate multiple encodings in this (albeit very small) group. Given that we are starting from the position of having a single encoding (agreed upon after much earlier debate), I cannot see us performing a complete U-turn to allow any (potentially unrecognizable) encoding as in CIF1, i.e. 
without some specification of a canonical encoding or mechanisms to identify/declare the encoding. On the other hand, I hope to see a revised spec that isnt UTF8 only. To get to the point - is there any hope of reaching a compromise? Cheers Simon

From John.Bollinger at STJUDE.ORG Tue Sep 14 15:46:10 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 14 Sep 2010 09:46:10 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .
In-Reply-To: <930138.36485.qm@web87008.mail.ird.yahoo.com> References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC2@SJMEMXMBS11.stjude.sjcrh.local>

Simon,

On Tuesday, September 14, 2010 7:20 AM, SIMON WESTRIP wrote:

>I sense some common ground here with my previous post.

I hope so. My proposal is intended as a compromise position, and I hope it will give all the participants in this discussion enough of what they want that we can finally come to an agreement.

>The UTF8/16 pair could possibly be extended to any unicode encoding that is unambiguously/inherently identifiable?

Did you have any particular other encodings you would put in that category? The only one(s) I think would qualify are UTF-32 variants, and, to the extent it is distinct from UTF-16, perhaps UTF-16LE. If we don't tag CIFs with encoding information (and that's not part of my proposal) then I don't think it safe to deem encodings that we do not explicitly enumerate as "inherently identifiable". My proposal intentionally minimizes the list of allowed encodings (even inclusion of UTF-16 is left open to debate) because (i) having more than one allowed encoding already requires the UTF-8 only side to yield some ground, and (ii) having fewer alternatives makes for much simpler autodetection.

>The 'local' encodings then encompass everything else?

Sort of. "local" is environment-specific.
It is what the system's text editors read and (especially) write by default, what the local Fortran I/O library expects of a 'formatted' file, what a Java InputStreamReader in that environment handles correctly when no encoding is explicitly specified to it, etc.. >However, I think we've yet to agree that anything but UTF8 is to be allowed at all. We have a draft spec that stipulates UTF8, >but I infer from this thread that there is scope to relax that restriction. Um, yes. I think perhaps we've snuck one past you: this entire list (Cif2-encoding) was split off from the ddlm-group list for the purpose of discussing that topic, as there strong opinions on both sides. Brian administratively subscribed several of the ddlm-group members to this list when he created it, including you. >The views seem to range from at least 'leaving the door open' >in recognition of the variety of encodings available, to advocating that the encoding should not be part of the specification at all, and it will be down to developers to accommodate/influence user practice. I think a better characterization of the views on the main CIF representation is that they range from 'no encoding but UTF-8 should be permitted' to 'all text conventions must be supported'. We have also discussed a side issue or two, such as what to do about embedding CIF text in other files, but those seem not to be very contentious. A central pillar of the multiple conventions camp's arguments is CIF1's position that CIFs are text files complying with local text conventions. Many CIF1 users and programmers have relied on that, and therefore we would like to avid throwing it out the window. The essential position of the UTF-8-only camp is that CIF2 must be inherently resistant to misinterpretation, especially character encoding mismatches. 
> I'm in favour of a default encoding or maybe any encoding that is inherently identifiable, and providing a means to declare other encodings (however untrustworthy the declaration may be, it would at least be available to conscientious users/developers), all documented in the spec. My proposal comes close to making UTF-8 a default encoding, though if UTF-16 is allowed as well then it would be a viable candidate for that spot. Inasmuch as these cannot be confused in a CIF context, I don't see the availability of both as a problem. My proposal intentionally avoids requiring any kind of tagging, as (i) Proponents of the UTF-8-only position have been relatively unreceptive to tagging as a solution, mainly citing concerns about reliability of encoding tags; (ii) Avoiding tagging avoids giving any impression that CIF processors are expected to handle non-native encodings other than UTF-8[/16]; (iii) Leaving out tags keeps it simpler. There is room for some kind of tagging scheme as a supplementary convention or standard, and with input from James I have advanced 'Scheme B' for this purpose. You will find discussion of Scheme B in the list archives, especially among the earliest messages on this (cif2-encoding) list. >Please forgive me if this summary is off the mark; my conclusion is that there's a willingness to accommodate multiple encodings >in this (albeit very small) group. Given that we are starting from the position of having a single encoding (agreed upon after much earlier debate), I cannot see us performing a complete U-turn to allow any (potentially unrecognizable) encoding as in CIF1, i.e. without some specification of a canonical encoding or mechanisms to identify/declare the encoding. On the other hand, I hope to see >a revised spec that isnt UTF8 only. Part of my thesis behind the present compromise proposal is that in the context of any particular computing environment, CIF1 in fact *does not* support every possible encoding.
It supports *only* the local default text conventions. CIF1 allows all encodings only in the sense that for any given encoding there may be some computing environment, somewhere, for which that encoding is the default -- in that environment, CIF1 supports that encoding. UTF-8-only would be a complete reversal of CIF1 in the sense that UTF-8 is generally not the default convention in current environments. Thus, requiring UTF-8 would demand that CIF2 files comply with NON-native conventions instead of with native ones. Under ASCII-compatible default conventions, the distinction appears only when non-ASCII characters appear in a CIF, but I have come to view that as more of a detriment than an advantage: it would provide fertile ground for bugs and mistakes. Instead of such a complete reversal, then, my compromise proposal basically adds UTF-8 and maybe UTF-16 as allowed encodings, and explicitly specifies that the only other supported encoding is the local default, whatever that happens to be. This acknowledges that CIF2 users will have more exposure to text encoding concerns than CIF1 users do. Herb argues that that is inevitable, and I agree. >To get to the point - is there any hope of reaching a compromise? Scheme B was an attempt to build a compromise, but it doesn't look likely to succeed in that capacity. I think the proposal to which you just responded is the best hope for a compromise that so far has been presented. If that or something like it is not accepted then I'm having trouble seeing where else to turn. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Tue Sep 14 15:46:43 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Tue, 14 Sep 2010 10:46:43 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . 
In-Reply-To: <930138.36485.qm@web87008.mail.ird.yahoo.com> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> Message-ID: Dear Colleagues, To avoid any misunderstandings, rather than worrying about how we got to where we are, let us each just state a clear position. Here is mine: I favor CIF2 being stated in terms of UTF-8 for clarity, but not specifying any particular _mandatory_ encoding of a CIF2 file as long as there is a clearly agreed mechanism between the creator and consumer of a given CIF2 file as to how to faithfully transform the file between creator's and the consumer's encodings. I favor UTF-8 being the default encoding that any CIF2 creator should feel free to use without having to establish any prior agreement with consumers, and that all consumers should try to make arrangements to be able to read, either directly or via some conversion utility or service. If the consumers don't make such arrangements then there may be CIF2 files that they will not be able to read. If a producer creates a CIF2 in any encoding other than UTF8 then there may be consumers who have difficulty reading that CIF2. I favor the IUCr taking responsibility for collecting and disseminating information on particularly useful ways to go to and from UTF8 and/or other popular encodings. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Tue, 14 Sep 2010, SIMON WESTRIP wrote: > I sense some common ground here with my previous post. > > The UTF8/16 pair could possibly be extended to any unicode encoding that is > unambiguously/inherently identifiable? 
> The 'local' encodings then encompass everything else? > > However, I think we've yet to agree that anything but UTF8 is to be allowed > at all. We have a draft spec that stipulates UTF8, > but I infer from this thread that there is scope to relax that restriction. > The views seem to range from at least 'leaving the door open' > in recognition of the variety of encodings available, to advocating that > the encoding should not be part of the specification at all, and it will be > down to developers to accommodate/influence user practice. I'm in favour of > a default encoding or maybe any encoding that is inherently identifiable, > and providing a means to declare other encodings (however untrustworthy the > declaration may be, it would at least be available to conscientious > users/developers), all documented in the spec. > > Please forgive me if this summary is off the mark; my conclusion is that > there's a willingness to accommodate multiple encodings > in this (albeit very small) group. Given that we are starting from the > position of having a single encoding (agreed upon after much earlier > debate), I cannot see us performing a complete U-turn to allow any > (potentially unrecognizable) encoding as in CIF1, i.e. without some > specification of a canonical encoding or mechanisms to identify/declare the > encoding. On the other hand, I hope to see > a revised spec that isnt UTF8 only. > > To get to the point - is there any hope of reaching a compromise? > > Cheers > > Simon > > > ____________________________________________________________________________ > From: "Bollinger, John C" > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Monday, 13 September, 2010 19:52:26 > Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . > > > On Sunday, September 12, 2010 11:26 PM, James Hester wrote: > [...] > >To my mind, the encoding of plain CIF files remains an open issue.
I > >do not view the mechanisms for managing file encoding that are > >provided by current OSs to be sufficiently robust, widespread or > >consistent that we can rely on developers or text editors respecting > >them [...]. > > I agree that the encoding of plain CIF files remains an open issue. > > I confess I find your concerns there somewhat vague, especially to the > extent that they apply within the confines of a single machine. Do your > concerns extend to that level? If so, can you provide an example or two of > what you fear might go wrong in that context? > > As Herb recently wrote, "Multiple encodings are a fact of life when working > with text." CIF2 looks like text, it feels like text, and despite some > exotic spice, it tastes like text -- even in UTF-8 only form. We cannot > pretend that we're dealing with anything other than text. We need to > accept, therefore, that no matter what we do, authors and programmers will > need to account for multiple encodings, one way or another. The format > specification cannot relieve either group of that responsibility. > > That doesn't necessarily mean, however, that CIF must follow the XML model > of being self-defining with regard to text encoding. Given CIF's various > uses, we gain little of practical value in this area by defining CIF2 as > UTF-8 only, and perhaps equally little by defining required decorations for > expressing random encodings. Moreover, the best reading of CIF1 is that it > relies on the *local* text conventions, whatever they may be, which is quite > a different thing than handling all text conventions that might conceivably > be employed. > > With that being the case, I don't think it needful for CIF2 in any given > environment to endorse foreign encoding conventions other than UTF-8. CIF2 > reasonably could endorse UTF-16 as well, though, as that cannot be confused > with any ASCII-compatible encoding.
Allowing UTF-16 would open up useful > possibilities both for imgCIF and for future uses not yet conceived. > Additionally, since CIF is text I still think it important for CIF2 to > endorse the default text conventions of its operating environment. > > Could we agree on those three as allowed encodings? Consider, given that > combination of supported alternatives and no extra support from the spec, > how might various parties deal with the unavoidable encoding issue. Here > are some of the more reasonable alternatives I see: > > 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB: > > Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The > responsibility to perform any needed transcoding is on the other party. > This is just as it might be with UTF-8-only. > > Option b) in addition to supporting UTF-8 and/or UTF-16, support > other encodings by allowing users to explicitly specify them as part of the > submission/retrieval process. The processor / repository would either > ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16]. > This also is just as it might be with UTF-8 only. > > 2. Programs and Libraries: > > Option a) On input, detect encoding by checking first for UTF-16, > assuming UTF-8 if not UTF-16, and falling back to default text conventions > if a UTF-8 decoding error is encountered. On output, encode as directed by > the user (among the two/three options), defaulting to the input encoding > when that is available and feasible. These would be desirable behaviors > even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, > but they do exceed UTF-8-only requirements. > > Option b) Require input and produce output according to a fixed set > of conventions (whether local text conventions or UTF-8/16). The program > user is responsible for any needed transcoding.
This would be sufficient > for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those > differ, however, in which text conventions would be assumed. > > 3. Users/Authors: > 3.1. Creating / editing CIFs > No change from current practice is needed, but users might choose to > store CIFs in UTF-8[/16] form. This is just as it would likely be under > UTF-8 only. > > 3.2. Transferring CIFs > Unless an alternative agreement on encoding can be reached by some > means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This > differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed. > > 3.3. Receiving CIFs > The receiver may reasonably demand that the CIF be provided in > UTF-8[/16] form. He should *expect* that form unless some alternative > agreement is established. Any desired transcoding from UTF-8[/16] to an > alternative encoding is the user's responsibility. Again, this is not > significantly different from the UTF-8 only case. > > > A driving force in many of those cases is the well-understood (especially > here!) fact that different systems cannot be relied upon to share text > conventions, thus leaving UTF-8[/16] as the only available general-purpose > medium of exchange. At the same time, local conventions are not forbidden > from use where they can be relied upon -- most notably, within the same > computer. Even if end-users, as a group, do not appreciate those details, > we can ensure via the spec that CIF2 implementers do. That's sufficient. > > So, if pretty much all my expected behavior under UTF-8[/16]+local is the > same as it would be under UTF-8-only, then why prefer the former? Because > under UTF-8[/16]+local, all the behavior described is conformant to the > spec, whereas under UTF-8 only, a significant proportion is not. If the > standard adequately covers these behaviors then we can expect more uniform > support.
Moreover, this bears directly on community acceptance of the > spec. If flouting the spec with respect to encoding becomes common, then > the spec will have failed, at least in that area. Having failed in one > area, it is more likely to fail in others. > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > From yaya at bernstein-plus-sons.com Tue Sep 14 15:58:39 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Tue, 14 Sep 2010 10:58:39 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> Message-ID: One, hopefully relevant, aside -- ASCII files are not as unambiguous as one might think. Depending on what localization one has on one's computer, the code point 0x5c (one of the characters in the first 127) will be shown as a reverse solidus, a yen currency symbol or a won currency symbol. This is a holdover from the days of national variants of the ISO character set, and shows no signs of going away any time soon. This is _not_ the only such case, but it is one that impacts most programming languages, including dREL, and existing CIF files, including the PDB's mmCIF files. ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Tue, 14 Sep 2010, Herbert J.
Bernstein wrote: > Dear Colleagues, > > To avoid any misunderstandings, rather than worrying about how > we got to where we are, let us each just state a clear position. > Here is mine: > > I favor CIF2 being stated in terms of UTF-8 for clarity, but > not specifying any particular _mandatory_ encoding of a CIF2 file > as long as there is a clearly agreed mechanism between the > creator and consumer of a given CIF2 file as to how to faithfully > transform the file between creator's and the consumer's encodings. > > I favor UTF-8 being the default encoding that any CIF2 creator > should feel free to use without having to establish any prior > agreement with consumers, and that all consumers should try > to make arrangements to be able to read, either directly or > via some conversion utility or service. If the consumers don't > make such arrangements then there may be CIF2 files that they > will not be able to read. If a producer creates a CIF2 in any > encoding other than UTF8 then there may be consumers who have > difficulty reading that CIF2. > > I favor the IUCr taking responsibility for collecting and > disseminating information on particularly useful ways to go > to and from UTF8 and/or other popular encodings. > > Regards, > Herbert
From simonwestrip at btinternet.com Tue Sep 14 17:02:24 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Tue, 14 Sep 2010 16:02:24 +0000 (GMT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC2@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC2@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <452384.15312.qm@web87013.mail.ird.yahoo.com> Thank you John for your response. I will state my position in due course (hopefully with more clarity than I usually employ!) but in the meantime, I'll briefly answer your question regarding extending the UTF8/16 set: Yes, I was thinking of the existing 'UTF family', while also allowing extension in the future to any encodings that fall within the same class of 'inherently identifiable' encodings.
By 'inherently identifiable' I mean encodings that are identifiable by e.g. BOM; but as you explain, this is not appropriate for your proposal. Cheers Simon
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100914/30248db0/attachment.html From John.Bollinger at STJUDE.ORG Tue Sep 14 17:10:09 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 14 Sep 2010 11:10:09 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .
In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC3@SJMEMXMBS11.stjude.sjcrh.local> On Tuesday, September 14, 2010 9:47 AM, Herbert J. Bernstein wrote: > To avoid any misunderstandings, rather than worrying about how we got to where we are, let us each just state a clear position. Very well: I favor restricting the scope of the CIF2 specification to the file format, excluding any explicit requirements of programs, users, or other entities. *Non-normative* commentary on the meaning, impact, and use of the normative format definition is welcome, however. In that light, I favor CIF2 defining binary "CIFs" formed by encoding the underlying Unicode text according to local text conventions, as in CIF1, and those formed by encoding the underlying Unicode text according to UTF-8. CIFs of the former type are "text files" in their context; those of the latter type might also be text files under some circumstances. If the Unicode text consists exclusively of ASCII characters then these two options are indistinguishable in many contexts. I am open to CIF2 additionally defining binary CIFs formed by encoding the underlying Unicode text according to specific alternative schemes. In particular, I would agree to UTF-16. My support for other specific alternatives would be granted or withheld on a case-by-case basis. I disfavor CIF2 defining binary CIFs formed in other ways, or leaving the definition of a "CIF" open-ended, but I favor express recognition of the possibility of alternative serializations of CIF-conformant Unicode text. In that vein, I favor creating a supplementary specification for CIF storage and exchange that addresses the multitude of possible encodings that CIF2 support for local defaults would permit in various environments.
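As a small illustration of the remark that the two options are indistinguishable for ASCII-only text, here is a Python sketch. The sample CIF fragments, the function name and the list of "local" encodings are assumptions chosen for illustration only; nothing here is proposed in the thread.

```python
# Illustrative sketch: compare the bytes produced by UTF-8 and by two
# ASCII-compatible "local" encodings (the choice of encodings is an assumption).
ENCODINGS = ("utf-8", "latin-1", "cp1252")

# Hypothetical sample content, not taken from any real CIF.
ascii_text = "data_example\n_cell_length_a 5.431\n"
non_ascii_text = "_publ_author_name 'M\u00fcller, A.'\n"  # contains u-umlaut


def encodings_agree(text, encodings):
    """Return True if `text` encodes to identical bytes under every encoding."""
    encoded_forms = {text.encode(name) for name in encodings}
    return len(encoded_forms) == 1


print(encodings_agree(ascii_text, ENCODINGS))      # ASCII-only content
print(encodings_agree(non_ascii_text, ENCODINGS))  # one non-ASCII character
```

Under these assumptions the byte streams differ only once a non-ASCII character appears, which is the situation in which a reader has to know which encoding was used.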
My use of the term "Unicode text" is meant to emphasize that the vast majority of the CIF2 spec is independent of any encoding. I think the latest (May) draft of the spec for the most part uses similar terminology, and I favor that form of description over one based on UTF-8 or some other specific encoding as a placeholder or reference. It is my expectation that a result of the above provisions would be establishment of UTF-8 as a de facto default encoding for CIF2 CIFs. Best, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From simonwestrip at btinternet.com Tue Sep 14 22:06:42 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Tue, 14 Sep 2010 14:06:42 -0700 (PDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> Message-ID: <411401.96839.qm@web87006.mail.ird.yahoo.com> Dear all My position regarding the encoding issue is based on the fundamental belief that CIF is 'text' - always has been. It can be read by humans, and even has 'dictionaries' that define its content and can stand alone as human-readable documents (indeed, DDLm dictionaries will probably provide a wealth of human-readable information). So the battle is against the machines - whereas pen and paper can convey text unambiguously, computers have to translate (encode/decode) that text for us humans to read... 
OK - analogy abandoned - no more cliches - my thoughts on CIF2 encoding can be summarized in one initial sequence: BOM (if required to identify the encoding) + declaration that CIF2 + declaration of encoding (if not inherently identifiable) This is based on a specification that allows any text encoding, requires the declaration of the encoding if it is not unambiguously identifiable without such a declaration, and defines a default encoding that should be assumed in the absence of any pointers to the contrary and that should be considered as the base 'language' that all CIF readers should understand. Though brief and lacking in specifics, I hope this sums up my current thinking with respect to a CIF2 'standard' (I recognize that the reality of dealing with multiple encodings will involve flexibility to accommodate user practice, whatever is specified). Cheers Simon PS Within this framework, I would allow the encoding of CIF data values to be 'switched' according to dictionary definitions... :-) ________________________________ From: Herbert J. Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Tuesday, 14 September, 2010 15:46:43 Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . Dear Colleagues, To avoid any misunderstandings, rather than worrying about how we got to where we are, let us each just state a clear position. Here is mine: I favor CIF2 being stated in terms of UTF-8 for clarity, but not specifying any particular _mandatory_ encoding of a CIF2 file as long as there is a clearly agreed mechanism between the creator and consumer of a given CIF2 file as to how to faithfully transform the file between creator's and the consumer's encodings. 
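A minimal sketch, in Python, of what a faithful transformation between a creator's and a consumer's encoding could look like once both sides have agreed on the encodings involved. The function name `transcode` and the sample bytes are hypothetical. The only point being illustrated is that strict decoding and encoding fail loudly instead of silently substituting characters.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Convert `data` from the agreed source encoding to the target encoding.

    Both steps use Python's default 'strict' error handling, so bytes that are
    invalid in the source encoding raise UnicodeDecodeError, and characters
    that the target encoding cannot represent raise UnicodeEncodeError.
    """
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


# Hypothetical example: a name stored as Latin-1 by the creator.
latin1_bytes = "M\u00fcller".encode("latin-1")
utf8_bytes = transcode(latin1_bytes, "latin-1")
print(utf8_bytes)

# Converting to an encoding that lacks the character is refused, not approximated.
try:
    transcode(utf8_bytes, "utf-8", "ascii")
except UnicodeEncodeError as err:
    print("cannot represent:", err.object[err.start:err.end])
```

The sketch presupposes that the source encoding is known; it does nothing to discover it.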
I favor UTF-8 being the default encoding that any CIF2 creator should feel free to use without having to establish any prior agreement with consumers, and that all consumers should try to make arrangements to be able to read, either directly or via some conversion utility or service. If the consumers don't make such arrangements then there may be CIF2 files that they will not be able to read. If a producer creates a CIF2 in any encoding other than UTF8 then there may be consumers who have difficulty reading that CIF2. I favor the IUCr taking responsibility for collecting and disseminating information on particularly useful ways to go to and from UTF8 and/or other popular encodings. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Tue, 14 Sep 2010, SIMON WESTRIP wrote: > I sense some common ground here with my previous post. > > The UTF8/16 pair could possibly be extended to any unicode encoding that is > unambiguously/inherently identifiable? > The 'local' encodings then encompass everything else? > > However, I think we've yet to agree that anything but UTF8 is to be allowed > at all. We have a draft spec that stipulates UTF8, > but I infer from this thread that there is scope to relax that restriction. > The views seem to range from at least 'leaving the door open' > in recognition of the variety of encodings available, to advocating that > the encoding should not be part of the specification at all, and it will be > down to developers to accommodate/influence user practice. 
I'm in favour of > a default encoding or maybe any encoding that is inherently identifiable, > and providing a means to declare other encodings (however untrustworthy the > declaration may be, it would at least be available to conscientious > users/developers), all documented in the spec. > > Please forgive me if this summary is off the mark; my conclusion is that > there's a willingness to accommodate multiple encodings > in this (albeit very small) group. Given that we are starting from the > position of having a single encoding (agreed upon after much earlier > debate), I cannot see us performing a complete U-turn to allow any > (potentially unrecognizable) encoding as in CIF1, i.e. without some > specification of a canonical encoding or mechanisms to identify/declare the > encoding. On the other hand, I hope to see > a revised spec that isnt UTF8 only. > > To get to the point - is there any hope of reaching a compromise? > > Cheers > > Simon > > > ____________________________________________________________________________ > From: "Bollinger, John C" > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Monday, 13 September, 2010 19:52:26 > Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . > > > On Sunday, September 12, 2010 11:26 PM, James Hester wrote: > [...] > >To my mind, the encoding of plain CIF files remains an open issue. I > >do not view the mechanisms for managing file encoding that are > >provided by current OSs to be sufficiently robust, widespread or > >consistent that we can rely on developers or text editors respecting > >them [...]. > > I agree that the encoding of plain CIF files remains an open issue. > > I confess I find your concerns there somewhat vague, especially to the > extent that they apply within the confines of a single machine. Do your > concerns extend to that level? If so, can you provide an example or two of > what you fear might go wrong in that context? 
> > As Herb recently wrote, "Multiple encodings are a fact of life when working > with text." CIF2 looks like text, it feels like text, and despite some > exotic spice, it tastes like text -- even in UTF-8 only form. We cannot > pretend that we're dealing with anything other than text. We need to > accept, therefore, that no matter what we do, authors and programmers will > need to account for multiple encodings, one way or another. The format > specification cannot relieve either group of that responsibility. > > That doesn't necessarily mean, however, that CIF must follow the XML model > of being self-defining with regard to text encoding. Given CIF's various > uses, we gain little of practical value in this area by defining CIF2 as > UTF-8 only, and perhaps equally little by defining required decorations for > expressing random encodings. Moreover, the best reading of CIF1 is that it > relies on the *local* text conventions, whatever they may be, which is quite > a different thing than handling all text conventions that might conceivably > be employed. > > With that being the case, I don't think it needful for CIF2 in any given > environment to endorse foreign encoding conventions other than UTF-8. CIF2 > reasonably could endorse UTF-16 as well, though, as that cannot be confused > with any ASCII-compatible encoding. Allowing UTF-16 would open up useful > possibilities both for imgCIF and for future uses not yet conceived. > Additionally, since CIF is text I still think it important for CIF2 to > endorse the default text conventions of its operating environment. > > Could we agree on those three as allowed encodings? Consider, given that > combination of supported alternatives and no extra support from the spec, > how might various parties deal with the unavoidable encoding issue. Here > are some of the more reasonable alternatives I see: > > 1. 
Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB: > > Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The > responsibility to perform any needed transcoding is on the other party. > This is just as it might be with UTF-8-only. > > Option b) in addition to supporting UTF-8 and/or UTF-16, support > other encodings by allowing users to explicitly specify them as part of the > submission/retrieval process. The processor / repository would either > ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16]. > This also is just as it might be with UTF-8 only. > > 2. Programs and Libraries: > > Option a) On input, detect encoding by checking first for UTF-16, > assuming UTF-8 if not UTF-16, and falling back to default text conventions > if a UTF-8 decoding error is encountered. On output, encode as directed by > the user (among the two/three options), defaulting to the input encoding > when that is available and feasible. These would be desirable behaviors > even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, > but they do exceed UTF-8-only requirements. > > Option b) Require input and produce output according to a fixed set > of conventions (whether local text conventions or UTF-8/16). The program > user is responsible for any needed transcoding. This would be sufficient > for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those > differ, however, in which text conventions would be assumed. > > 3. Users/Authors: > 3.1. Creating / editing CIFs > No change from current practice is needed, but users might choose to > store CIFs in UTF-8[/16] form. This is just as it would likely be under > UTF-8 only. > > 3.2. Transferring CIFs > Unless an alternative agreement on encoding can be reached by some > means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This > differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed. > > 3.3. 
Receiving CIFs > The receiver may reasonably demand that the CIF be provided in > UTF-8[/16] form. He should *expect* that form unless some alternative > agreement is established. Any desired transcoding from UTF-8[/16] to an > alternative encoding is the user's responsibility. Again, this is not > significantly different from the UTF-8 only case. > > > A driving force in many of those cases is the well-understood (especially > here!) fact that different systems cannot be relied upon to share text > conventions, thus leaving UTF-8[/16] as the only available general-purpose > medium of exchange. At the same time, local conventions are not forbidden > from use where they can be relied upon -- most notably, within the same > computer. Even if end-users, as a group, do not appreciate those details, > we can ensure via the spec that CIF2 implementers do. That's sufficient. > > So, if pretty much all my expected behavior under UTF-8[/16]+local is the > same as it would be under UTF-8-only, then why prefer the former? Because > under UTF-8[/16]+local, all the behavior described is conformant to the > spec, whereas under UTF-8 only, a significant proportion is not. If the > standard adequately covers these behaviors then we can expect more uniform > support. Moreover, this bears directly on community acceptance of the > spec. If flouting the spec with respect to encoding becomes common, then > the spec will have failed, at least in that area. Having failed in one > area, it is more likely to fail in others. > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100914/39ca6506/attachment-0001.html From bm at iucr.org Wed Sep 15 13:39:27 2010 From: bm at iucr.org (Brian McMahon) Date: Wed, 15 Sep 2010 13:39:27 +0100 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> Message-ID: <20100915123927.GA26246@emerald.iucr.org> I have said little or nothing on this list so far, because I'm not sure that I can add anything that's of concrete use. I've read the many contributions, all of them carefully thought through, and I still see both sides (actually, all sides) of the arguments. I am disinterested in the eventual outcome (but not "uninterested"). But, whatever the outcome, the IUCr will undoubtedly receive files *intended* by the authors as CIF submissions, that come in a variety of character-set encodings. For the most part, we will want to accept these without asking the author what the encoding was, not least because the typical author will have no idea (and increasingly, our typical author will struggle to understand the questions we are posing since English is not his or her native language - or perhaps we will struggle to understand the reply). So my concerns are: (1) how easily can we determine the correct encoding with which the file was generated; (2) how easily can we convert it into our canonical encoding(s) for in-house production, archiving and delivery? First a few comments on that "canonical encoding(s)". Simon and I have both been happy enough to consider UTF-8 as a lingua franca, since we perceive it as a reasonably widespread vehicle for carrying a large (multilingual) character set, and that is widely supported by many generic text processors and platforms. 
However, many of our existing CIF applications may choke on a UTF-8 file, and we may need to create working formats that are pure ASCII. I would also prefer to retain a single archival version of a CIF (well, ideally several identical copies for redundancy, but nonetheless a single *version*), from which alternative encodings that we choose to support for delivery from the archive can be generated on the fly. So, really, the desire would be to have standalone applications that can convert between character encodings on the fly. Does anyone know of the general availability of such tools? The more, reliable, conversions that can be made, the more relaxed we are about accepting multiple input encodings. I have to say that a very quick Google search hasn't yet thrown up much encouragement here. Now, back to (1). In similar vein, do you know of any standalone utilities that help in determining a text-file character encoding? [I'm happy to be educated, ideally off-list, in whether Content-Encoding negotiation in web forms can help here, since many of our CIF submissions come by that route, but I'm more interested in the general question of how you determine the encoding of a text file that you just happen to find sitting on the filesystem.] One utility we use heavily in the submission system is "file" (http://freshmeat.net/projects/file - we currently use version 4.26 with an augmented and slightly modified magic file). This is rather quiet about different character encodings, though I notice the magic file distributed with the more recent version 5.04 does have a "Unicode" section, namely:

#------------------------------------------------------------------------------
# $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $
# Unicode: BOM prefixed text files - Adrian Havill
# GRR: These types should be recognised in file_ascmagic so these
# encodings can be treated by text patterns.
# Missing types are already dealt with internally.
#
0 string +/v8 Unicode text, UTF-7
0 string +/v9 Unicode text, UTF-7
0 string +/v+ Unicode text, UTF-7
0 string +/v/ Unicode text, UTF-7
0 string \335\163\146\163 Unicode text, UTF-8-EBCDIC
0 string \376\377\000\000 Unicode text, UTF-32, big-endian
0 string \377\376\000\000 Unicode text, UTF-32, little-endian
0 string \016\376\377 Unicode text, SCSU (Standard Compression Scheme for Unicode)

Interestingly, the "animation" module of this new magic file conflicts with other possible UTF encodings:

# MPA, M1A
# updated by Joerg Jenderek
# GRR the original test are too common for many DOS files, so test 32 <= kbits < = 448
# GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE)
# FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries

And, by the way, the "augmented" magic file we use (the one distributed as part of the KDE desktop distribution) already includes this section:

# chemical/x-cif 50
0 string #\#CIF_1.1
>10 byte 9 chemical/x-cif
>10 byte 10 chemical/x-cif
>10 byte 13 chemical/x-cif

It seems to me that without some reasonably reliable discriminator, John's endorsement of support for "local" encodings will allow files to leak out into the wider world where they can't at all easily be handled or even properly identified. (Though, as many have argued persuasively, "forbidding" them is not going to prevent such files from being created, and possibly even used fruitfully within local environments.) Remember that many CIFs will come to us in the end after passage across many heterogeneous systems. I referred in a previous post to my own daily working environment - Solaris, Linux and Windows systems linked by a variety of X servers, X emulators, NFS and SMB cross-mounted filesystems, clipboards communicating with diverse applications and OSes running different default locales... [Incidentally, hasn't SMB now been superseded by "CIFS" !] Perhaps I'm just perverse; but I doubt that I'm quite unique.
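For the question of how one might guess the encoding of a file found on the filesystem, here is a rough Python sketch of the detection order John described for programs and libraries (UTF-16 first, then UTF-8, then the local default). It is an assumption-laden illustration, not a standard tool: the function name is hypothetical, UTF-16 and UTF-32 are recognised only through a byte-order mark taken from Python's `codecs` module, and the final fallback is just the platform's preferred encoding, which says nothing about where the file actually came from.

```python
import codecs
import locale


def guess_cif_encoding(data: bytes, local_default: str = "") -> str:
    """Guess a codec name for `data` using BOMs, then a strict UTF-8 trial.

    The returned name is a Python codec name. A result equal to the local
    default is only a fallback guess, not a positive identification.
    """
    # UTF-32 must be tested before UTF-16: the UTF-32 little-endian BOM
    # begins with the same two bytes as the UTF-16 little-endian BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        data.decode("utf-8")  # strict: any invalid byte sequence raises
        return "utf-8"
    except UnicodeDecodeError:
        return local_default or locale.getpreferredencoding(False)


print(guess_cif_encoding("#\\#CIF_2.0\n".encode("utf-16")))
print(guess_cif_encoding(b"M\xc3\xbcller"))
print(guess_cif_encoding(b"M\xfcller", local_default="latin-1"))
```

Note that a pure-ASCII file is reported as UTF-8 by this sketch, and that text in a local 8-bit encoding can occasionally happen to be valid UTF-8, so the approach reduces but does not remove the risk of misidentification.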
We'll also see files shuttled between co-authors with different languages, locales, OSes, and exchanged via email, ftp, USB stick etc. "Corruptions" will inevitably be introduced in these interchanges - sometimes subtle ones. For example, outside the CIF world altogether, we see Greek characters change their identity when we run some files through a PDF -> PostScript -> PDF cycle (all using software from the same software house, Adobe). The reason has to do with differences in Windows and Mac encodings, and the failure of the Acrobat software to track and maintain the character mappings through such a cycle. Well, I'll stop here, because in spite of my best intentions I don't think I'm moving the debate along very much, and I apologise if everything here has already been so obvious as not to need saying. I'll defer further comment until I've learned if there are already standard text-encoding identifiers and transcoders. Regards Brian _________________________________________________________________________ Brian McMahon tel: +44 1244 342878 Research and Development Officer fax: +44 1244 314888 International Union of Crystallography e-mail: bm at iucr.org 5 Abbey Square, Chester CH1 2HU, England On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote: > One, hopefully relevant, aside -- ascii files are not as > unambiguous as one might think. Depending on what localization > one has on one's computer, the code point 0x5c (one of the > characters in the first 127) will be shown as a reverse > solidus, a yen currency symbol or a won currency symbol. This > is a holdover from the days of national variants of the ISO > character set, and shows no signs of going away any time soon. > > This is _not_ the only such case, but it is one that impacts > most programming languages, including dREL, and existing CIF > files, including the PDB's mmCIF files. > ===================================================== > Herbert J.
Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Tue, 14 Sep 2010, Herbert J. Bernstein wrote: > >> Dear Colleagues, >> >> To avoid any misunderstandings, rather than worrying about how >> we got to where we are, let us each just state a clear position. >> Here is mine: >> >> I favor CIF2 being stated in terms of UTF-8 for clarity, but >> not specifying any particular _mandatory_ encoding of a CIF2 file >> as long as there is a clearly agreed mechanism between the >> creator and consumer of a given CIF2 file as to how to faithfully >> transform the file between creator's and the consumer's encodings. >> >> I favor UTF-8 being the default encoding that any CIF2 creator >> should feel free to use without having to establish any prior >> agreement with consumers, and that all consumers should try >> to make arrangements to be able to read, either directly or >> via some conversion utility or service. If the consumers don't >> make such arrangements then there may be CIF2 files that they >> will not be able to read. If a producer creates a CIF2 in any >> encoding other than UTF8 then there may be consumers who have >> difficulty reading that CIF2. >> >> I favor the IUCr taking responsibility for collecting and >> disseminating information on particularly useful ways to go >> to and from UTF8 and/or other popular encodings. >> >> Regards, >> Herbert >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> >> On Tue, 14 Sep 2010, SIMON WESTRIP wrote: >> >>> I sense some common ground here with my previous post. 
>>> >>> The UTF8/16 pair could possibly be extended to any unicode encoding that >>> is >>> unambiguously/inherently identifiable? >>> The 'local' encodings then encompass everything else? >>> >>> However, I think we've yet to agree that anything but UTF8 is to be >>> allowed >>> at all. We have a draft spec that stipulates UTF8, >>> but I infer from this thread that there is scope to relax that >>> restriction. >>> The views seem to range from at least 'leaving the door open' >>> in recognition of the variety of encodings available, to advocating that >>> the encoding should not be part of the specification at all, and it will >>> be >>> down to developers to accommodate/influence user practice. I'm in favour >>> of >>> a default encoding or maybe any encoding that is inherently identifiable, >>> and providing a means to declare other encodings (however untrustworthy >>> the >>> declaration may be, it would at least be available to conscientious >>> users/developers), all documented in the spec. >>> >>> Please forgive me if this summary is off the mark; my conclusion is that >>> there's a willingness to accommodate multiple encodings >>> in this (albeit very small) group. Given that we are starting from the >>> position of having a single encoding (agreed upon after much earlier >>> debate), I cannot see us performing a complete U-turn to allow any >>> (potentially unrecognizable) encoding as in CIF1, i.e. without some >>> specification of a canonical encoding or mechanisms to identify/declare >>> the >>> encoding. On the other hand, I hope to see >>> a revised spec that isnt UTF8 only. >>> >>> To get to the point - is there any hope of reaching a compromise? 
>>> >>> Cheers >>> >>> Simon >>> >>> >>> ____________________________________________________________________________ >>> From: "Bollinger, John C" >>> To: Group for discussing encoding and content validation schemes for CIF2 >>> >>> Sent: Monday, 13 September, 2010 19:52:26 >>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. >>> . >>> >>> >>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote: >>> [...] >>>> To my mind, the encoding of plain CIF files remains an open issue.? I >>>> do not view the mechanisms for managing file encoding that are >>>> provided by current OSs to be sufficiently robust, widespread or >>>> consistent that we can rely on developers or text editors respecting >>>> them [...]. >>> >>> I agree that the encoding of plain CIF files remains an open issue. >>> >>> I confess I find your concerns there somewhat vague, especially to the >>> extent that they apply within the confines of a single machine.? Do your >>> concerns extend to that level?? If so, can you provide an example or two >>> of >>> what you fear might go wrong in that context? >>> >>> As Herb recently wrote, "Multiple encodings are a fact of life when >>> working >>> with text."? CIF2 looks like text, it feels like text, and despite some >>> exotic spice, it tastes like text -- even in UTF-8 only form.? We cannot >>> pretend that we're dealing with anything other than text.? We need to >>> accept, therefore, that no matter what we do, authors and programmers will >>> need to account for multiple encodings, one way or another.? The format >>> specification cannot relieve either group of that responsibility. >>> >>> That doesn't necessarily mean, however, that CIF must follow the XML model >>> of being self-defining with regard to text encoding.? 
Given CIF's various >>> uses, we gain little of practical value in this area by defining CIF2 as >>> UTF-8 only, and perhaps equally little by defining required decorations >>> for >>> expressing random encodings.? Moreover, the best reading of CIF1 is that >>> it >>> relies on the *local* text conventions, whatever they may be, which is >>> quite >>> a different thing than handling all text conventions that might >>> conceivably >>> be employed. >>> >>> With that being the case, I don't think it needful for CIF2 in any given >>> environment to endorse foreign encoding conventions other than UTF-8.? >>> CIF2 >>> reasonably could endorse UTF-16 as well, though, as that cannot be >>> confused >>> with any ASCII-compatible encoding.? Allowing UTF-16 would open up useful >>> possibilities both for imgCIF and for future uses not yet conceived.? >>> Additionally, since CIF is text I still think it important for CIF2 to >>> endorse the default text conventions of its operating environment. >>> >>> Could we agree on those three as allowed encodings?? Consider, given that >>> combination of supported alternatives and no extra support from the spec, >>> how might various parties deal with the unavoidable encoding issue.? Here >>> are some of the more reasonable alternatives I see: >>> >>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB: >>> >>> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.? The >>> responsibility to perform any needed transcoding is on the other party.? >>> This is just as it might be with UTF-8-only. >>> >>> Option b) in addition to supporting UTF-8 and/or UTF-16, support >>> other encodings by allowing users to explicitly specify them as part of >>> the >>> submission/retrieval process.? The processor / repository would either >>> ensure the CIF is properly labeled, or, better, transcode it to >>> UTF-8[/16].? >>> This also is just as it might be with UTF-8 only. >>> >>> 2. 
Programs and Libraries: >>> >>> Option a) On input, detect encoding by checking first for UTF-16, >>> assuming UTF-8 if not UTF-16, and falling back to default text conventions >>> if a UTF-8 decoding error is encountered. On output, encode as directed >>> by >>> the user (among the two/three options), defaulting to the input encoding >>> when that is available and feasible. These would be desirable behaviors >>> even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, >>> but they do exceed UTF-8-only requirements. >>> >>> Option b) Require input and produce output according to a fixed set >>> of conventions (whether local text conventions or UTF-8/16). The program >>> user is responsible for any needed transcoding. This would be sufficient >>> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those >>> differ, however, in which text conventions would be assumed. >>> >>> 3. Users/Authors: >>> 3.1. Creating / editing CIFs >>> No change from current practice is needed, but users might choose >>> to >>> store CIFs in UTF-8[/16] form. This is just as it would likely be under >>> UTF-8 only. >>> >>> 3.2. Transferring CIFs >>> Unless an alternative agreement on encoding can be reached by some >>> means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This >>> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) >>> allowed. >>> >>> 3.3. Receiving CIFs >>> The receiver may reasonably demand that the CIF be provided in >>> UTF-8[/16] form. He should *expect* that form unless some alternative >>> agreement is established. Any desired transcoding from UTF-8[/16] to an >>> alternative encoding is the user's responsibility. Again, this is not >>> significantly different from the UTF-8 only case. >>> >>> >>> A driving force in many of those cases is the well-understood (especially >>> here!) 
fact that different systems cannot be relied upon to share text >>> conventions, thus leaving UTF-8[/16] as the only available general-purpose >>> medium of exchange. At the same time, local conventions are not forbidden >>> from use where they can be relied upon -- most notably, within the same >>> computer. Even if end-users, as a group, do not appreciate those details, >>> we can ensure via the spec that CIF2 implementers do. That's sufficient. >>> >>> So, if pretty much all my expected behavior under UTF-8[/16]+local is the >>> same as it would be under UTF-8-only, then why prefer the former? Because >>> under UTF-8[/16]+local, all the behavior described is conformant to the >>> spec, whereas under UTF-8 only, a significant proportion is not. If the >>> standard adequately covers these behaviors then we can expect more uniform >>> support. Moreover, this bears directly on community acceptance of the >>> spec. If flouting the spec with respect to encoding becomes common, then >>> the spec will have failed, at least in that area. Having failed in one >>> area, it is more likely to fail in others. >>> >>> >>> Regards, >>> >>> John >>> -- >>> John C. Bollinger, Ph.D. >>> Department of Structural Biology >>> St. Jude Children's Research Hospital >>> >>> Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Wed Sep 15 14:42:19 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Wed, 15 Sep 2010 09:42:19 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: <20100915123927.GA26246@emerald.iucr.org> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> Message-ID: Dear Colleagues, 1. For a Mac under OSX, I use cyclone for conversion of encodings. 2. 
No hash scheme will survive random trips through random editors or random systems. 3. Embedded strings of characters (e.g. the 5 accented o's or more) will also undergo strange transformations, but they will be easier to deal with without a lot of external software support. 4. There is no way to make a "pure ascii version" of a general UTF-8 file without adopting some reserved character strings at the lexical level -- \U... or &#...; or somesuch as used in many other systems, but with such an extension, it is easy. 5. We can keep going on this forever -- we need to make some decisions. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Wed, 15 Sep 2010, Brian McMahon wrote: > I have said little or nothing on this list so far, because I'm > not sure that I can add anything that's of concrete use. I've read > the many contributions, all of them carefully thought through, and > I still see both sides (actually, all sides) of the arguments. I > am disinterested in the eventual outcome (but not "uninterested"). > > But, whatever the outcome, the IUCr will undoubtedly receive files > *intended* by the authors as CIF submissions, that come in a variety of > character-set encodings. For the most part, we will want to accept > these without asking the author what the encoding was, not least > because the typical author will have no idea (and increasingly, > our typical author will struggle to understand the questions we are > posing since English is not his or her native language - or perhaps we > will struggle to understand the reply). 
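Point 4 of Herbert's list above (a pure-ASCII rendering of a general UTF-8 file via reserved escape strings) can be illustrated with a minimal Python sketch. The 8-hex-digit \UXXXXXXXX form and the names to_ascii and from_ascii are assumptions made purely for illustration: the thread mentions \U... and &#...; as candidate notations without fixing a syntax, and nothing here is part of any CIF specification.

```python
import re

# Hypothetical escape syntax: a backslash, "U", then exactly 8 hex digits.
_ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})")


def to_ascii(text):
    """Return a pure-ASCII rendering of text using \\UXXXXXXXX escapes."""
    out = []
    for ch in text:
        if ch == "\\":
            # Escape the backslash itself so that every backslash in the
            # output starts an escape and decoding stays unambiguous.
            out.append("\\U0000005C")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            out.append("\\U%08X" % ord(ch))
    return "".join(out)


def from_ascii(text):
    """Invert to_ascii by expanding each \\UXXXXXXXX escape."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Escaping the backslash as well is a design choice of this sketch: it costs ten characters per literal backslash but makes the round trip lossless without any lookahead.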
> > So my concerns are: > > (1) how easily can we determine the correct encoding with which the > file was generated; > > (2) how easily can we convert it into our canonical encoding(s) for > in-house production, archiving and delivery? > > First a few comments on that "canonical encoding(s)". Simon and I have > both been happy enough to consider UTF-8 as a lingua franca, since we > perceive it as a reasonably widespread vehicle for carrying a large > (multilingual) character set, and that is widely supported by many > generic text processors and platforms. However, many of our existing > CIF applications may choke on a UTF-8 file, and we may need to > create working formats that are pure ASCII. I would also prefer to > retain a single archival version of a CIF (well, ideally several > identical copies for redundancy, but nonetheless a single *version*), > from which alternative encodings that we choose to support for > delivery from the archive can be generated on the fly. > > So, really, the desire would be to have standalone applications that > can convert between character encodings on the fly. Does anyone know > of the general availability of such tools? The more, reliable, > conversions that can be made, the more relaxed we are about accepting > multiple input encodings. I have to say that a very quick Google > search hasn't yet thrown up much encouragement here. > > Now, back to (1). In similar vein, do you know of any standalone > utilities that help in determining a text-file character encoding? > > [I'm happy to be educated, ideally off-list, in whether > Content-Encoding negotiation in web forms can help here, since many > of our CIF submissions come by that route, but I'm more interested in > the general question of how you determine the encoding of a text file > that you just happen to find sitting on the filesystem.] 
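As one possible approach to concerns (1) and (2), the detection order proposed earlier in the thread (UTF-16 first, then UTF-8, then local conventions) can be sketched in Python using only the standard library. Assumptions in this sketch: UTF-16 is recognised solely by its byte-order mark, the names decode_cif_bytes and to_utf8 are hypothetical, and the fallback encoding is whatever the caller or the locale supplies. A successful fallback decode does not prove that the guess was right, which is the difficulty raised above.

```python
import codecs
import locale


def decode_cif_bytes(data, local_encoding=None):
    """Return (encoding_name, text) using a UTF-16 -> UTF-8 -> local order."""
    # Step 1: a UTF-16 byte-order mark is assumed to be the only UTF-16 signal.
    # Caveat: a little-endian UTF-32 BOM also begins with the bytes FF FE.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16", data.decode("utf-16")
    # Step 2: strict UTF-8; "utf-8-sig" also drops an optional UTF-8 BOM.
    try:
        return "utf-8", data.decode("utf-8-sig")
    except UnicodeDecodeError:
        pass
    # Step 3: fall back to the local text convention (this may still raise).
    fallback = local_encoding or locale.getpreferredencoding(False)
    return fallback, data.decode(fallback)


def to_utf8(data, local_encoding=None):
    """Transcode bytes in the detected encoding to canonical UTF-8."""
    _, text = decode_cif_bytes(data, local_encoding)
    return text.encode("utf-8")
```

Pure ASCII input is reported as UTF-8 here, since every ASCII file is also valid UTF-8.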
>
> One utility we use heavily in the submission system is "file"
> (http://freshmeat.net/projects/file - we currently use version 4.26
> with an augmented and slightly modified magic file). This is rather
> quiet about different character encodings, though I notice the magic
> file distributed with the more recent version 5.04 does have a
> "Unicode" section, namely:
>
> #------------------------------------------------------------------------------
> # $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $
> # Unicode: BOM prefixed text files - Adrian Havill
> # GRR: These types should be recognised in file_ascmagic so these
> # encodings can be treated by text patterns.
> # Missing types are already dealt with internally.
> #
> 0 string +/v8 Unicode text, UTF-7
> 0 string +/v9 Unicode text, UTF-7
> 0 string +/v+ Unicode text, UTF-7
> 0 string +/v/ Unicode text, UTF-7
> 0 string \335\163\146\163 Unicode text, UTF-8-EBCDIC
> 0 string \376\377\000\000 Unicode text, UTF-32, big-endian
> 0 string \377\376\000\000 Unicode text, UTF-32, little-endian
> 0 string \016\376\377 Unicode text, SCSU (Standard Compression Scheme for Unicode)
>
> Interestingly, the "animation" module of this new magic file
> conflicts with other possible UTF encodings:
>
> # MPA, M1A
> # updated by Joerg Jenderek
> # GRR the original test are too common for many DOS files, so test 32 <= kbits <= 448
> # GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE)
> # FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries
>
> And, by the way, the "augmented" magic file we use (the one distributed as
> part of the KDE desktop distribution) already includes this section:
>
> # chemical/x-cif 50
> 0 string #\#CIF_1.1
> >10 byte 9 chemical/x-cif
> >10 byte 10 chemical/x-cif
> >10 byte 13 chemical/x-cif
>
> It seems to me that without some reasonably reliable discriminator,
> John's endorsement of support for "local" encodings will allow files
> to leak out into the wider world where they can't at all easily be > handled or even properly identified. (Though, as many have argued > persuasively, "forbidding" them is not going to prevent such files > from being created, and possibly even used fruitfully within local > environments.) > > Remember that many CIFs will come to us in the end after passage across > many heterogeneous systems. I referred in a previous post to my own > daily working environment - Solaris, Linux and Windows systems linked > by a variety of X servers, X emulators, NFS and SMB cross-mounted > filesystems, clipboards communicating with diverse applications > and OSes running different default locales... > [Incidentally, hasn't SMB now been superseded by "CIFS" !] > > Perhaps I'm just perverse; but I doubt that I'm quite unique. We'll > also see files shuttled between co-authors with different languages, > locales, OSes, and exchanged via email, ftp, USB stick etc. > "Corruptions" will inevitably be introduced in these interchanges - > sometimes subtle ones. For example, outside the CIF world altogether, > we see Greek characters change their identity when we run some files > through a PDF -> PostScript -> PDF cycle (all using software from the > same software house, Adobe). The reason has to do with differences in > Windows and Mac encodings, and the failure of the Acrobat software to > track and maintain the character mappings through such a cycle. > > Well, I'll stop here, because in spite of my best intentions I don't > think I'm moving the debate along very much, and I apologise if > everything here has already been so obvious as not to need saying. > > I'll defer further comment until I've learned if there are already > standard text-encoding identifiers and transcoders. 
> > Regards > Brian > _________________________________________________________________________ > Brian McMahon tel: +44 1244 342878 > Research and Development Officer fax: +44 1244 314888 > International Union of Crystallography e-mail: bm at iucr.org > 5 Abbey Square, Chester CH1 2HU, England > > > On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote: >> One, hopefully relevant, aside -- ascii files are not as >> unambiguous as one might think. Depending on what localization >> one has on one's computer, the code point 0x5c (one of the >> characters in the first 127) will be shown as a reverse >> solidus, a yen currency symbol or a won currency symbol. This >> is a holdover from the days of national variants of the ISO >> character set, and shows no signs of going away any time soon. >> >> This is _not_ the only such case, but it is one that impacts >> most programming languages, including dREL, and existing CIF >> files, including the PDB's mmCIF files. >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> >> On Tue, 14 Sep 2010, Herbert J. Bernstein wrote: >> >>> Dear Colleagues, >>> >>> To avoid any misunderstandings, rather than worrying about how >>> we got to where we are, let us each just state a clear position. >>> Here is mine: >>> >>> I favor CIF2 being stated in terms of UTF-8 for clarity, but >>> not specifying any particular _mandatory_ encoding of a CIF2 file >>> as long as there is a clearly agreed mechanism between the >>> creator and consumer of a given CIF2 file as to how to faithfully >>> transform the file between creator's and the consumer's encodings. 
>>> >>> I favor UTF-8 being the default encoding that any CIF2 creator >>> should feel free to use without having to establish any prior >>> agreement with consumers, and that all consumers should try >>> to make arrangements to be able to read, either directly or >>> via some conversion utility or service. If the consumers don't >>> make such arrangements then there may be CIF2 files that they >>> will not be able to read. If a producer creates a CIF2 in any >>> encoding other than UTF8 then there may be consumers who have >>> difficulty reading that CIF2. >>> >>> I favor the IUCr taking responsibility for collecting and >>> disseminating information on particularly useful ways to go >>> to and from UTF8 and/or other popular encodings. >>> >>> Regards, >>> Herbert >>> ===================================================== >>> Herbert J. Bernstein, Professor of Computer Science >>> Dowling College, Kramer Science Center, KSC 121 >>> Idle Hour Blvd, Oakdale, NY, 11769 >>> >>> +1-631-244-3035 >>> yaya at dowling.edu >>> ===================================================== >>> >>> On Tue, 14 Sep 2010, SIMON WESTRIP wrote: >>> >>>> I sense some common ground here with my previous post. >>>> >>>> The UTF8/16 pair could possibly be extended to any unicode encoding that >>>> is >>>> unambiguously/inherently identifiable? >>>> The 'local' encodings then encompass everything else? >>>> >>>> However, I think we've yet to agree that anything but UTF8 is to be >>>> allowed >>>> at all. We have a draft spec that stipulates UTF8, >>>> but I infer from this thread that there is scope to relax that >>>> restriction. >>>> The views seem to range from at least 'leaving the door open' >>>> in recognition of the variety of encodings available, to advocating that >>>> the encoding should not be part of the specification at all, and it will >>>> be >>>> down to developers to accommodate/influence user practice. 
I'm in favour >>>> of >>>> a default encoding or maybe any encoding that is inherently identifiable, >>>> and providing a means to declare other encodings (however untrustworthy >>>> the >>>> declaration may be, it would at least be available to conscientious >>>> users/developers), all documented in the spec. >>>> >>>> Please forgive me if this summary is off the mark; my conclusion is that >>>> there's a willingness to accommodate multiple encodings >>>> in this (albeit very small) group. Given that we are starting from the >>>> position of having a single encoding (agreed upon after much earlier >>>> debate), I cannot see us performing a complete U-turn to allow any >>>> (potentially unrecognizable) encoding as in CIF1, i.e. without some >>>> specification of a canonical encoding or mechanisms to identify/declare >>>> the >>>> encoding. On the other hand, I hope to see >>>> a revised spec that isnt UTF8 only. >>>> >>>> To get to the point - is there any hope of reaching a compromise? >>>> >>>> Cheers >>>> >>>> Simon
> _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > From simonwestrip at btinternet.com Wed Sep 15 15:11:17 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Wed, 15 Sep 2010 14:11:17 +0000 (GMT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: <20100915123927.GA26246@emerald.iucr.org> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> Message-ID: <53458.94290.qm@web87002.mail.ird.yahoo.com> Hi Brian: I dont know of any standard text-encoding identifiers and transcoders. There are certainly SDKs out there that provide text codecs to read/write data; the trick is identifying the original encoding in order to select the codec. Interactive applications might resort to prompting the user to confirm the encoding by presenting them with a view of the text and a list of encodings - the user can toggle through the encodings until the document is rendered correctly (e.g. MS Word does this). Obviously this is not ideal, but is something I've been thinking about as part of the web upload process. In addition, there's documentation on heuristic approaches to detecting encoding (as employed by browsers - indeed Mozilla makes its source available). 
I dont think this sort of autodetection will prove useful though, and actually may scupper an interactive encoding confirmation mechanism as described above! Cheers Simon ________________________________ From: Brian McMahon To: Group for discussing encoding and content validation schemes for CIF2 Sent: Wednesday, 15 September, 2010 13:39:27 Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . [...]
On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote: > One, hopefully relevant, aside -- ascii files are not as > unambiguous as one might think. Depending on what localization > one has one one's computer, the code point 0x5c (one of the > characters in the first 127) will be shown as a reverse > solidus, a yen currency symbol or a won currency symbol. This > is a holdover from the days of national variants of the ISO > character set, and shows no signs of going away any time soon. > > This is _not_ the only such case, but it is one that impacts > most programming languages, including dREL, and existing CIF > files, including the PDB's mmCIF files. > ===================================================== > Herbert J. 
Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Tue, 14 Sep 2010, Herbert J. Bernstein wrote: > >> Dear Colleagues, >> >> To avoid any misunderstandings, rather than worrying about how >> we got to where we are, let us each just state a clear position. >> Here is mine: >> >> I favor CIF2 being stated in terms of UTF-8 for clarity, but >> not specifying any particular _mandatory_ encoding of a CIF2 file >> as long as there is a clearly agreed mechanism between the >> creator and consumer of a given CIF2 file as to how to faithfully >> transform the file between creator's and the consumer's encodings. >> >> I favor UTF-8 being the default encoding that any CIF2 creator >> should feel free to use without having to establish any prior >> agreement with consumers, and that all consumers should try >> to make arrangements to be able to read, either directly or >> via some conversion utility or service. If the consumers don't >> make such arrangements then there may be CIF2 files that they >> will not be able to read. If a producer creates a CIF2 in any >> encoding other than UTF8 then there may be consumers who have >> difficulty reading that CIF2. >> >> I favor the IUCr taking responsibility for collecting and >> disseminating information on particularly useful ways to go >> to and from UTF8 and/or other popular encodings. >> >> Regards, >> Herbert >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> >> On Tue, 14 Sep 2010, SIMON WESTRIP wrote: >> >>> I sense some common ground here with my previous post. 
>>> >>> The UTF8/16 pair could possibly be extended to any unicode encoding that >>> is >>> unambiguously/inherently identifiable? >>> The 'local' encodings then encompass everything else? >>> >>> However, I think we've yet to agree that anything but UTF8 is to be >>> allowed >>> at all. We have a draft spec that stipulates UTF8, >>> but I infer from this thread that there is scope to relax that >>> restriction. >>> The views seem to range from at least 'leaving the door open' >>> in recognition of the variety of encodings available, to advocating that >>> the encoding should not be part of the specification at all, and it will >>> be >>> down to developers to accommodate/influence user practice. I'm in favour >>> of >>> a default encoding or maybe any encoding that is inherently identifiable, >>> and providing a means to declare other encodings (however untrustworthy >>> the >>> declaration may be, it would at least be available to conscientious >>> users/developers), all documented in the spec. >>> >>> Please forgive me if this summary is off the mark; my conclusion is that >>> there's a willingness to accommodate multiple encodings >>> in this (albeit very small) group. Given that we are starting from the >>> position of having a single encoding (agreed upon after much earlier >>> debate), I cannot see us performing a complete U-turn to allow any >>> (potentially unrecognizable) encoding as in CIF1, i.e. without some >>> specification of a canonical encoding or mechanisms to identify/declare >>> the >>> encoding. On the other hand, I hope to see >>> a revised spec that isnt UTF8 only. >>> >>> To get to the point - is there any hope of reaching a compromise? 
>>> >>> Cheers >>> >>> Simon >>> >>> >>> ____________________________________________________________________________ >>> From: "Bollinger, John C" >>> To: Group for discussing encoding and content validation schemes for CIF2 >>> >>> Sent: Monday, 13 September, 2010 19:52:26 >>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. >>> . >>> >>> >>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote: >>> [...] >>>> To my mind, the encoding of plain CIF files remains an open issue. I >>>> do not view the mechanisms for managing file encoding that are >>>> provided by current OSs to be sufficiently robust, widespread or >>>> consistent that we can rely on developers or text editors respecting >>>> them [...]. >>> >>> I agree that the encoding of plain CIF files remains an open issue. >>> >>> I confess I find your concerns there somewhat vague, especially to the >>> extent that they apply within the confines of a single machine. Do your >>> concerns extend to that level? If so, can you provide an example or two >>> of >>> what you fear might go wrong in that context? >>> >>> As Herb recently wrote, "Multiple encodings are a fact of life when >>> working >>> with text." CIF2 looks like text, it feels like text, and despite some >>> exotic spice, it tastes like text -- even in UTF-8 only form. We cannot >>> pretend that we're dealing with anything other than text. We need to >>> accept, therefore, that no matter what we do, authors and programmers will >>> need to account for multiple encodings, one way or another. The format >>> specification cannot relieve either group of that responsibility. >>> >>> That doesn't necessarily mean, however, that CIF must follow the XML model >>> of being self-defining with regard to text encoding. Given CIF's various >>> uses, we gain little of practical value in this area by defining CIF2 as >>> UTF-8 only, and perhaps equally little by defining required decorations >>> for >>> expressing random encodings. 
Moreover, the best reading of CIF1 is that >>> it >>> relies on the *local* text conventions, whatever they may be, which is >>> quite >>> a different thing than handling all text conventions that might >>> conceivably >>> be employed. >>> >>> With that being the case, I don't think it needful for CIF2 in any given >>> environment to endorse foreign encoding conventions other than UTF-8. >>> CIF2 >>> reasonably could endorse UTF-16 as well, though, as that cannot be >>> confused >>> with any ASCII-compatible encoding. Allowing UTF-16 would open up useful >>> possibilities both for imgCIF and for future uses not yet conceived. >>> Additionally, since CIF is text I still think it important for CIF2 to >>> endorse the default text conventions of its operating environment. >>> >>> Could we agree on those three as allowed encodings? Consider, given that >>> combination of supported alternatives and no extra support from the spec, >>> how might various parties deal with the unavoidable encoding issue. Here >>> are some of the more reasonable alternatives I see: >>> >>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB: >>> >>> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The >>> responsibility to perform any needed transcoding is on the other party. >>> This is just as it might be with UTF-8-only. >>> >>> Option b) in addition to supporting UTF-8 and/or UTF-16, support >>> other encodings by allowing users to explicitly specify them as part of >>> the >>> submission/retrieval process. The processor / repository would either >>> ensure the CIF is properly labeled, or, better, transcode it to >>> UTF-8[/16]. >>> This also is just as it might be with UTF-8 only. >>> >>> 2. Programs and Libraries: >>> >>> Option a) On input, detect encoding by checking first for UTF-16, >>> assuming UTF-8 if not UTF-16, and falling back to default text conventions >>> if a UTF-8 decoding error is encountered. 
On output, encode as directed >>> by >>> the user (among the two/three options), defaulting to the input encoding >>> when that is available and feasible. These would be desirable behaviors >>> even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, >>> but they do exceed UTF-8-only requirements. >>> >>> Option b) Require input and produce output according to a fixed set >>> of conventions (whether local text conventions or UTF-8/16). The program >>> user is responsible for any needed transcoding. This would be sufficient >>> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those >>> differ, however, in which text conventions would be assumed. >>> >>> 3. Users/Authors: >>> 3.1. Creating / editing CIFs >>> No change from current practice is needed, but users might choose >>> to >>> store CIFs in UTF-8[/16] form. This is just as it would likely be under >>> UTF-8 only. >>> >>> 3.2. Transferring CIFs >>> Unless an alternative agreement on encoding can be reached by some >>> means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This >>> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) >>> allowed. >>> >>> 3.3. Receiving CIFs >>> The receiver may reasonably demand that the CIF be provided in >>> UTF-8[/16] form. He should *expect* that form unless some alternative >>> agreement is established. Any desired transcoding from UTF-8[/16] to an >>> alternative encoding is the user's responsibility. Again, this is not >>> significantly different from the UTF-8 only case. >>> >>> >>> A driving force in many of those cases is the well-understood (especially >>> here!) fact that different systems cannot be relied upon to share text >>> conventions, thus leaving UTF-8[/16] as the only available general-purpose >>> medium of exchange. At the same time, local conventions are not forbidden >>> from use where they can be relied upon -- most notably, within the same >>> computer. 
Even if end-users, as a group, do not appreciate those details,
>>> we can ensure via the spec that CIF2 implementers do. That's sufficient.
>>>
>>> So, if pretty much all my expected behavior under UTF-8[/16]+local is the
>>> same as it would be under UTF-8-only, then why prefer the former? Because
>>> under UTF-8[/16]+local, all the behavior described is conformant to the
>>> spec, whereas under UTF-8 only, a significant proportion is not. If the
>>> standard adequately covers these behaviors then we can expect more uniform
>>> support. Moreover, this bears directly on community acceptance of the
>>> spec. If flouting the spec with respect to encoding becomes common, then
>>> the spec will have failed, at least in that area. Having failed in one
>>> area, it is more likely to fail in others.
>>>
>>>
>>> Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology
>>> St. Jude Children's Research Hospital
>>>
>>> Email Disclaimer: www.stjude.org/emaildisclaimer

From John.Bollinger at STJUDE.ORG Wed Sep 15 16:20:11 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Wed, 15 Sep 2010 10:20:11 -0500
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .
In-Reply-To: <20100915123927.GA26246@emerald.iucr.org> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local> Hello Brian, On Wednesday, September 15, 2010 7:39 AM, Brian McMahon wrote: [...] >But, whatever the outcome, the IUCr will undoubtedly receive files >*intended* by the authors as CIF submissions, that come in a variety of >character-set encodings. For the most part, we will want to accept >these without asking the author what the encoding was, not least >because the typical author will have no idea (and increasingly, >our typical author will struggle to understand the questions we are >posing since English is not his or her native language - or perhaps we >will struggle to understand the reply). > >So my concerns are: > >(1) how easily can we determine the correct encoding with which the >file was generated; In general, it is not possible to do this. These practical matters bear on the issue: a) If the authors use only ASCII characters then in most cases the actual encoding either (i) is indistinguishable from and congruent with UTF-8 for the file's contents, or (ii) is autodetectable b) If the authors put literal non-ASCII characters in their CIF, then UTF-8 and UTF-16 variants (and UTF-32 variants, though these are rarely used) could be autodetected with excellent reliability, but these are typically *not* the default encoding in current computing environments. Other encodings cannot reliably be distinguished, though one might attempt to guess based on geographic origin of the CIF and/or natural language text in certain data items. That's not very satisfactory. >(2) how easily can we convert it into our canonical encoding(s) for >in-house production, archiving and delivery? 
If a file's encoding is known, then transcoding it is easy. The only potential issue is if the result encoding does not have codes for some of the input characters, but in practice, this is not an issue for UTF-8 (or UTF-16 or UTF-32) as the result encoding. If a file's encoding cannot reliably be determined, then correctly transcoding it is impossible. >First a few comments on that "canonical encoding(s)". Simon and I have >both been happy enough to consider UTF-8 as a lingua franca [...] > However, many of our existing >CIF applications may choke on a UTF-8 file, and we may need to >create working formats that are pure ASCII. If you need support for pure ASCII then you need some kind of general escape mechanism by which to represent non-ASCII characters in ASCII. Something like Python's "\uxxxx[x[x]]" syntax, perhaps. Such a scheme could work equally well for non-ASCII characters in data names as for those in values, but there may be secondary considerations for those in data names. > I would also prefer to >retain a single archival version of a CIF [...] >from which alternative encodings that we choose to support for >delivery from the archive can be generated on the fly. > >So, really, the desire would be to have standalone applications that >can convert between character encodings on the fly. Does anyone know >of the general availability of such tools? The more, reliable, >conversions that can be made, the more relaxed we are about accepting >multiple input encodings. I have to say that a very quick Google >search hasn't yet thrown up much encouragement here. I'm not sure about specific commercially / openly available transcoders, but it's a relatively easy problem. I can write a simple one in under half an hour that would handle a large proportion of what you want -- with the exception of the problem of representing non-ASCII characters in ASCII without data loss. 
That's not really a hard problem either, once a specific solution is chosen, but it will require a custom program. >Now, back to (1). In similar vein, do you know of any standalone >utilities that help in determining a text-file character encoding? The most prominent contender in this space appears to be Mozilla's encoding detection algorithm, which is available in library form and in a few programs. I do not have personal experience with any of them. All the algorithms and utilities I have researched are focused on HTML pages and rely on the input containing text in a natural language associated with the encoding -- the more natural language text, the better. I don't think any of them are well suited to CIF, nor especially to Acta Cryst submissions (because the text must be English). >One utility we use heavily in the submission system is "file" >(http://freshmeat.net/projects/file - we currently use version 4.26 >with an augmented and slightly modified magic file). 'File' implements an heuristic approach based on characteristic signatures associated with many file types. It is much more reliable for some file types than for others. Without going into detail, 'file' is simply not up to the task of discerning among various text encodings, notwithstanding its recognition of signatures for some varieties of Unicode text. >It seems to me that without some reasonably reliable discriminator, >John's endorsement of support for "local" encodings will allow files >to leak out into the wider world where they can't at all easily be >handled or even properly identified. (Though, as many have argued >persuasively, "forbidding" them is not going to prevent such files >from being created, and possibly even used fruitfully within local >environments.) Indeed, I claim that there is *no* way to prevent authors from sending "CIFs" encoded in any particular way out into the world. It already happens. Nothing the standard can say will make it stop. 
The most the standard can do is to declare that such files aren't actually CIFs, but that wouldn't help anybody very much. What *can* happen, however, is that recipients of such files can reject them if the encoding is ambiguous. Were I setting policy for Acta Crystallographica with respect to CIF2, I would require CIF2 submissions to be encoded in UTF-8, or perhaps alternatively in UTF-16 if that ends up allowed by the standard. If IUCr wishes to be relaxed about _enforcement_ of such a policy in order to better serve authors, then fine, but that's a tricky proposition. I expect that in this area it will be much easier to tell authors "do this" than to after the fact determine "what did you do?". Chester will face that policy decision regardless of the standard's ultimate position on encoding. >Remember that many CIFs will come to us in the end after passage across >many heterogeneous systems. [...] >We'll >also see files shuttled between co-authors with different languages, >locales, OSes, and exchanged via email, ftp, USB stick etc. >"Corruptions" will inevitably be introduced in these interchanges - >sometimes subtle ones. [...] In no way do I discount those issues, but none of them can be solved by CIF2. James and I differ on how much the standard can even influence those areas, but I think little. Whatever influence it can exert, however, is at least as great under my UTF-8[/16]+local proposal as it is under UTF-8 only, because neither grants blanket acceptance to encodings other than UTF-8 and maybe UTF-16, yet UTF-8[/16]+local also focuses attention on the fact that local conventions do differ. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From bm at iucr.org Thu Sep 16 14:17:53 2010 From: bm at iucr.org (Brian McMahon) Date: Thu, 16 Sep 2010 14:17:53 +0100 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . 
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <20100916131753.GA18504@emerald.iucr.org>

Thanks to Herbert, John and Simon for responding. I'm sorry if it seems like once again round an endless loop, but your replies have helped me to settle on the way I would like to see things move forward. For what it's worth:

***
I favour the specification *recommending* a magic string to begin a file: an optional BOM followed by the 11 characters
#\#CIF_2.0

I favour the specification *recommending* that this initial comment should be extended with an indication of the character encoding where this is not ASCII.

I suggest the specification's discussion of the form this will take, as well as any other comments on character-set encoding, be presented in a distinct section of the specification (Part 3, or an Annexe or Appendix).

These are recommendations, not requirements,
  1. to include existing CIF1.0 and CIF1.1 instances as valid CIF input streams (whether "decorated" or not);
  2. because you can only ever take this meta-information as well-intentioned hints.

***
I like the idea of a checksum, but I think it's premature to require any particular formulation at this revision of the specification.

***
I favour this new "Part 3" of the specification providing some general commentary on the nature of text files and transcoding issues. It should present UTF-8 as a "concrete" instantiation, and stipulate a suitable tag for incorporation in the "magic number" comment, let us say something like .
It should explain the importance of developers following the "recommendations", and should caution against (but not prohibit) gratuitous proliferation of encodings. It should identify an additional resource hosted on the COMCIFS web site that provides guidance to developers. Use of the term "concrete" here harks back to the SGML specification. SGML is actually a metastandard for document markup languages, and in principle permits many different ways of tagging markup. But in describing just one "concrete" example, based on angle brackets, it encouraged the universal adoption of such tags right through HTML and XML. *** John said: > "Were I setting policy for Acta Crystallographica with respect to CIF2, > I would require CIF2 submissions to be encoded in UTF-8 ... If > IUCr wishes to be relaxed about _enforcement_ of such a policy in > order to better serve authors, then fine, but that's a tricky > proposition. I have some concerns about "enforceability" - an end-user (author) may simply not know how to comply with a requirement to supply a document in a specified encoding. However, the IUCr Managing Editor would accept a policy that required authors whose CIFs we had "difficulty in reading" to use a particular tool, namely publCIF. *** The "additional resource" I referred to could contain among other things: a list of organisations (IUCr journals, PDB, CCDC, individual synchrotron facilities) and their policies on accepting or outputting specific character-set encodings; a list of preferred encoding tags (initially just and perhaps , but extended in response to requests from specific developers); best-practice recommendations. I would prefer these to evolve from community discussions and practical requirements, rather than appear to be imposed by fiat of COMCIFS or IUCr - so maybe this should be a "cif-developers" rather than "COMCIFS" website. *** This approach tries to close off the formal specification while allowing controlled extensions. 
Essentially my "additional resource" becomes the framework for establishing protocols for conversion between different character-set encodings and serializations.

For instance, Herbert replied to my comments on needing a pure ASCII representation in-house:

> There is no way to make a "pure ascii version" of a general UTF-8
> file without adopting some reserved character strings at the lexical
> level -- \U... or &#...; or somesuch as used in many other systems,
> but with such an extension, it is easy.

That's perfectly understood, and I would expect that we (Acta) would devise an informal scheme to allow us to do so for whatever purposes we needed. We wouldn't expect that to be an integral part of the CIF-2 standard. On the other hand, if it became clear that other people were having difficulty in processing UTF-8 CIFs, we could formalise what we had done with a new encoding tag, post that on our cif-developers resource:

    Encoding scheme     Details                         Reference
                        Crystallography Journals        http://........
                        ASCII-fication of Unicode
                        characters

and serve CIFs on request with the initial header

    #\#CIF_2.0

(I understand that this is different from character-set transcoding because it involves additional processing at the lexical level, so it may not be an appropriate thing to bundle these together in the same way. That's open to later discussion, but my point is that we're at least setting up a system allowing the community to exchange information about practical representation conversions, and so reduce the likelihood of uncontrolled chaos.)

Regards
Brian

From jamesrhester at gmail.com Thu Sep 16 14:53:31 2010
From: jamesrhester at gmail.com (James Hester)
Date: Thu, 16 Sep 2010 23:53:31 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
In-Reply-To: <53458.94290.qm@web87002.mail.ird.yahoo.com> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <53458.94290.qm@web87002.mail.ird.yahoo.com> Message-ID: Hi Brian: I think that John B and Simon have answered your questions adequately (ie you can forget about reliable autodetection of non-Unicode encodings). Chester is somewhat shielded from encoding mixups by virtue of the fact that you are in contact with the author of the CIF, who will have opportunities to catch encoding errors at some stage prior to the manuscript being finalised. That said, the less potential for mixup between multiple authors prior to submission, the better. For my part, I think the IUCr could handle manuscript submissions as follows: (i) CheckCIF should report non-UTF8 encoding as a top-level warning, with the warning message pointing to an IUCr-maintained webpage which describes how to save/convert files to UTF8 encoding for a range of popular editors (ii) The standard should give as little encouragement to non-UTF8 encodings as possible, to reduce the number of non-UTF8 submissions in the first place (iii) UTF8 introduction can be staged relatively slowly, starting from allowing it in a few non-essential datanames (e.g. defining _author_name_native_script or somesuch). Let's remember that on day 1 everything can still be ASCII as the dictionaries will be able to restrict character sets to ASCII (iv) Authors are not required to choose an encoding upon submission, but if non-UTF8 is detected, the authors will automatically be presented with a PDF version of their CIF manuscript and advised to check carefully, especially non-ASCII characters (Greek symbols!). 
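
The check proposed in (i) and (iv) could be sketched roughly as below. This is only an illustration under stated assumptions, not a description of how CheckCIF works: the names `classify_cif_encoding` and `needs_encoding_warning` and the returned labels are hypothetical. It recognises Unicode encodings by their byte-order mark, otherwise tries a strict UTF-8 decode, and treats everything else as unidentified, which matches the point made in this thread that non-Unicode encodings cannot be reliably autodetected.

```python
import codecs

# Hypothetical helper for illustration; names and labels are assumptions.
# UTF-32 BOMs are tested before UTF-16 because the UTF-32-LE BOM begins
# with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def classify_cif_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', 'utf-16', 'utf-32' or 'unknown' for raw file bytes."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    # NUL bytes are valid UTF-8 but suggest BOM-less UTF-16/32 or binary
    # content, so they are reported as unidentified (an assumption here).
    if b"\x00" in data:
        return "unknown"
    try:
        data.decode("utf-8")  # strict decoding: any invalid sequence raises
    except UnicodeDecodeError:
        return "unknown"
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"


def needs_encoding_warning(data: bytes) -> bool:
    """True when the bytes are neither pure ASCII nor valid UTF-8."""
    return classify_cif_encoding(data) not in ("ascii", "utf-8")
```

A submission system might call `needs_encoding_warning(open(path, "rb").read())` and, when it returns True, show the author the rendered PDF for checking. Note that a legacy 8-bit file whose bytes happen to form valid UTF-8 would pass unnoticed, so a check of this kind reduces mixups rather than eliminating them.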
On Thu, Sep 16, 2010 at 12:11 AM, SIMON WESTRIP wrote: > Hi Brian: I dont know of any standard text-encoding identifiers and > transcoders. > > There are certainly SDKs out there that provide text codecs to read/write > data; > the trick is identifying the original encoding in order to select the codec. > ?Interactive applications might resort to prompting the > user to confirm the encoding by presenting them with a view of the text and > a list of encodings - the > user can toggle through the encodings until the document is rendered > correctly (e.g. MS Word does this). > Obviously this is not ideal, but is something I've been thinking about as > part of the web upload process. > > In addition, there's documentation on heuristic approaches to detecting > encoding (as employed by browsers - indeed Mozilla makes its source > available). > I dont think this sort of autodetection will prove useful though, and > actually may scupper an interactive encoding confirmation mechanism as > described above! > > Cheers > > Simon > > > ________________________________ > From: Brian McMahon > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Wednesday, 15 September, 2010 13:39:27 > Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . > > I have said little or nothing on this list so far, because I'm > not sure that I can add anything that's of concrete use. I've read > the many contributions, all of them carefully thought through, and > I still see both sides (actually, all sides) of the arguments. I > am disinterested in the eventual outcome (but not "uninterested"). > > But, whatever the outcome, the IUCr will undoubtedly receive files > *intended* by the authors as CIF submissions, that come in a variety of > character-set encodings. 
For the most part, we will want to accept > these without asking the author what the encoding was, not least > because the typical author will have no idea (and increasingly, > our typical author will struggle to understand the questions we are > posing since English is not his or her native language - or perhaps we > will struggle to understand the reply). > > So my concerns are: > > (1) how easily can we determine the correct encoding with which the > file was generated; > > (2) how easily can we convert it into our canonical encoding(s) for > in-house production, archiving and delivery? > > First a few comments on that "canonical encoding(s)". Simon and I have > both been happy enough to consider UTF-8 as a lingua franca, since we > perceive it as a reasonably widespread vehicle for carrying a large > (multilingual) character set, and that is widely supported by many > generic text processors and platforms. However, many of our existing > CIF applications may choke on a UTF-8 file, and we may need to > create working formats that are pure ASCII. I would also prefer to > retain a single archival version of a CIF (well, ideally several > identical copies for redundancy, but nonetheless a single *version*), > from which alternative encodings that we choose to support for > delivery from the archive can be generated on the fly. > > So, really, the desire would be to have standalone applications that > can convert between character encodings on the fly. Does anyone know > of the general availability of such tools? The more, reliable, > conversions that can be made, the more relaxed we are about accepting > multiple input encodings. I have to say that a very quick Google > search hasn't yet thrown up much encouragement here. > > Now, back to (1). In similar vein, do you know of any standalone > utilities that help in determining a text-file character encoding? 
> > [I'm happy to be educated, ideally off-list, in whether > Content-Encoding negotiation in web forms can help here, since many > of our CIF submissions come by that route, but I'm more interested in > the general question of how you determine the encoding of a text file > that you just happen to find sitting on the filesystem.] > > One utility we use heavily in the submission system is "file" > (http://freshmeat.net/projects/file - we currently use version 4.26 > with an augmented and slightly modified magic file). This is rather > quiet about different character encodings, though I notice the magic > file distributed with the more recent version 5.04 does have a > "Unicode" section, namely: > > > #------------------------------------------------------------------------------ > ? ? # $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $ > ? ? # Unicode:? BOM prefixed text files - Adrian Havill > > ? ? # GRR: These types should be recognised in file_ascmagic so these > ? ? # encodings can be treated by text patterns. > ? ? # Missing types are already dealt with internally. > ? ? # > ? ? 0? ? ? string? +/v8? ? ? ? ? ? ? ? ? ? Unicode text, UTF-7 > ? ? 0? ? ? string? +/v9? ? ? ? ? ? ? ? ? ? Unicode text, UTF-7 > ? ? 0? ? ? string? +/v+? ? ? ? ? ? ? ? ? ? Unicode text, UTF-7 > ? ? 0? ? ? string? +/v/? ? ? ? ? ? ? ? ? ? Unicode text, UTF-7 > ? ? 0? ? ? string? \335\163\146\163? ? ? ? Unicode text, UTF-8-EBCDIC > ? ? 0? ? ? string? \376\377\000\000? ? ? ? Unicode text, UTF-32, big-endian > ? ? 0? ? ? string? \377\376\000\000? ? ? ? Unicode text, UTF-32, > little-endian > ? ? 0? ? ? string? \016\376\377? ? ? ? ? ? Unicode text, SCSU (Standard > Compression Scheme for Unicode) > > Interestingly, the "animation" module of this new magic file > conflicts with other possible UTF encodings: > > ? ? # MPA, M1A > ? ? # updated by Joerg Jenderek > ? ? # GRR the original test are too common for many DOS files, so test 32 <= > kbits < > ? ? = 448 > ? ? 
# GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE) > # FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries > > > And, by the way, the "augmented" magic file we use (the one distributed as > part of the KDE desktop distribution) already includes this section: > > # chemical/x-cif 50 > 0 string #\#CIF_1.1 > >10 byte 9 chemical/x-cif > >10 byte 10 chemical/x-cif > >10 byte 13 chemical/x-cif > > > > It seems to me that without some reasonably reliable discriminator, > John's endorsement of support for "local" encodings will allow files > to leak out into the wider world where they can't at all easily be > handled or even properly identified. (Though, as many have argued > persuasively, "forbidding" them is not going to prevent such files > from being created, and possibly even used fruitfully within local > environments.) > > Remember that many CIFs will come to us in the end after passage across > many heterogeneous systems. I referred in a previous post to my own > daily working environment - Solaris, Linux and Windows systems linked > by a variety of X servers, X emulators, NFS and SMB cross-mounted > filesystems, clipboards communicating with diverse applications > and OSes running different default locales... > [Incidentally, hasn't SMB now been superseded by "CIFS" !] > > Perhaps I'm just perverse; but I doubt that I'm quite unique. We'll > also see files shuttled between co-authors with different languages, > locales, OSes, and exchanged via email, ftp, USB stick etc. > "Corruptions" will inevitably be introduced in these interchanges - > sometimes subtle ones. For example, outside the CIF world altogether, > we see Greek characters change their identity when we run some files > through a PDF -> PostScript -> PDF cycle (all using software from the > same software house, Adobe).
The reason has to do with differences in > Windows and Mac encodings, and the failure of the Acrobat software to > track and maintain the character mappings through such a cycle. > > Well, I'll stop here, because in spite of my best intentions I don't > think I'm moving the debate along very much, and I apologise if > everything here has already been so obvious as not to need saying. > > I'll defer further comment until I've learned if there are already > standard text-encoding identifiers and transcoders. > > Regards > Brian > _________________________________________________________________________ > Brian McMahon tel: +44 1244 342878 > Research and Development Officer fax: +44 1244 314888 > International Union of Crystallography e-mail: bm at iucr.org > 5 Abbey Square, Chester CH1 2HU, England > > > On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote: >> One, hopefully relevant, aside -- ascii files are not as >> unambiguous as one might think. Depending on what localization >> one has on one's computer, the code point 0x5c (one of the >> characters in the first 127) will be shown as a reverse >> solidus, a yen currency symbol or a won currency symbol. This >> is a holdover from the days of national variants of the ISO >> character set, and shows no signs of going away any time soon. >> >> This is _not_ the only such case, but it is one that impacts >> most programming languages, including dREL, and existing CIF >> files, including the PDB's mmCIF files. >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> >> On Tue, 14 Sep 2010, Herbert J.
Bernstein wrote: >> >>> Dear Colleagues, >>> >>> To avoid any misunderstandings, rather than worrying about how >>> we got to where we are, let us each just state a clear position. >>> Here is mine: >>> >>> I favor CIF2 being stated in terms of UTF-8 for clarity, but >>> not specifying any particular _mandatory_ encoding of a CIF2 file >>> as long as there is a clearly agreed mechanism between the >>> creator and consumer of a given CIF2 file as to how to faithfully >>> transform the file between creator's and the consumer's encodings. >>> >>> I favor UTF-8 being the default encoding that any CIF2 creator >>> should feel free to use without having to establish any prior >>> agreement with consumers, and that all consumers should try >>> to make arrangements to be able to read, either directly or >>> via some conversion utility or service. If the consumers don't >>> make such arrangements then there may be CIF2 files that they >>> will not be able to read. If a producer creates a CIF2 in any >>> encoding other than UTF8 then there may be consumers who have >>> difficulty reading that CIF2. >>> >>> I favor the IUCr taking responsibility for collecting and >>> disseminating information on particularly useful ways to go >>> to and from UTF8 and/or other popular encodings. >>> >>> Regards, >>> Herbert >>> ===================================================== >>> Herbert J. Bernstein, Professor of Computer Science >>> Dowling College, Kramer Science Center, KSC 121 >>> Idle Hour Blvd, Oakdale, NY, 11769 >>> >>> +1-631-244-3035 >>> yaya at dowling.edu >>> ===================================================== >>> >>> On Tue, 14 Sep 2010, SIMON WESTRIP wrote: >>> >>>> I sense some common ground here with my previous post. >>>> >>>> The UTF8/16 pair could possibly be extended to any unicode encoding that >>>> is >>>> unambiguously/inherently identifiable? >>>> The 'local' encodings then encompass everything else?
>>>> >>>> However, I think we've yet to agree that anything but UTF8 is to be >>>> allowed >>>> at all. We have a draft spec that stipulates UTF8, >>>> but I infer from this thread that there is scope to relax that >>>> restriction. >>>> The views seem to range from at least 'leaving the door open' >>>> in recognition of the variety of encodings available, to advocating that >>>> the encoding should not be part of the specification at all, and it will >>>> be >>>> down to developers to accommodate/influence user practice. I'm in favour >>>> of >>>> a default encoding or maybe any encoding that is inherently >>>> identifiable, >>>> and providing a means to declare other encodings (however untrustworthy >>>> the >>>> declaration may be, it would at least be available to conscientious >>>> users/developers), all documented in the spec. >>>> >>>> Please forgive me if this summary is off the mark; my conclusion is that >>>> there's a willingness to accommodate multiple encodings >>>> in this (albeit very small) group. Given that we are starting from the >>>> position of having a single encoding (agreed upon after much earlier >>>> debate), I cannot see us performing a complete U-turn to allow any >>>> (potentially unrecognizable) encoding as in CIF1, i.e. without some >>>> specification of a canonical encoding or mechanisms to identify/declare >>>> the >>>> encoding. On the other hand, I hope to see >>>> a revised spec that isnt UTF8 only. >>>> >>>> To get to the point - is there any hope of reaching a compromise? >>>> >>>> Cheers >>>> >>>> Simon >>>> >>>> >>>> >>>> ____________________________________________________________________________ >>>> From: "Bollinger, John C" >>>> To: Group for discussing encoding and content validation schemes for >>>> CIF2 >>>> >>>> Sent: Monday, 13 September, 2010 19:52:26 >>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. >>>> .. >>>> . 
>>>> >>>> >>>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote: >>>> [...] >>>>> To my mind, the encoding of plain CIF files remains an open issue. I >>>>> do not view the mechanisms for managing file encoding that are >>>>> provided by current OSs to be sufficiently robust, widespread or >>>>> consistent that we can rely on developers or text editors respecting >>>>> them [...]. >>>> >>>> I agree that the encoding of plain CIF files remains an open issue. >>>> >>>> I confess I find your concerns there somewhat vague, especially to the >>>> extent that they apply within the confines of a single machine. Do your >>>> concerns extend to that level? If so, can you provide an example or two >>>> of >>>> what you fear might go wrong in that context? >>>> >>>> As Herb recently wrote, "Multiple encodings are a fact of life when >>>> working >>>> with text." CIF2 looks like text, it feels like text, and despite some >>>> exotic spice, it tastes like text -- even in UTF-8 only form. We cannot >>>> pretend that we're dealing with anything other than text. We need to >>>> accept, therefore, that no matter what we do, authors and programmers >>>> will >>>> need to account for multiple encodings, one way or another. The format >>>> specification cannot relieve either group of that responsibility. >>>> >>>> That doesn't necessarily mean, however, that CIF must follow the XML >>>> model >>>> of being self-defining with regard to text encoding. Given CIF's >>>> various >>>> uses, we gain little of practical value in this area by defining CIF2 as >>>> UTF-8 only, and perhaps equally little by defining required decorations >>>> for >>>> expressing random encodings. Moreover, the best reading of CIF1 is that >>>> it >>>> relies on the *local* text conventions, whatever they may be, which is >>>> quite >>>> a different thing than handling all text conventions that might >>>> conceivably >>>> be employed.
>>>> >>>> With that being the case, I don't think it needful for CIF2 in any given >>>> environment to endorse foreign encoding conventions other than UTF-8. >>>> CIF2 >>>> reasonably could endorse UTF-16 as well, though, as that cannot be >>>> confused >>>> with any ASCII-compatible encoding. Allowing UTF-16 would open up >>>> useful >>>> possibilities both for imgCIF and for future uses not yet conceived. >>>> Additionally, since CIF is text I still think it important for CIF2 to >>>> endorse the default text conventions of its operating environment. >>>> >>>> Could we agree on those three as allowed encodings? Consider, given >>>> that >>>> combination of supported alternatives and no extra support from the >>>> spec, >>>> how might various parties deal with the unavoidable encoding issue. >>>> Here >>>> are some of the more reasonable alternatives I see: >>>> >>>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and >>>> PDB: >>>> >>>> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The >>>> responsibility to perform any needed transcoding is on the other party. >>>> This is just as it might be with UTF-8-only. >>>> >>>> Option b) in addition to supporting UTF-8 and/or UTF-16, support >>>> other encodings by allowing users to explicitly specify them as part of >>>> the >>>> submission/retrieval process. The processor / repository would either >>>> ensure the CIF is properly labeled, or, better, transcode it to >>>> UTF-8[/16]. >>>> This also is just as it might be with UTF-8 only. >>>> >>>> 2. Programs and Libraries: >>>> >>>> Option a) On input, detect encoding by checking first for UTF-16, >>>> assuming UTF-8 if not UTF-16, and falling back to default text >>>> conventions >>>> if a UTF-8 decoding error is encountered. On output, encode as directed >>>> by >>>> the user (among the two/three options), defaulting to the input encoding >>>> when that is available and feasible.
These would be desirable behaviors >>>> even in the UTF-8 only case, especially in a mixed CIF1/CIF2 >>>> environment, >>>> but they do exceed UTF-8-only requirements. >>>> >>>> Option b) Require input and produce output according to a fixed set >>>> of conventions (whether local text conventions or UTF-8/16). The >>>> program >>>> user is responsible for any needed transcoding. This would be >>>> sufficient >>>> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those >>>> differ, however, in which text conventions would be assumed. >>>> >>>> 3. Users/Authors: >>>> 3.1. Creating / editing CIFs >>>> No change from current practice is needed, but users might choose >>>> to >>>> store CIFs in UTF-8[/16] form. This is just as it would likely be under >>>> UTF-8 only. >>>> >>>> 3.2. Transferring CIFs >>>> Unless an alternative agreement on encoding can be reached by some >>>> means, the transferor must ensure the CIF is encoded in UTF-8[/16]. >>>> This >>>> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) >>>> allowed. >>>> >>>> 3.3. Receiving CIFs >>>> The receiver may reasonably demand that the CIF be provided in >>>> UTF-8[/16] form. He should *expect* that form unless some alternative >>>> agreement is established. Any desired transcoding from UTF-8[/16] to an >>>> alternative encoding is the user's responsibility. Again, this is not >>>> significantly different from the UTF-8 only case. >>>> >>>> >>>> A driving force in many of those cases is the well-understood >>>> (especially >>>> here!) fact that different systems cannot be relied upon to share text >>>> conventions, thus leaving UTF-8[/16] as the only available >>>> general-purpose >>>> medium of exchange. At the same time, local conventions are not >>>> forbidden >>>> from use where they can be relied upon -- most notably, within the same >>>> computer.
Even if end-users, as a group, do not appreciate those >>>> details, >>>> we can ensure via the spec that CIF2 implementers do. That's >>>> sufficient. >>>> >>>> So, if pretty much all my expected behavior under UTF-8[/16]+local is >>>> the >>>> same as it would be under UTF-8-only, then why prefer the former? >>>> Because >>>> under UTF-8[/16]+local, all the behavior described is conformant to the >>>> spec, whereas under UTF-8 only, a significant proportion is not. If the >>>> standard adequately covers these behaviors then we can expect more >>>> uniform >>>> support. Moreover, this bears directly on community acceptance of the >>>> spec. If flouting the spec with respect to encoding becomes common, >>>> then >>>> the spec will have failed, at least in that area. Having failed in one >>>> area, it is more likely to fail in others. >>>> >>>> >>>> Regards, >>>> >>>> John >>>> -- >>>> John C. Bollinger, Ph.D. >>>> Department of Structural Biology >>>> St. Jude Children's Research Hospital >>>> >>>> Email Disclaimer: www.stjude.org/emaildisclaimer > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From John.Bollinger at STJUDE.ORG Thu Sep 16 16:48:06 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Thu, 16 Sep 2010 10:48:06 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .
In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <53458.94290.qm@web87002.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCC@SJMEMXMBS11.stjude.sjcrh.local> James, On Thursday, September 16, 2010 8:54 AM, James Hester wrote: [...] >For my part, I think the IUCr could handle manuscript submissions as follows: For the most part I think your suggestions are reasonable (and so I omit them), but I hope you will clarify one of them: >(iii) UTF8 introduction can be staged relatively slowly, starting from >allowing it in a few non-essential datanames (e.g. defining >_author_name_native_script or somesuch). Let's remember that on day 1 >everything can still be ASCII as the dictionaries will be able to >restrict character sets to ASCII Are you suggesting that the character encoding of individual data values be configurable in the dictionary? I suspect and hope that where you wrote "UTF8", you meant something more like "Unicode" -- i.e. the set of allowed (literal) characters, not their encoding. Is that right? If UTF-8 emerges as the only permitted encoding for CIF2 then this will be a mainly semantic difference, but it nevertheless has implications for software design and behavior. If UTF-8 does *not* emerge as the only permitted encoding for CIF2 then this will be a tremendous difference. Best, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From John.Bollinger at STJUDE.ORG Thu Sep 16 17:19:13 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Thu, 16 Sep 2010 11:19:13 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .. . 
In-Reply-To: <20100916131753.GA18504@emerald.iucr.org> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local> <20100916131753.GA18504@emerald.iucr.org> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCD@SJMEMXMBS11.stjude.sjcrh.local> Hi Brian, On Thursday, September 16, 2010 8:18 AM, you wrote: [...] >I favour the specification *recommending* a magic string to begin a >file: an optional BOM followed by the 11 characters > >#\#CIF_2.0 Can you expand on that a bit? Ignoring all considerations of character encoding, CIF 2.0 syntax is not 100% backwards compatible with CIF 1 syntax. (Handling of quoted strings is the most prominent area of incompatibility, but there are others.) If a CIF processor is presented with a file having no version identification comment, then which syntax would you want it to assume? [...] >These are recommendations, not requirements, > >1. to include existing CIF1.0 and CIF1.1 instances as valid CIF input >streams (whether "decorated" or not; Making the version comment optional in CIF2, as you suggest, would make *most* well-formed CIF1 instances also be well-formed CIF2 instances -- but not all of them. Making CIF2 accommodate all of CIF1 would require substantial changes to those parts of the draft that we have considered settled. Is it that important to you? Thanks, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From bm at iucr.org Thu Sep 16 18:13:37 2010 From: bm at iucr.org (Brian McMahon) Date: Thu, 16 Sep 2010 18:13:37 +0100 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .. . 
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCD@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local> <20100916131753.GA18504@emerald.iucr.org> <8F77913624F7524AACD2A92EAF3BFA5416659DEDCD@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <20100916171337.GA27644@emerald.iucr.org> John > If a CIF processor is presented with a file having no version > identification comment, then which syntax would you want it to assume? I'm thinking that's an implementation decision for the author of that particular CIF processor. Probably most authors should assume it's a CIF2: as you go on to note, most well-formed CIF1s will be fully conformant CIF2s. If the author is sufficiently motivated, he can devise relatively graceful ways to handle resultant syntax errors (safest is probably to abort but issue a suggestion that the input file is using a feature no longer supported; depending on the application, he may provide a switch to process as a CIF1). If not sufficiently motivated, the input should be handled in whatever way he has chosen to handle other syntax errors. Declaring a file to be CIF2 doesn't guarantee it won't have any syntax errors :-) The latter might well annoy users who don't understand what the problem is, but an enhanced vcif would, one hopes, be available to help explain what's gone wrong. > Making CIF2 accommodate all of CIF1 would require substantial changes > to those parts of the draft that we have considered settled. Is > it that important to you? No, I don't want to revisit those previous discussions. I accept that formally CIF2 is not 100% backwards compatible with CIF1. 
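As a rough illustration of the magic-string idea discussed above (an optional BOM followed by `#\#CIF_2.0`), a processor might sniff the version along the following lines. This is only a sketch under stated assumptions: the function name `sniff_cif_version`, the restriction to a UTF-8 BOM, and the three return values are hypothetical and are not taken from the draft specification. What to do with an "unknown" result is left open, since that is exactly the implementation decision described in the message.

```python
import codecs

# Version comments as they appear at the very start of a file.
CIF2_MAGIC = b"#\\#CIF_2.0"   # the bytes of  #\#CIF_2.0
CIF1_MAGIC = b"#\\#CIF_1."    # prefix shared by #\#CIF_1.0 and #\#CIF_1.1


def sniff_cif_version(data: bytes) -> str:
    """Return 'CIF2', 'CIF1' or 'unknown' from the leading bytes of a file.

    Assumption: only a UTF-8 BOM is skipped; other encodings are not handled.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    if data.startswith(CIF2_MAGIC):
        return "CIF2"
    if data.startswith(CIF1_MAGIC):
        return "CIF1"
    # No version comment: the caller decides whether to assume CIF2,
    # fall back to CIF1, or reject the file.
    return "unknown"
```

A caller could read the first few dozen bytes of a file in binary mode and pass them to `sniff_cif_version` before choosing a parser.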
I'm also keen that we encourage the adoption of structured comments as routine good practice, but I'm reluctant to make them mandatory, largely for the reason Herbert gave some time ago. If an author sends you a CIF which is perfectly sound but lacks such a header, and you send it back demanding one, he will doubtless supply one - but who can say whether it's credible? This isn't a game-stopper for me: if we were to vote on this specific point, I'd say "make it optional", but if the majority view were otherwise, fair enough. It probably wouldn't stop IUCr from creating in-house applications that tried to process such "defective" CIFs anyway! Regards Brian On Thu, Sep 16, 2010 at 11:19:13AM -0500, Bollinger, John C wrote: > Hi Brian, > > On Thursday, September 16, 2010 8:18 AM, you wrote: > [...] > >I favour the specification *recommending* a magic string to begin a > >file: an optional BOM followed by the 11 characters > > > >#\#CIF_2.0 > > Can you expand on that a bit? Ignoring all considerations of character encoding, CIF 2.0 syntax is not 100% backwards compatible with CIF 1 syntax. (Handling of quoted strings is the most prominent area of incompatibility, but there are others.) If a CIF processor is presented with a file having no version identification comment, then which syntax would you want it to assume? > > [...] > > >These are recommendations, not requirements, > > > >1. to include existing CIF1.0 and CIF1.1 instances as valid CIF input > >streams (whether "decorated" or not; > > Making the version comment optional in CIF2, as you suggest, would make *most* well-formed CIF1 instances also be well-formed CIF2 instances -- but not all of them. Making CIF2 accommodate all of CIF1 would require substantial changes to those parts of the draft that we have considered settled. Is it that important to you? > > > Thanks, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. 
Jude Children's Research Hospital > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding From jamesrhester at gmail.com Thu Sep 16 22:32:04 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 17 Sep 2010 07:32:04 +1000 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCC@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <53458.94290.qm@web87002.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDCC@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Hi John: you are correct, I indeed meant Unicode, not UTF8, and certainly would not want to allow control over encoding to be defined on a per-item basis. I am working on a response to your most recent proposal, but considerable research is required. I'll try not to make it a blockbuster this time! On Fri, Sep 17, 2010 at 1:48 AM, Bollinger, John C wrote: > James, > > On Thursday, September 16, 2010 8:54 AM, James Hester wrote: > [...] >>For my part, I think the IUCr could handle manuscript submissions as follows: > > For the most part I think your suggestions are reasonable (and so I omit them), but I hope you will clarify one of them: > >>(iii) UTF8 introduction can be staged relatively slowly, starting from >>allowing it in a few non-essential datanames (e.g. defining >>_author_name_native_script or somesuch). 
Let's remember that on day 1 >>everything can still be ASCII as the dictionaries will be able to >>restrict character sets to ASCII > > Are you suggesting that the character encoding of individual data values be configurable in the dictionary? I suspect and hope that where you wrote "UTF8", you meant something more like "Unicode" -- i.e. the set of allowed (literal) characters, not their encoding. Is that right? > > If UTF-8 emerges as the only permitted encoding for CIF2 then this will be a mainly semantic difference, but it nevertheless has implications for software design and behavior. If UTF-8 does *not* emerge as the only permitted encoding for CIF2 then this will be a tremendous difference. > > > Best, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From jamesrhester at gmail.com Fri Sep 17 08:41:48 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 17 Sep 2010 17:41:48 +1000 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Hi John: good to see further constructive suggestions.
Regarding your UTF8/16 + local proposal: I think I'd be willing to accept UTF16 in addition to UTF8 (see below). Regarding local encoding, note this blog posting from a Microsoft .Net developer, entitled "Don't Use Encoding.Default" http://blogs.msdn.com/b/shawnste/archive/2005/03/15/don-t-use-encoding-default.aspx Indeed, all of the developer-oriented material that I have looked at concerning Microsoft platforms recommends that the developer consciously *chooses* a Unicode-based encoding where possible, that is, ignores any local defaults. In fact, it is rather difficult to find any instructions as to how to determine the platform's "local" encoding. By reading Python source code, I found two Microsoft API functions, "GetACP" and "GetOEMCP", mentioned above, that can be used to determine the default/preferred encoding as an ANSI code page (see http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx). The online documentation for both functions contains the following bland comment: " The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use UTF-8 or UTF-16 when possible." My concern precisely. And: these files with local encoding still need some sort of mechanism to allow reliable transmission. And what about remote filesystem mounts for shared files? If one computer has a different local encoding and stores a file on its "local" filesystem, the next computer to access that "local" file may have a different "local" encoding and get it wrong. And so on. Frankly, I still see no merit in including local encodings in CIF2 at all. 
If the rest of you disagree, I won't argue about it further, but instead will attempt to mitigate the damage by supporting the following moves: (i) compliant CIF processors are *not* required to accept files in local encoding; (ii) CIF developer documentation outlines the reasons that "local" encoding is a bad idea; (iii) the IUCr and databases are urged to make submitters check round-trip files if they have received files in non-UTF8/UTF16 form; (iv) the IUCr and databases encourage UTF8 submission; (v) CIF developer documentation outlines the preferred methods of determining local encoding in a variety of languages and platforms. (I have added an addendum on local encodings with more information if anybody is interested) On Tue, Sep 14, 2010 at 4:52 AM, Bollinger, John C wrote: > > On Sunday, September 12, 2010 11:26 PM, James Hester wrote: > [...] >>To my mind, the encoding of plain CIF files remains an open issue. I >>do not view the mechanisms for managing file encoding that are >>provided by current OSs to be sufficiently robust, widespread or >>consistent that we can rely on developers or text editors respecting >>them [...]. > > I agree that the encoding of plain CIF files remains an open issue. > > I confess I find your concerns there somewhat vague, especially to the extent that they apply within the confines of a single > machine. Do your concerns extend to that level? If so, can you provide an example or two of what you fear might go wrong in that > context? A concrete example: a scientist in a multilingual country (e.g. Ukrainian/Russian/English in Ukraine) is used to switching locales to get legacy programs (i.e. those that rely on "default" encoding!) to display and/or input text properly. CIF files written in "local" encoding using one locale will not be read correctly in a different locale on the same machine.
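The locale-switching problem in that example can be sketched in a few lines. The snippet below is only an illustration: the sample string and the particular code pages (`cp1251` for the writer, `koi8_u` for the reader) are assumptions chosen to stand in for two Cyrillic locales on one machine, not anything reported in the thread.

```python
# Sample text: a Ukrainian place name (hypothetical CIF data value).
text = "Київ"

# Written to disk by a program using a Windows Cyrillic "local" encoding.
stored = text.encode("cp1251")

# Read back by a program running under a different Cyrillic locale.
# No exception is raised: every byte is valid in KOI8-U, it just means
# a different character, so the corruption is silent.
misread = stored.decode("koi8_u")

# The same bytes read with the encoding they were written in are fine.
reread = stored.decode("cp1251")

# UTF-8 does not depend on the locale of either program.
via_utf8 = text.encode("utf-8").decode("utf-8")

print(misread == text)   # False
print(reread == text)    # True
print(via_utf8 == text)  # True
```

The point of the sketch is that the failure produces no decoding error at all, which is why a reader relying on the local default has no way of noticing it.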
I note the following sentence in Microsoft's guide to encodings at http://msdn.microsoft.com/en-us/library/ms404377.aspx: "However, when you have the opportunity to choose an encoding, you are strongly recommended to use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding". I am simply following this recommendation, except that I think we can save our developers some angst by making the appropriate choice for them, so that they don't have to contend with those developers that haven't thought about the issues. > As Herb recently wrote, "Multiple encodings are a fact of life when working with text." CIF2 looks like text, it feels like text, and > despite some exotic spice, it tastes like text -- even in UTF-8 only form. We cannot pretend that we're dealing with anything other > than text. We need to accept, therefore, that no matter what we do, authors and programmers will need to account for multiple > encodings, one way or another. The format specification cannot relieve either group of that responsibility. And multiple encodings will continue to be a fact of life if we actively encourage their proliferation. We can at least reduce the amount that programmers need to consider multiple encodings by not building the problem into the specification. Then programmers only need to contend with non-conformant behaviour, to which a reasonable approach is gentle, informative rejection of the file. I acknowledge that there seems to be a difference in perceptions as to how widespread non-conformance will be (I think it will be negligible and manageable with a little education). > That doesn't necessarily mean, however, that CIF must follow the XML model of being self-defining with regard to text encoding. > Given CIF's various uses, we gain little of practical value in this area by defining CIF2 as UTF-8 only, and perhaps equally little by > defining required decorations for expressing random encodings.
> Moreover, the best reading of CIF1 is that it relies on the *local* text conventions, whatever they may be, which is quite a different thing than handling all text conventions that might conceivably be employed.
>
> With that being the case, I don't think it needful for CIF2 in any given environment to endorse foreign encoding conventions other than UTF-8. CIF2 reasonably could endorse UTF-16 as well, though, as that cannot be confused with any ASCII-compatible encoding. Allowing UTF-16 would open up useful possibilities both for imgCIF and for future uses not yet conceived. Additionally, since CIF is text I still think it important for CIF2 to endorse the default text conventions of its operating environment.

If Microsoft documents are to be believed, they would rather developers *didn't* try to figure out what the default encoding is. Perhaps CIF2 should instead endorse the position of just about everybody writing about encodings, including the producers of the operating environment..."choose UTF8 if you have a choice"?

> Could we agree on those three as allowed encodings? Consider, given that combination of supported alternatives and no extra support from the spec, how might various parties deal with the unavoidable encoding issue. Here are some of the more reasonable alternatives I see:
>
> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>
>        Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The responsibility to perform any needed transcoding is on the other party. This is just as it might be with UTF-8-only.
>
>        Option b) in addition to supporting UTF-8 and/or UTF-16, support other encodings by allowing users to explicitly specify them as part of the submission/retrieval process. The processor / repository would either ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16]. This also is just as it might be with UTF-8 only.
As discussed before, users are not necessarily going to know what their local encoding is, making the selection untrustworthy. Only option (a) is viable.

> 2. Programs and Libraries:
>
>        Option a) On input, detect encoding by checking first for UTF-16, assuming UTF-8 if not UTF-16, and falling back to default text conventions if a UTF-8 decoding error is encountered. On output, encode as directed by the user (among the two/three options), defaulting to the input encoding when that is available and feasible. These would be desirable behaviors even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, but they do exceed UTF-8-only requirements.

I don't think the user would necessarily know which encoding to prefer if offered a choice. I believe the safest route is to output in the same encoding as the input. That at least avoids introducing errors when the local encoding differs from what the previous program thought it was, errors that would then be preserved on transcoding to UTF8/16. So option (a) is not viable.

>        Option b) Require input and produce output according to a fixed set of conventions (whether local text conventions or UTF-8/16). The program user is responsible for any needed transcoding. This would be sufficient for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those differ, however, in which text conventions would be assumed.

This is acceptable in that it doesn't make anything worse by producing incorrect UTF8/16 text due to use of incorrect local encoding. When the time comes to transcode to UTF8, some user interaction for checking of the encoding is necessary, so it should not be done silently.

> 3. Users/Authors:
> 3.1. Creating / editing CIFs
>        No change from current practice is needed, but users might choose to store CIFs in UTF-8[/16] form. This is just as it would likely be under UTF-8 only.

I assume by "current practice" you mean editing files in "local" encoding?

> 3.2.
Transferring CIFs
>        Unless an alternative agreement on encoding can be reached by some means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.

Note of course that I consider that a CIF is transferred every time it is written to a filesystem, under which definition local encoding would not be allowed. In any case, I would tighten up this requirement to be UTF8 unless both parties agree on UTF16.

> 3.3. Receiving CIFs
>        The receiver may reasonably demand that the CIF be provided in UTF-8[/16] form. He should *expect* that form unless some alternative agreement is established. Any desired transcoding from UTF-8[/16] to an alternative encoding is the user's responsibility. Again, this is not significantly different from the UTF-8 only case.
>
>
> A driving force in many of those cases is the well-understood (especially here!) fact that different systems cannot be relied upon to share text conventions, thus leaving UTF-8[/16] as the only available general-purpose medium of exchange. At the same time, local conventions are not forbidden from use where they can be relied upon -- most notably, within the same computer. Even if end-users, as a group, do not appreciate those details, we can ensure via the spec that CIF2 implementers do. That's sufficient.

As I've said in my addendum, with guidance, most CIF2 programs could probably come up with consistent identification of the local encoding on any given day. Whether that corresponds to the same encoding used for any given CIF file on the "local" filesystem is another thing, depending on what the code page was on the day it was written and whether it was even written by the same system (i.e. shared mounts). So, saying that local text conventions can be relied upon within the one computer is a bit of a stretch, as I've discussed above. I agree that we only care about the implementers in this case.
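For reference, the input-side detection order quoted earlier in this message under option 2(a) (check for UTF-16, otherwise assume UTF-8, fall back to default text conventions on a UTF-8 decoding error) might look roughly like the sketch below. The function name `decode_cif_bytes`, the use of a byte-order mark to recognise UTF-16, and the optional `local_encoding` override are all assumptions for illustration; the quoted proposal does not say how UTF-16 is to be recognised.

```python
import codecs
import locale
from typing import Optional, Tuple

def decode_cif_bytes(raw: bytes, local_encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used) following the UTF-16 -> UTF-8 -> local order."""
    # Step 1: treat a UTF-16 byte-order mark as the sign of UTF-16 (an assumption).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16"), "utf-16"
    # Step 2: otherwise assume UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Step 3: fall back to the platform's default text encoding.
        fallback = local_encoding or locale.getpreferredencoding(False)
        return raw.decode(fallback), fallback

text, used = decode_cif_bytes("#\\#CIF_2.0\ndata_example\n".encode("utf-8"))
print(used)
```

Note that step 3 is a guess rather than a detection: any byte sequence that is not valid UTF-8 is decoded with whatever the current default happens to be, which is exactly the case debated in this thread.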
> So, if pretty much all my expected behavior under UTF-8[/16]+local is the same as it would be under UTF-8-only, then why prefer the former? Because under UTF-8[/16]+local, all the behavior described is conformant to the spec, whereas under UTF-8 only, a significant proportion is not. If the standard adequately covers these behaviors then we can expect more uniform support. Moreover, this bears directly on community acceptance of the spec. If flouting the spec with respect to encoding becomes common, then the spec will have failed, at least in that area. Having failed in one area, it is more likely to fail in others.

We disagree on the "significant proportion". I think (with perhaps as little hard evidence as you? Or do you know something I don't?) that very few CIF2 programmers will want to support the default encoding, especially given the difficulties described above, and those users with a penchant for editing CIF files will learn very quickly how to choose UTF8 in a drop-down menu if said programs provide an error message pointing to an IUCr webpage (for example).

I have few objections (now) to including UTF16, provided that any files in UTF16 encoding are explicitly negotiated as such. My original objection to UTF16 was based on users with an ASCII-compatible workflow opening a CIF2 file for viewing or editing and seeing junk. If such files only appear on these users' systems by deliberate request, this is not such a big deal. In all other aspects UTF16 satisfies my original requirements, most obviously identifiability. I would still stack the dice in favour of UTF8, however.

James

========================================

Addendum on local encoding (not germane to above argument in the end):

Before accepting "local", we need to be sure that we know that "local encoding" is a well-defined concept.
For "local encoding" to be a well-defined concept, we would require that programmers using different programming languages will be able to independently determine which encoding is the local encoding from within their various programs. If different local programs do not agree on what the local encoding is, one program will write files in one "local" encoding, which is then input by another program assuming a different "local" encoding, and all sorts of confusion ensues, especially after the second program thoughtfully transcodes to UTF8. (Note that programs will usually have no way of telling if they have correctly determined what the "local" encoding is, as the CIF file itself will parse fine in any ASCII-compatible encoding). My preliminary investigations suggest that even Windows manages to be more or less consistent on the "single local encoding" front, via use of the GetACP() function (used by at least CPython and Gnu Java). MacOS has a system default encoding, and Unix variants use the LANG variable. Fortran 2003 has an ENCODING=DEFAULT option which in gfortran simply does nothing (ie passes the bytes in a character string directly as is to disk), so a Fortran program wishing to offer the local option would need to implement the encoding machinery themselves. Anyway, I would not immediately exclude "local encoding" for being ill-defined. -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From John.Bollinger at STJUDE.ORG Fri Sep 17 16:33:23 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Fri, 17 Sep 2010 10:33:23 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . 
In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local> Hi James, On Friday, September 17, 2010 2:42 AM, James Hester wrote: [...] >Regarding your UTF8/16 + local proposal: I think I'd be willing to >accept UTF16 in addition to UTF8 (see below). I do favor supporting UTF-16 in addition to UTF-8, so I'm pleased you're willing to agree to that, but that's not the central theme of the proposal. Nevertheless, it feels like we're coming close to a resolution. My apologies if the rest of my response is long-winded; the key points are (i) Are we ready to / do we need to vote on the local encodings question? (ii) With some caveats, I support your mitigating responses to allowing local encodings. And so... > Regarding local >encoding, note this blog posting from a Microsoft .Net developer, >entitled "Don't Use Encoding.Default" >http://blogs.msdn.com/b/shawnste/archive/2005/03/15/don-t-use-encoding-default.aspx I hadn't seen that particular post before, but many Java people, too, regard explicitly specifying text encodings as the best practice. Partly from that background sprang my support for various tagging proposals that have been floated here. Unfortunately, that train has long since left us behind on the platform. New standard notwithstanding, I don't see an opportunity to effect an abrupt shift in program and user behavior -- specifically, the behavior of using default text conventions implicitly and routinely. 
If we formally require UTF-8/16, it can only be with the understanding that many users and programs will ignore that requirement altogether. I don't find that at all appealing or useful, and I do not support it. I think we will achieve more consistent CIF2 software, and we will better influence programmers and users, by standardizing the use of default text conventions with CIF2. I would be content to deprecate such use. I would favor non-normative commentary in the spec that explains the issue and discourages reliance on default text encoding. I would also favor publicizing resources describing how to convert local text to UTF-8 (or -16), and creating such resources if necessary. I want to see people using UTF-8/16 for their CIFs, but I don't want to cut them off, standards-wise, when they don't. [...] >In fact, it is rather difficult to >find any instructions as to how to determine the platform's "local" >encoding. The point of default conventions is that you don't have to determine what they are, you just use them. In fact, in some programming environments, there is no easy way to do otherwise. For example, to the best of my knowledge, there is no way to write a standard-conformant Fortran 95 program that portably reads text from a file in anything but the default encoding. >" The ANSI code pages can be different on different computers, or can >be changed for a single computer, leading to data corruption. For the >most consistent results, applications should use UTF-8 or UTF-16 *when >possible*." (Emphasis added.) I second that advice, and I would be happy to have non-normative comments to that effect in the CIF2 standard. The situation for the standard, however, is not the same as for a program. It is valuable to standardize even practices that we frown upon when we have every reason to expect that such practices will continue. >My concern precisely. And: these files with local encoding still need >some sort of mechanism to allow reliable transmission. 
And what about >remote filesystem mounts for shared files? If one computer has a >different local encoding and stores a file on its "local" filesystem, >the next computer to access that "local" file may have a different >"local" encoding and get it wrong. The mechanism for reliable transmission is to transcode, if necessary, to UTF-8/16, and transmit the result. This is exactly the same mechanism that would be available for reliable transmission if UTF-8 were the only standardized encoding (under which case I include transmission of non-UTF-8 almost-CIFs). The mechanism is the same for reliably sharing CIFs among environments where compatibility of default conventions is uncertain. I see no reason to believe that users' decisions whether to employ that mechanism will be driven by anything other than practical considerations, the standard's position notwithstanding. I would expect some programmers to be more influenced by the standard, but in the end they are faced with the same practical considerations. > And so on. Frankly, I still see no >merit in including local encodings in CIF2 at all. I value standardizing behavior that we all (I think) expect will be common, even though that behavior isn't ideal. In that way I expect to support well-defined and consistent responses to that behavior (mainly in software). Given that I have said so before without persuading you, we will have to agree to disagree here. >If the rest of >you disagree, I won't argue about it further, Is that a call for a vote? > but instead will attempt >to mitigate the damage by supporting the following moves: > >(i) compliant CIF processors are *not* required to accept files in >local encoding; It is inconsistent to allow local text conventions in the file format definition, but to permit conformant processors to reject them. 
Additionally, I oppose inclusion of any explicit requirements on CIF processors, preferring instead to rely on the format specification to define what conformant processors must do. I could, however, accept defining separate flavors of CIF distinguished by these encoding distinctions, so that programs could conform to one, the other, or both. I'm not sure I like that, but I think I could agree to it if it helps us wrap this up. >(ii) CIF developer documentation outlines the reasons that "local" >encoding is a bad idea I support that fully. >(iii) the IUCr and databases are urged to make submitters check >round-trip files if they have received files in non UTF8/UTF16 form I think that's a good idea. >(iv) the IUCr and databases encourage UTF8 submission. Absolutely. As I have written before, I think it would be an even better idea for IUCr and databases to *require* that CIFs be submitted to them in UTF-8/16 form (or even in UTF-8 form exclusively), but there are legitimate reasons why they might not want to adopt such a policy. >(v) CIF developer documentation outlines the techniques for >ascertaining the preferred method of determining local encoding in a >variety of languages and platforms. Ok. As I wrote above, the whole point of default encodings is you don't need to figure it out. By definition, it's what you get when you don't specify (or when you generically ask for defaults). On the other hand, there might be special cases (or ordinary ones that I have not considered) where you do have to figure it out after all. Information about how to do so is relevant. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Fri Sep 17 19:34:08 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Fri, 17 Sep 2010 14:34:08 -0400 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . 
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local > References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local > Message-ID: It may help this discussion to refer to the CIF 1.1 syntax specification, which says: Character set 22. Characters within a CIF are restricted to certain printable or white-space characters. Specifically, these are the ones located in the ASCII character set at decimal positions 09 (HT or horizontal tab), 10 (LF or line feed), 13 (CR or carriage return) and the letters, numerals and punctuation marks at positions 32-126. The ASCII characters at decimal positions 11 (VT or vertical tab) and 12 (FF or form feed), often included in library implementations as white space characters, are explicitly excluded from the CIF character set at this revision. 23. The reference to the ASCII character set is specifically to identify characters in an established and widely available standard. It is understood that CIFs may be constructed and maintained on computer platforms that implement other character-set encodings. However, for maximum portability only the characters identified in the section above may be used. Other printable characters, even if available in an accessible character set such as Unicode, must be indicated by some encoding mechanism using only the permitted characters. At this revision, only the encoding convention detailed in paragraphs 30-37 of the document Common semantic features is recognised for this purpose. 
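As a rough illustration of paragraph 22 quoted above, the CIF 1.1 character set can be checked in a few lines. This is only a sketch: the names `CIF1_ALLOWED` and `disallowed_cif1_chars` are hypothetical, and the check works on already-decoded text, so it says nothing about which encoding the file was stored in.

```python
# Code points permitted by CIF 1.1 paragraph 22: HT (9), LF (10), CR (13)
# and the printable ASCII range 32-126. VT (11) and FF (12) are excluded.
CIF1_ALLOWED = frozenset([9, 10, 13]) | frozenset(range(32, 127))

def disallowed_cif1_chars(text: str) -> list:
    """Return (index, character) pairs for characters outside the CIF 1.1 set."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) not in CIF1_ALLOWED]

print(disallowed_cif1_chars("data_x\n_cell_length_a\t5.43\r\n"))  # nothing flagged
print(disallowed_cif1_chars("a\x0bb\x0c\u00e9"))  # VT, FF and a non-ASCII letter are flagged
```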
To end this promptly and get on with actually using CIF2, I formally propose to a vote on the following wording, which combines what has already been put forth in "CIF Changes to the specification 05 July 2010" with the beginning of the CIF 1.1 syntax specification paragraph 23, and that we leave all the remaining details on how best to deal with multiple character encodings for future discussion. =============================================================== Proposed position on CIF2 character encodings submitted to COMCIFS for a vote as an interim agreement on what can be agreed thus far, subject to extension and refinement in the future. =============================================================== Reference to character(s) means abstract characters assigned code points by Unicode. Specific characters are referenced according to Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit hexadecimal representation of the assigned code point. The designated character encoding for CIF2 is UTF-8 as the preferred concrete representation of the information in a CIF2 document. Reference to ASCII characters means characters U+0000 through U+007F, or, equivalently the first 128 characters of the ISO-8859-1 (LATIN-1) character set. Reference to newline or \n means the sequence that conventionally terminates a line record (which is environment dependent). Reference to whitespace means the characters ASCII space (U+0020), ASCII horizontal tab (U+0009) and the newline characters. Without regard to local convention, the various other characters that Unicode classifies as whitespace (character categories Zs and Zp) do not constitute whitespace for the purposes of CIF2. CIF2 files are standard variable length text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. 
As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.

References to Unicode and UTF-8 are specifically to identify characters and a concrete representation of those characters in an established and widely available standard. It is understood that CIF2 documents may be constructed and maintained on computer that implements other character encodings. However, for maximum portability only the clearly identified equivalents to the Unicode characters identified above and below should be used and use of UTF-8 for a concrete representation is highly recommended.

A CIF2 file is uniquely identified by a required magic code at the beginning of its first line. The code is,

#\#CIF_2.0

followed immediately by whitespace. The addition of further information to assist in disambiguation among multiple character sets is under discussion. Encodings, such as UTF-16, which prefix a file by a BOM (byte-order mark) or other encoding disambiguation prefix are not precluded. In such a case, the magic code should follow the encoding disambiguation prefix.

In keeping with XML restrictions we allow the characters

U+0009
U+000A
U+000D
U+0020 -- U+007E
U+00A0 -- U+D7FF
U+E000 -- U+FDCF
U+FDF0 -- U+FFFD
U+10000 -- U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is any hexadecimal digit are disallowed. Unicode reserves the code points E000 - F8FF for private use. The IUCr and only the IUCr may specify what characters are assigned to these code points in the context of CIF2.

CIF2 processors are required to treat U+000A, U+000D and the sequence U+000D U+000A as newline characters, by normalising them to U+000A on read. No other characters or character sequences may represent newline. In particular, CIF2 processors should not interpret the Unicode characters U+2028 (line separator) or U+2029 (paragraph separator) as newline.

--
===================================================== Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From simonwestrip at btinternet.com Fri Sep 17 22:12:31 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Fri, 17 Sep 2010 14:12:31 -0700 (PDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local > Message-ID: <865926.34904.qm@web87007.mail.ird.yahoo.com> On first reading, I am tempted to support this, recognizing that this issue is going to require a deal of effort to accommodate current practice whatever is specified, and hoping that considerable support (documentation and software utilities) will be available to both users and developers (e.g. along the lines of Brian's recent contribution). However, I would like UTF-8 to be established as the default encoding that all CIF processors should be able to handle and should assume in the absence of any pointers to the contrary (but then this could be championed within the supporting material if it cannot find a place in the formal specification). Cheers Simon ________________________________ From: Herbert J. Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Friday, 17 September, 2010 19:34:08 Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . 
[...]

From John.Bollinger at STJUDE.ORG Fri Sep 17 22:42:24 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Fri, 17 Sep 2010 16:42:24 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. ..
. In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local > Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDD5@SJMEMXMBS11.stjude.sjcrh.local> On Friday, September 17, 2010 1:34 PM, Herbert J. Bernstein wrote >=============================================================== > >Proposed position on CIF2 character encodings submitted to >COMCIFS for a vote as an interim agreement on what can be >agreed thus far, subject to extension and refinement in >the future. > >=============================================================== [...] CIF2 files are standard variable length text files [...] References to Unicode and UTF-8 are specifically to identify characters and a concrete representation of those characters in an established and widely available standard. It is understood that CIF2 documents may be constructed and maintained on computer that implements other character encodings. [...] I understand you are anxious to finish standardizing, but I cannot support that language. It is easily susceptible to at least two conflicting interpretations, one of which is too broad, and the other too narrow: 1) Any text encoding whatever may be used for any CIF, anywhere. 2) Only the encoding(s) recognized by default for "text" in some particular context is supported for CIFs in that context. (This precludes UTF-8 in at least some places.) If there are other reasonable interpretations then that only makes the problem worse. That CIF1 suffers from the same ambiguity (albeit with less impact owing to its restricted character set) does not change anything. 
I, too, want to wrap this up, and I think we are close to doing so. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Fri Sep 17 23:06:01 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Fri, 17 Sep 2010 18:06:01 -0400 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDD5@SJMEMXMBS11.stjude.sjcrh.local > References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDD5@SJMEMXMBS11.stjude.sjcrh.local > Message-ID: We can keep debating complex changes to what has already been well-established CIF practice, and delay implementation of CIF2 longer, or even forever, or we can work from where we are now and let these issues be resolved in their own time while we get experience with CIF2. I am formally asking for a vote of all COMCIFS voting members on the proposed wording as it stands. -- Herbert At 4:42 PM -0500 9/17/10, Bollinger, John C wrote: >On Friday, September 17, 2010 1:34 PM, Herbert J. Bernstein wrote > >>=============================================================== >> >>Proposed position on CIF2 character encodings submitted to >>COMCIFS for a vote as an interim agreement on what can be >>agreed thus far, subject to extension and refinement in >>the future. >> >>=============================================================== >[...] 
>CIF2 files are standard variable length text files >[...] >References to Unicode and UTF-8 are specifically to identify characters >and a concrete representation of those characters in an established and >widely available standard. It is understood that CIF2 documents may >be constructed and maintained on computer that implements other character >encodings. >[...] > >I understand you are anxious to finish standardizing, but I cannot >support that language. It is easily susceptible to at least two >conflicting interpretations, one of which is too broad, and the >other too narrow: > >1) Any text encoding whatever may be used for any CIF, anywhere. > >2) Only the encoding(s) recognized by default for "text" in some >particular context is supported for CIFs in that context. (This >precludes UTF-8 in at least some places.) > >If there are other reasonable interpretations then that only makes >the problem worse. That CIF1 suffers from the same ambiguity >(albeit with less impact owing to its restricted character set) does >not change anything. > > >I, too, want to wrap this up, and I think we are close to doing so. > > >Regards, > >John >-- >John C. Bollinger, Ph.D. >Department of Structural Biology >St. Jude Children's Research Hospital > > >Email Disclaimer: www.stjude.org/emaildisclaimer > >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From John.Bollinger at STJUDE.ORG Fri Sep 17 23:42:59 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Fri, 17 Sep 2010 17:42:59 -0500 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .. .. . 
In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDD5@SJMEMXMBS11.stjude.sjcrh.local > Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDD6@SJMEMXMBS11.stjude.sjcrh.local> On Friday, September 17, 2010 5:06 PM, Herbert J. Bernstein wrote: >I am formally asking for a vote of all COMCIFS voting members on the proposed >wording as it stands. It is your privilege to do so, but if you genuinely want a formal COMCIFS vote -- as opposed to a vote of the participants in this discussion -- then I do not believe the motion is in order in this forum. If you would be satisfied with a vote of the discussion participants, then I vote NO. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Sat Sep 18 02:39:13 2010 From: yaya at bernstein-plus-sons.com (Herbert J. 
Bernstein) Date: Fri, 17 Sep 2010 21:39:13 -0400 Subject: [Cif2-encoding] Request for a vote on a motion In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDD6@SJMEMXMBS11.stjude.sjcrh.local > References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDD5@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDD6@SJMEMXMBS11.stjude.sjcrh.local > Message-ID: Point well taken. To ensure that the voting members of COMCIFS see this message in their role as voting members of COMCIFS, I am copying it to that list as well. Dear Colleagues on COMCIFS. There has been a long and interesting discussion of various alternatives for handling or not handling various encodings of text for CIF2 and/or recasting CIF as a UTF-8 based binary format. These discussions have not achieved anything approximating agreement on these issues, but we do seem to have agreement on the importance of extending CIF to handle Unicode, and on the desirability of UTF8 serving the role that ASCII has long held as a preferred default encoding for CIF. In my opinion, it is a mistake to hold release of CIF2 hostage to what is, frankly, a minor side-issue. CIF has long-established practices with respect to the role of ASCII and text in CIF that have been formally adopted by COMCIFS in the form of the current CIF 1.1 syntax specification.
The appended proposed resolution combines the currently adopted CIF practice with respect to ASCII and text with the minimal changes necessary to follow those same practices with respect to Unicode and UTF8, accepting what has been proposed to the community as a whole as the changes that will make CIF2. In the 2+ months since that proposal, nobody has objected to anything in that document other than to the way in which UTF8 was proposed. The motion below simply combines what has not caused objection with the practices previously adopted by COMCIFS and de facto in use for many years. Until we have agreement on something else -- which seems likely to take years more of debate -- I urge all concerned to accept this imperfect motion as the best we can do for now so CIF2 can be used. Please consult sections 22 and 23 of the CIF 1.1 syntax specification and the "CIF Changes to the specification 05 July 2010", and then please consider the appended motion submitted for formal vote by COMCIFS. If anybody likes the basic idea of a minimally disruptive change to what has already been agreed for CIF2, but wants some minimal wording changes, the right way to do that is just to propose the specific wording changes as an amendment to this motion to be voted on first, but please, let us get something agreed to promptly. Elephants get born in less time than CIF2 is taking. =============================================================== Proposed position on CIF2 character encodings submitted to COMCIFS for a vote as an interim agreement on what can be agreed thus far, subject to extension and refinement in the future. =============================================================== Reference to character(s) means abstract characters assigned code points by Unicode. Specific characters are referenced according to Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit hexadecimal representation of the assigned code point.
The designated character encoding for CIF2 is UTF-8 as the preferred concrete representation of the information in a CIF2 document. Reference to ASCII characters means characters U+0000 through U+007F, or, equivalently, the first 128 characters of the ISO-8859-1 (LATIN-1) character set. Reference to newline or \n means the sequence that conventionally terminates a line record (which is environment dependent). Reference to whitespace means the characters ASCII space (U+0020), ASCII horizontal tab (U+0009) and the newline characters. Without regard to local convention, the various other characters that Unicode classifies as whitespace (character categories Zs and Zp) do not constitute whitespace for the purposes of CIF2. CIF2 files are standard variable length text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names. References to Unicode and UTF-8 are specifically to identify characters and a concrete representation of those characters in an established and widely available standard. It is understood that CIF2 documents may be constructed and maintained on computer that implements other character encodings. However, for maximum portability only the clearly identified equivalents to the Unicode characters identified above and below should be used and use of UTF-8 for a concrete representation is highly recommended. A CIF2 file is uniquely identified by a required magic code at the beginning of its first line. The code is #\#CIF_2.0, followed immediately by whitespace. The addition of further information to assist in disambiguation among multiple character sets is under discussion. Encodings, such as UTF-16, which prefix a file by a BOM (byte-order mark) or other encoding disambiguation prefix are not precluded.
In such a case, the magic code should follow the encoding disambiguation prefix. In keeping with XML restrictions we allow the characters U+0009 U+000A U+000D U+0020 -- U+007E U+00A0 -- U+D7FF U+E000 -- U+FDCF U+FDF0 -- U+FFFD U+10000 -- U+10FFFD In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is any hexadecimal digit are disallowed. Unicode reserves the code points E000 - F8FF for private use. The IUCr and only the IUCr may specify what characters are assigned to these code points in the context of CIF2. CIF2 processors are required to treat , and as newline characters, by normalising them to on read. No other characters or character sequences may represent newline. In particular, CIF2 processors should not interpret the Unicode characters U+2028 (line separator) or U+2029 (paragraph separator) as newline. At 5:42 PM -0500 9/17/10, Bollinger, John C wrote: >On Friday, September 17, 2010 5:06 PM, Herbert J. Bernstein wrote: >>I am formally asking for a vote of all COMCIFS voting members on the proposed >>wording as it stands. > >It is your privilege to do so, but if you genuinely want a formal >COMCIFS vote -- as opposed to a vote of the participants in this >discussion -- then I do not believe the motion is in order in this >forum. > >If you would be satisfied with a vote of the discussion >participants, then I vote NO. > > >Regards, > >John >-- >John C. Bollinger, Ph.D. >Department of Structural Biology >St. Jude Children's Research Hospital > > >Email Disclaimer: www.stjude.org/emaildisclaimer > >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. 
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From jamesrhester at gmail.com Mon Sep 20 15:22:19 2010 From: jamesrhester at gmail.com (James Hester) Date: Tue, 21 Sep 2010 00:22:19 +1000 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <53458.94290.qm@web87002.mail.ird.yahoo.com> Message-ID: Brian, I believe Chester also distributes 'template' CIF files. I would suggest that any CIF2 template files distributed by the IUCr include commented Unicode text along the lines of both Herbert's suggestion a long time ago (I think it was a sequence of accented letter 'o's) and perhaps a set of Greek letters. If programmers (eg Shelx) picked up this template as well as authors, you might find your job of recovering from encoding errors somewhat easier. You would put an ASCII comment prior to the letters to indicate what should be visible in a Unicode-aware editor, and mention that incorrect display is only a problem if the data contain non-ASCII characters. I stress that this is purely a heuristic, probabilistic approach for your own use and not suitable for inclusion in the standard itself for these reasons. I plan to address your original proposal soon. James. On Thu, Sep 16, 2010 at 11:53 PM, James Hester wrote: > Hi Brian: I think that John B and Simon have answered your questions > adequately (ie you can forget about reliable autodetection of > non-Unicode encodings).
Chester is somewhat shielded from encoding > mixups by virtue of the fact that you are in contact with the author > of the CIF, who will have opportunities to catch encoding errors at > some stage prior to the manuscript being finalised. That said, the > less potential for mixup between multiple authors prior to submission, > the better. > > For my part, I think the IUCr could handle manuscript submissions as follows: > > (i) CheckCIF should report non-UTF8 encoding as a top-level warning, > with the warning message pointing to an IUCr-maintained webpage which > describes how to save/convert files to UTF8 encoding for a range of > popular editors > (ii) The standard should give as little encouragement to non-UTF8 > encodings as possible, to reduce the number of non-UTF8 submissions in > the first place > (iii) UTF8 introduction can be staged relatively slowly, starting from > allowing it in a few non-essential datanames (e.g. defining > _author_name_native_script or somesuch). Let's remember that on day 1 > everything can still be ASCII as the dictionaries will be able to > restrict character sets to ASCII > (iv) Authors are not required to choose an encoding upon submission, > but if non-UTF8 is detected, the authors will automatically be > presented with a PDF version of their CIF manuscript and advised to > check carefully, especially non-ASCII characters (Greek symbols!). > > On Thu, Sep 16, 2010 at 12:11 AM, SIMON WESTRIP > wrote: >> Hi Brian: I don't know of any standard text-encoding identifiers and >> transcoders. >> >> There are certainly SDKs out there that provide text codecs to read/write >> data; >> the trick is identifying the original encoding in order to select the codec. >> Interactive applications might resort to prompting the >> user to confirm the encoding by presenting them with a view of the text and >> a list of encodings - the >> user can toggle through the encodings until the document is rendered >> correctly (e.g. MS Word does this).
>> Obviously this is not ideal, but is something I've been thinking about as >> part of the web upload process. >> >> In addition, there's documentation on heuristic approaches to detecting >> encoding (as employed by browsers - indeed Mozilla makes its source >> available). >> I dont think this sort of autodetection will prove useful though, and >> actually may scupper an interactive encoding confirmation mechanism as >> described above! >> >> Cheers >> >> Simon >> >> >> ________________________________ >> From: Brian McMahon >> To: Group for discussing encoding and content validation schemes for CIF2 >> >> Sent: Wednesday, 15 September, 2010 13:39:27 >> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . >> >> I have said little or nothing on this list so far, because I'm >> not sure that I can add anything that's of concrete use. I've read >> the many contributions, all of them carefully thought through, and >> I still see both sides (actually, all sides) of the arguments. I >> am disinterested in the eventual outcome (but not "uninterested"). >> >> But, whatever the outcome, the IUCr will undoubtedly receive files >> *intended* by the authors as CIF submissions, that come in a variety of >> character-set encodings. For the most part, we will want to accept >> these without asking the author what the encoding was, not least >> because the typical author will have no idea (and increasingly, >> our typical author will struggle to understand the questions we are >> posing since English is not his or her native language - or perhaps we >> will struggle to understand the reply). >> >> So my concerns are: >> >> (1) how easily can we determine the correct encoding with which the >> file was generated; >> >> (2) how easily can we convert it into our canonical encoding(s) for >> in-house production, archiving and delivery? >> >> First a few comments on that "canonical encoding(s)". 
Simon and I have >> both been happy enough to consider UTF-8 as a lingua franca, since we >> perceive it as a reasonably widespread vehicle for carrying a large >> (multilingual) character set, and that is widely supported by many >> generic text processors and platforms. However, many of our existing >> CIF applications may choke on a UTF-8 file, and we may need to >> create working formats that are pure ASCII. I would also prefer to >> retain a single archival version of a CIF (well, ideally several >> identical copies for redundancy, but nonetheless a single *version*), >> from which alternative encodings that we choose to support for >> delivery from the archive can be generated on the fly. >> >> So, really, the desire would be to have standalone applications that >> can convert between character encodings on the fly. Does anyone know >> of the general availability of such tools? The more, reliable, >> conversions that can be made, the more relaxed we are about accepting >> multiple input encodings. I have to say that a very quick Google >> search hasn't yet thrown up much encouragement here. >> >> Now, back to (1). In similar vein, do you know of any standalone >> utilities that help in determining a text-file character encoding? >> >> [I'm happy to be educated, ideally off-list, in whether >> Content-Encoding negotiation in web forms can help here, since many >> of our CIF submissions come by that route, but I'm more interested in >> the general question of how you determine the encoding of a text file >> that you just happen to find sitting on the filesystem.] >> >> One utility we use heavily in the submission system is "file" >> (http://freshmeat.net/projects/file - we currently use version 4.26 >> with an augmented and slightly modified magic file). 
This is rather >> quiet about different character encodings, though I notice the magic >> file distributed with the more recent version 5.04 does have a >> "Unicode" section, namely: >> >> >> #------------------------------------------------------------------------------ >> # $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $ >> # Unicode: BOM prefixed text files - Adrian Havill >> >> # GRR: These types should be recognised in file_ascmagic so these >> # encodings can be treated by text patterns. >> # Missing types are already dealt with internally. >> # >> 0 string +/v8 Unicode text, UTF-7 >> 0 string +/v9 Unicode text, UTF-7 >> 0 string +/v+ Unicode text, UTF-7 >> 0 string +/v/ Unicode text, UTF-7 >> 0 string \335\163\146\163 Unicode text, UTF-8-EBCDIC >> 0 string \376\377\000\000 Unicode text, UTF-32, big-endian >> 0 string \377\376\000\000 Unicode text, UTF-32, little-endian >> 0 string \016\376\377 Unicode text, SCSU (Standard Compression Scheme for Unicode) >> >> Interestingly, the "animation" module of this new magic file >> conflicts with other possible UTF encodings: >> >> # MPA, M1A >> # updated by Joerg Jenderek >> # GRR the original test are too common for many DOS files, so test 32 <= kbits <= 448 >> # GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE) >> # FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries >> >> >> And, by the way, the "augmented" magic file we use (the one distributed as >> part of the KDE desktop distribution) already includes this section: >> >> # chemical/x-cif 50 >> 0 string #\#CIF_1.1 >> >10 byte 9 chemical/x-cif >> >10 byte 10 chemical/x-cif >> >10 byte 13
chemical/x-cif >> >> >> >> It seems to me that without some reasonably reliable discriminator, >> John's endorsement of support for "local" encodings will allow files >> to leak out into the wider world where they can't at all easily be >> handled or even properly identified. (Though, as many have argued >> persuasively, "forbidding" them is not going to prevent such files >> from being created, and possibly even used fruitfully within local >> environments.) >> >> Remember that many CIFs will come to us in the end after passage across >> many heterogeneous systems. I referred in a previous post to my own >> daily working environment - Solaris, Linux and Windows systems linked >> by a variety of X servers, X emulators, NFS and SMB cross-mounted >> filesystems, clipboards communicating with diverse applications >> and OSes running different default locales... >> [Incidentally, hasn't SMB now been superseded by "CIFS" !] >> >> Perhaps I'm just perverse; but I doubt that I'm quite unique. We'll >> also see files shuttled between co-authors with different languages, >> locales, OSes, and exchanged via email, ftp, USB stick etc. >> "Corruptions" will inevitably be introduced in these interchanges - >> sometimes subtle ones. For example, outside the CIF world altogether, >> we see Greek characters change their identity when we run some files >> through a PDF -> PostScript -> PDF cycle (all using software from the >> same software house, Adobe). The reason has to do with differences in >> Windows and Mac encodings, and the failure of the Acrobat software to >> track and maintain the character mappings through such a cycle. >> >> Well, I'll stop here, because in spite of my best intentions I don't >> think I'm moving the debate along very much, and I apologise if >> everything here has already been so obvious as not to need saying. >> >> I'll defer further comment until I've learned if there are already >> standard text-encoding identifiers and transcoders. 
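As a rough illustration of the kind of standalone check being asked about, the sketch below mirrors the chemical/x-cif magic entry quoted above (the string #\#CIF_1.1 at offset 0, followed by a tab, LF or CR byte) and extends it to the #\#CIF_2.0 magic code and to files that start with a byte-order mark. The function name `sniff_cif` and its return value are hypothetical, chosen only for this example. It recognises only BOM-prefixed Unicode encodings and otherwise assumes an ASCII-compatible first line, which is a simplification and not something any CIF specification prescribes.

```python
import codecs

# BOMs are checked longest-first so that a UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

_MAGIC = ("#\\#CIF_1.1", "#\\#CIF_2.0")
_AFTER_MAGIC = " \t\n\r"  # space, tab, LF, CR


def sniff_cif(data: bytes):
    """Return (bom_encoding or None, cif_version or None) for the start of a file."""
    encoding = None
    for bom, name in _BOMS:
        if data.startswith(bom):
            encoding = name
            data = data[len(bom):]
            break
    # Assumption: with no BOM, the first line is in an ASCII-compatible encoding.
    head = data[:64].decode(encoding or "ascii", errors="replace")
    for magic in _MAGIC:
        following = head[len(magic):len(magic) + 1]
        if head.startswith(magic) and following and following in _AFTER_MAGIC:
            return encoding, magic[-3:]
    return encoding, None
```

A caller would pass the first few dozen bytes of a file, for example `sniff_cif(open(path, "rb").read(64))`. A result of `(None, None)` means only that neither a BOM nor a CIF magic code was found; it says nothing about which local encoding was actually used.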
>> >> Regards >> Brian >> _________________________________________________________________________ >> Brian McMahon tel: +44 1244 342878 >> Research and Development Officer fax: +44 1244 314888 >> International Union of Crystallography e-mail: bm at iucr.org >> 5 Abbey Square, Chester CH1 2HU, England >> >> >> On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote: >>> One, hopefully relevant, aside -- ascii files are not as >>> unambiguous as one might think. Depending on what localization >>> one has on one's computer, the code point 0x5c (one of the >>> characters in the first 127) will be shown as a reverse >>> solidus, a yen currency symbol or a won currency symbol. This >>> is a holdover from the days of national variants of the ISO >>> character set, and shows no signs of going away any time soon. >>> >>> This is _not_ the only such case, but it is one that impacts >>> most programming languages, including dREL, and existing CIF >>> files, including the PDB's mmCIF files. >>> ===================================================== >>> Herbert J. Bernstein, Professor of Computer Science >>> Dowling College, Kramer Science Center, KSC 121 >>> Idle Hour Blvd, Oakdale, NY, 11769 >>> >>> +1-631-244-3035 >>> yaya at dowling.edu >>> ===================================================== >>> >>> On Tue, 14 Sep 2010, Herbert J. Bernstein wrote: >>> >>>> Dear Colleagues, >>>> >>>> To avoid any misunderstandings, rather than worrying about how >>>> we got to where we are, let us each just state a clear position. >>>> Here is mine: >>>> >>>>
I favor CIF2 being stated in terms of UTF-8 for clarity, but >>>> not specifying any particular _mandatory_ encoding of a CIF2 file >>>> as long as there is a clearly agreed mechanism between the >>>> creator and consumer of a given CIF2 file as to how to faithfully >>>> transform the file between creator's and the consumer's encodings. >>>> >>>> I favor UTF-8 being the default encoding that any CIF2 creator >>>> should feel free to use without having to establish any prior >>>> agreement with consumers, and that all consumers should try >>>> to make arrangements to be able to read, either directly or >>>> via some conversion utility or service. If the consumers don't >>>> make such arrangements then there may be CIF2 files that they >>>> will not be able to read. If a producer creates a CIF2 in any >>>> encoding other than UTF8 then there may be consumers who have >>>> difficulty reading that CIF2. >>>> >>>> I favor the IUCr taking responsibility for collecting and >>>> disseminating information on particularly useful ways to go >>>> to and from UTF8 and/or other popular encodings. >>>> >>>> Regards, >>>> Herbert >>>> ===================================================== >>>> Herbert J. Bernstein, Professor of Computer Science >>>> Dowling College, Kramer Science Center, KSC 121 >>>> Idle Hour Blvd, Oakdale, NY, 11769 >>>> >>>> +1-631-244-3035 >>>> yaya at dowling.edu >>>> ===================================================== >>>> >>>> On Tue, 14 Sep 2010, SIMON WESTRIP wrote: >>>> >>>>> I sense some common ground here with my previous post. >>>>> >>>>> The UTF8/16 pair could possibly be extended to any unicode encoding that >>>>> is >>>>> unambiguously/inherently identifiable? >>>>> The 'local' encodings then encompass everything else? >>>>> >>>>> However, I think we've yet to agree that anything but UTF8 is to be >>>>> allowed >>>>> at all.
We have a draft spec that stipulates UTF8, >>>>> but I infer from this thread that there is scope to relax that >>>>> restriction. >>>>> The views seem to range from at least 'leaving the door open' >>>>> in recognition of the variety of encodings available, to advocating that >>>>> the encoding should not be part of the specification at all, and it will >>>>> be >>>>> down to developers to accommodate/influence user practice. I'm in favour >>>>> of >>>>> a default encoding or maybe any encoding that is inherently >>>>> identifiable, >>>>> and providing a means to declare other encodings (however untrustworthy >>>>> the >>>>> declaration may be, it would at least be available to conscientious >>>>> users/developers), all documented in the spec. >>>>> >>>>> Please forgive me if this summary is off the mark; my conclusion is that >>>>> there's a willingness to accommodate multiple encodings >>>>> in this (albeit very small) group. Given that we are starting from the >>>>> position of having a single encoding (agreed upon after much earlier >>>>> debate), I cannot see us performing a complete U-turn to allow any >>>>> (potentially unrecognizable) encoding as in CIF1, i.e. without some >>>>> specification of a canonical encoding or mechanisms to identify/declare >>>>> the >>>>> encoding. On the other hand, I hope to see >>>>> a revised spec that isnt UTF8 only. >>>>> >>>>> To get to the point - is there any hope of reaching a compromise? >>>>> >>>>> Cheers >>>>> >>>>> Simon >>>>> >>>>> >>>>> >>>>> ____________________________________________________________________________ >>>>> From: "Bollinger, John C" >>>>> To: Group for discussing encoding and content validation schemes for >>>>> CIF2 >>>>> >>>>> Sent: Monday, 13 September, 2010 19:52:26 >>>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. >>>>> .. >>>>> . >>>>> >>>>> >>>>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote: >>>>> [...] 
>>>>>> To my mind, the encoding of plain CIF files remains an open issue.? I >>>>>> do not view the mechanisms for managing file encoding that are >>>>>> provided by current OSs to be sufficiently robust, widespread or >>>>>> consistent that we can rely on developers or text editors respecting >>>>>> them [...]. >>>>> >>>>> I agree that the encoding of plain CIF files remains an open issue. >>>>> >>>>> I confess I find your concerns there somewhat vague, especially to the >>>>> extent that they apply within the confines of a single machine.? Do your >>>>> concerns extend to that level?? If so, can you provide an example or two >>>>> of >>>>> what you fear might go wrong in that context? >>>>> >>>>> As Herb recently wrote, "Multiple encodings are a fact of life when >>>>> working >>>>> with text."? CIF2 looks like text, it feels like text, and despite some >>>>> exotic spice, it tastes like text -- even in UTF-8 only form.? We cannot >>>>> pretend that we're dealing with anything other than text.? We need to >>>>> accept, therefore, that no matter what we do, authors and programmers >>>>> will >>>>> need to account for multiple encodings, one way or another.? The format >>>>> specification cannot relieve either group of that responsibility. >>>>> >>>>> That doesn't necessarily mean, however, that CIF must follow the XML >>>>> model >>>>> of being self-defining with regard to text encoding.? Given CIF's >>>>> various >>>>> uses, we gain little of practical value in this area by defining CIF2 as >>>>> UTF-8 only, and perhaps equally little by defining required decorations >>>>> for >>>>> expressing random encodings.? Moreover, the best reading of CIF1 is that >>>>> it >>>>> relies on the *local* text conventions, whatever they may be, which is >>>>> quite >>>>> a different thing than handling all text conventions that might >>>>> conceivably >>>>> be employed. 
>>>>> >>>>> With that being the case, I don't think it needful for CIF2 in any given >>>>> environment to endorse foreign encoding conventions other than UTF-8. >>>>> CIF2 >>>>> reasonably could endorse UTF-16 as well, though, as that cannot be >>>>> confused >>>>> with any ASCII-compatible encoding.? Allowing UTF-16 would open up >>>>> useful >>>>> possibilities both for imgCIF and for future uses not yet conceived. >>>>> Additionally, since CIF is text I still think it important for CIF2 to >>>>> endorse the default text conventions of its operating environment. >>>>> >>>>> Could we agree on those three as allowed encodings?? Consider, given >>>>> that >>>>> combination of supported alternatives and no extra support from the >>>>> spec, >>>>> how might various parties deal with the unavoidable encoding issue. >>>>> Here >>>>> are some of the more reasonable alternatives I see: >>>>> >>>>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and >>>>> PDB: >>>>> >>>>> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.? The >>>>> responsibility to perform any needed transcoding is on the other party. >>>>> This is just as it might be with UTF-8-only. >>>>> >>>>> Option b) in addition to supporting UTF-8 and/or UTF-16, support >>>>> other encodings by allowing users to explicitly specify them as part of >>>>> the >>>>> submission/retrieval process.? The processor / repository would either >>>>> ensure the CIF is properly labeled, or, better, transcode it to >>>>> UTF-8[/16]. >>>>> This also is just as it might be with UTF-8 only. >>>>> >>>>> 2. Programs and Libraries: >>>>> >>>>> Option a) On input, detect encoding by checking first for UTF-16, >>>>> assuming UTF-8 if not UTF-16, and falling back to default text >>>>> conventions >>>>> if a UTF-8 decoding error is encountered.? On output, encode as directed >>>>> by >>>>> the user (among the two/three options), defaulting to the input encoding >>>>> when that is available and feasible.? 
These would be desirable behaviors
>>>>> even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment,
>>>>> but they do exceed UTF-8-only requirements.
>>>>>
>>>>> Option b) Require input and produce output according to a fixed set
>>>>> of conventions (whether local text conventions or UTF-8/16). The program
>>>>> user is responsible for any needed transcoding. This would be sufficient
>>>>> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those
>>>>> differ, however, in which text conventions would be assumed.
>>>>>
>>>>> 3. Users/Authors:
>>>>> 3.1. Creating / editing CIFs
>>>>> No change from current practice is needed, but users might choose to
>>>>> store CIFs in UTF-8[/16] form. This is just as it would likely be under
>>>>> UTF-8 only.
>>>>>
>>>>> 3.2. Transferring CIFs
>>>>> Unless an alternative agreement on encoding can be reached by some
>>>>> means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This
>>>>> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.
>>>>>
>>>>> 3.3. Receiving CIFs
>>>>> The receiver may reasonably demand that the CIF be provided in
>>>>> UTF-8[/16] form. He should *expect* that form unless some alternative
>>>>> agreement is established. Any desired transcoding from UTF-8[/16] to an
>>>>> alternative encoding is the user's responsibility. Again, this is not
>>>>> significantly different from the UTF-8 only case.
>>>>>
>>>>>
>>>>> A driving force in many of those cases is the well-understood (especially
>>>>> here!) fact that different systems cannot be relied upon to share text
>>>>> conventions, thus leaving UTF-8[/16] as the only available general-purpose
>>>>> medium of exchange. At the same time, local conventions are not forbidden
>>>>> from use where they can be relied upon -- most notably, within the same
>>>>> computer.
Even if end-users, as a group, do not appreciate those details,
>>>>> we can ensure via the spec that CIF2 implementers do. That's sufficient.
>>>>>
>>>>> So, if pretty much all my expected behavior under UTF-8[/16]+local is the
>>>>> same as it would be under UTF-8-only, then why prefer the former? Because
>>>>> under UTF-8[/16]+local, all the behavior described is conformant to the
>>>>> spec, whereas under UTF-8 only, a significant proportion is not. If the
>>>>> standard adequately covers these behaviors then we can expect more uniform
>>>>> support. Moreover, this bears directly on community acceptance of the
>>>>> spec. If flouting the spec with respect to encoding becomes common, then
>>>>> the spec will have failed, at least in that area. Having failed in one
>>>>> area, it is more likely to fail in others.
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> John
>>>>> --
>>>>> John C. Bollinger, Ph.D.
>>>>> Department of Structural Biology
>>>>> St. Jude Children's Research Hospital
>>>>>
>>>>> Email Disclaimer: www.stjude.org/emaildisclaimer
>> _______________________________________________
>> cif2-encoding mailing list
>> cif2-encoding at iucr.org
>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>
>>
>
>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From jamesrhester at gmail.com Thu Sep 23 01:22:47 2010
From: jamesrhester at gmail.com (James Hester)
Date: Thu, 23 Sep 2010 10:22:47 +1000
Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .
In-Reply-To: <20100916131753.GA18504@emerald.iucr.org> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <20100915123927.GA26246@emerald.iucr.org> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local> <20100916131753.GA18504@emerald.iucr.org> Message-ID: I haven't commented directly on this proposal, so here are my comments. 1. In spirit, this proposal reads as a supplement to Herbert's proposal to do things as for CIF1. This is reinforced by the dropping of the mandatory CIF2 header. 2. I am not overly concerned about not requiring the CIF2 header. Attempts to read a CIF1/2 file using both syntaxes can be attempted, and I can't imagine any pathological case that would give differing parse results in the case that both CIF2 and CIF1 syntaxes gave successful parses. This outcome is largely because we have retained whitespace separation between tokens, something we hadn't yet decided when we decided on the CIF2 header. 3. This proposal does not a priori do anything to resolve the concrete problem faced by programmers as to what encodings to expect in input CIF files. With the exception of UTF8 and UTF16, they must rely on unreliable external information. 4. There is similarly no guarantee that the encoding tag corresponds to the real encoding, and the only way to confirm the tag is to run it past the author again. While this is easy for a publisher accepting manuscripts to do, it is not a generally applicable approach. Have I misunderstood something, and we are in fact simply producing a standard for publishing use? 5. 
I believe the Chester office's potential problems can be solved by applying the simple insight that they have access to the original authors: therefore, when a non-UTF8 or UTF16 file is received, they can either try heuristic autodetection with feedback from the author, or simply request the author to sort it out, with a link to a helpful webpage. I therefore do not believe that initial inability to produce UTF8/16 on the part of some authors is sufficient cause to reject the UTF8/16 (+local?) proposal.

6. Regarding a list of preferred encodings in the annexe: each time this list is expanded, all current CIF software becomes unable to read and write files in at least one of the preferred encodings. With time, this list will be of little use, as there is no guarantee that an encoding that is on the list is one that a given software package will understand. Therefore, this list has to be essentially fixed over a reasonably long timescale (say 5 years) and additions would ideally correspond with a bump of the standard version number. You might just as well specify a fixed set in the standard itself. Dividing encoding tags according to journals creates even more confusion and balkanisation.

7. "Reducing the likelihood of uncontrolled chaos". In my opinion, this likelihood is lowest with a fixed set of encodings. Any support for open-slather encoding in the standard will increase the likelihood of chaos. If we end up accepting Brian and Herbert's proposals, of course each CIF-handling entity will be forced to attempt encoding management in ways that Brian suggests, and I'll wager that there will be more headaches for Chester and everyone else dealing with it than if they adopted strategies as described in point (5). Unless the community spontaneously adopts UTF8.

On Thu, Sep 16, 2010 at 11:17 PM, Brian McMahon wrote:
> Thanks to Herbert, John and Simon for responding.
I'm sorry if it
> seems like once again round an endless loop, but your replies have
> helped me to settle on the way I would like to see things move
> forward. For what it's worth:
>
> ***
>
> I favour the specification *recommending* a magic string to begin a
> file: an optional BOM followed by the 11 characters
>
> #\#CIF_2.0
>
> I favour the specification *recommending* that this initial comment
> should be extended with an indication of the character
> encoding where this is not ASCII. I suggest the specification's
> discussion of the form this will take, as well as any other comments
> on character-set encoding, be presented in a distinct section of the
> specification (Part 3, or an Annexe or Appendix).
>
> These are recommendations, not requirements,
>
> 1. to include existing CIF1.0 and CIF1.1 instances as valid CIF input
> streams (whether "decorated" or not);
>
> 2. because you can only ever take this meta-information as
> well-intentioned hints.
>
> ***
>
> I like the idea of a checksum, but I think it's premature to require
> any particular formulation at this revision of the specification.
>
> ***
>
> I favour this new "Part 3" of the specification providing some general
> commentary on the nature of text files and transcoding issues. It
> should present UTF-8 as a "concrete" instantiation, and stipulate a
> suitable tag for incorporation in the "magic number" comment, let us
> say something like . It should explain the importance of
> developers following the "recommendations", and should caution against
> (but not prohibit) gratuitous proliferation of encodings. It should
> identify an additional resource hosted on the COMCIFS web site that
> provides guidance to developers.
>
> Use of the term "concrete" here harks back to the SGML specification.
> SGML is actually a metastandard for document markup languages, and in
> principle permits many different ways of tagging markup.
But in
> describing just one "concrete" example, based on angle brackets, it
> encouraged the universal adoption of such tags right through HTML and
> XML.
>
> ***
>
> John said:
>
>> "Were I setting policy for Acta Crystallographica with respect to CIF2,
>> I would require CIF2 submissions to be encoded in UTF-8 ... If
>> IUCr wishes to be relaxed about _enforcement_ of such a policy in
>> order to better serve authors, then fine, but that's a tricky
>> proposition.
>
> I have some concerns about "enforceability" - an end-user (author) may
> simply not know how to comply with a requirement to supply a document
> in a specified encoding. However, the IUCr Managing Editor would
> accept a policy that required authors whose CIFs we had "difficulty
> in reading" to use a particular tool, namely publCIF.
>
> ***
>
> The "additional resource" I referred to could contain among other
> things:
>
> a list of organisations (IUCr journals, PDB, CCDC, individual synchrotron
> facilities) and their policies on accepting or outputting specific
> character-set encodings;
>
> a list of preferred encoding tags (initially just and perhaps
> , but extended in response to requests from specific
> developers);
>
> best-practice recommendations.
>
> I would prefer these to evolve from community discussions and
> practical requirements, rather than appear to be imposed by fiat of
> COMCIFS or IUCr - so maybe this should be a "cif-developers" rather
> than "COMCIFS" website.
>
> ***
>
> This approach tries to close off the formal specification while
> allowing controlled extensions. Essentially my "additional resource"
> becomes the framework for establishing protocols for conversion
> between different character-set encodings and serializations.
>
> For instance, Herbert replied to my comments on needing a pure ASCII
> representation in-house:
>
>> There is no way to make a "pure ascii version" of a general UTF-8
>> file without adopting some reserved characters strings at the lexical
>> level -- \U... or &#...; or somesuch as used in many other systems,
>> but with such an extension, it is easy.
>
> That's perfectly understood, and I would expect that we (Acta) would
> devise an informal scheme to allow us to do so for whatever purposes we
> needed. We wouldn't expect that to be an integral part of the CIF-2
> standard. On the other hand, if it became clear that other people were
> having difficulty in processing UTF-8 CIFs, we could formalise what we
> had done with a new encoding tag, post that on our cif-developers
> resource:
>
>   Encoding scheme       Details                    Reference
>                         Crystallography Journals   http://........
>                         ASCII-fication of
>                         Unicode characters
>
> and serve CIFs on request with the initial header
>
> #\#CIF_2.0
>
> (I understand that this is different from character-set transcoding
> because it involves additional processing at the lexical level, so it
> may not be an appropriate thing to bundle these together in the same
> way. That's open to later discussion, but my point is that we're
> at least setting up a system allowing the community to exchange
> information about practical representation conversions, and so reduce
> the likelihood of uncontrolled chaos.)
>
>
> Regards
> Brian
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From jamesrhester at gmail.com Thu Sep 23 01:37:48 2010
From: jamesrhester at gmail.com (James Hester)
Date: Thu, 23 Sep 2010 10:37:48 +1000
Subject: [Cif2-encoding] How we wrap this up
Message-ID:

Dear CIF2 encoding participants,

As Herbert has indicated, we are starting to run out of time for resolution of the encoding issue. I believe that we have now explored the various proposals sufficiently to all have a good understanding of the consequences and advantages of each approach. So, after a round of final comments, I propose that we vote on the general scheme that we recommend. We can then flesh out the details of the particular scheme that we have settled on, and take this completed proposal to the DDLm group for their approval, following which we will present the entire CIF2 syntax document to COMCIFS for a formal vote.

The proposals that I believe are still on the table are:

1. Herbert's 'as for CIF1 proposal' recently posted here and to COMCIFS.
2. Herbert's 'as for CIF1 proposal', together with Brian's proposal (if you agree that they are compatible)
2. UTF8-only as in the original draft
3. UTF8 + UTF16
4. UTF8, UTF16 + "local"

I have not included the hashcode proposal as I believe it no longer has any supporters.

We would need to conduct a preferential vote. I stress that this is purely to determine the recommendation of this working group, and is not in any way binding on COMCIFS.

James.
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From yaya at bernstein-plus-sons.com Thu Sep 23 03:39:23 2010
From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Wed, 22 Sep 2010 22:39:23 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: Message-ID: Please characterize my proposal as 'as for CIF1 proposal with UTF8 in place of ASCII' ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Thu, 23 Sep 2010, James Hester wrote: > Dear CIF2 encoding participants, > > As Herbert has indicated, we are starting to run out of time for > resolution of the encoding issue. I believe that we have now explored > the various proposals sufficiently to all have a good understanding of > the consequences and advantages of each approach. So, after a round > of final comments, I propose that we vote on the general scheme that > we recommend. We can then flesh out the details of the particular > scheme that we have settled on, and take this completed proposal to > the DDLm group for their approval, following which we will present the > entire CIF2 syntax document to COMCIFS for a formal vote. > > The proposals that I believe are still on the table are: > > 1. Herbert's 'as for CIF1 proposal' recently posted here and to COMCIFS. > 2. Herbert's 'as for CIF1 proposal', together with Brian's proposal > (if you agree that they are compatible) > 2. UTF8-only as in the original draft > 3. UTF8 + UTF16 > 4. UTF8, UTF16 + "local" > > I have not included the hashcode proposal as I believe it no longer > has any supporters. > > We would need to conduct a preferential vote. I stress that this is > purely to determine the recommendation of this working group, and is > not in any way binding on COMCIFS. > > James. 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>

From jamesrhester at gmail.com Thu Sep 23 06:01:39 2010
From: jamesrhester at gmail.com (James Hester)
Date: Thu, 23 Sep 2010 15:01:39 +1000
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References:
Message-ID:

Indeed, point taken.

On Thu, Sep 23, 2010 at 12:39 PM, Herbert J. Bernstein wrote:
> Please characterize my proposal as
>
> 'as for CIF1 proposal with UTF8 in place of ASCII'
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya at dowling.edu
> =====================================================
>
> On Thu, 23 Sep 2010, James Hester wrote:
>
>> Dear CIF2 encoding participants,
>>
>> As Herbert has indicated, we are starting to run out of time for
>> resolution of the encoding issue. I believe that we have now explored
>> the various proposals sufficiently to all have a good understanding of
>> the consequences and advantages of each approach. So, after a round
>> of final comments, I propose that we vote on the general scheme that
>> we recommend. We can then flesh out the details of the particular
>> scheme that we have settled on, and take this completed proposal to
>> the DDLm group for their approval, following which we will present the
>> entire CIF2 syntax document to COMCIFS for a formal vote.
>>
>> The proposals that I believe are still on the table are:
>>
>> 1. Herbert's 'as for CIF1 proposal' recently posted here and to COMCIFS.
>> 2. Herbert's 'as for CIF1 proposal', together with Brian's proposal
>> (if you agree that they are compatible)
>> 2.
UTF8-only as in the original draft
>> 3. UTF8 + UTF16
>> 4. UTF8, UTF16 + "local"
>>
>> I have not included the hashcode proposal as I believe it no longer
>> has any supporters.
>>
>> We would need to conduct a preferential vote. I stress that this is
>> purely to determine the recommendation of this working group, and is
>> not in any way binding on COMCIFS.
>>
>> James.
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> cif2-encoding mailing list
>> cif2-encoding at iucr.org
>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From simonwestrip at btinternet.com Thu Sep 23 11:45:37 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Thu, 23 Sep 2010 03:45:37 -0700 (PDT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References:
Message-ID: <63870.31508.qm@web87006.mail.ird.yahoo.com>

OK, final comments before this is wrapped up (hopefully):

1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently posted here and to COMCIFS.
2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together with Brian's *recommendations*
3. UTF8-only as in the original draft
4. UTF8 + UTF16
5. UTF8, UTF16 + "local"

These can be broken down to:

'any encoding' (1, 2, and 5)

'specified encoding' (3 and 4)

Note I put 5 in the 'any encoding' category as I think 'local' could be interpreted as any encoding.

The 'any encoding' approach is to me unsatisfactory when considering that CIF is a data-exchange format and should be specified in terms that allow the consumer to know exactly what to expect (i.e. no uncertainty in encoding).

'Specified encodings' can be seen as restrictive, especially if there is only one.
A list of specified encodings could be seen as inflexible and perhaps arbitrary (e.g. why isn't UTF32 on the list...). If encoding is to be specified, it could be in terms of UTF8 + any Unicode encoding that is inherently identifiable (which in reality boils down to the UTF family).

In either case, a degree of work will be required to accommodate user practice and the legacy of CIF1.

If the 'any encoding' approach is taken, I believe there should be a wealth of supporting material for both users and developers to encourage the use of a default encoding (i.e. UTF8). Hence my recent support for something along the lines of (2) above. This approach avoids mandating some of the less-satisfactory schemes we have been discussing (e.g. declaration of encoding), but at least makes them available to conscientious developers.

Equally, if CIF2 adopts 'specified encodings', there should be a wealth of supporting material for both users and developers to enable transcoding.

The pedant in me would like to see 'specified encodings' (preferably UTF8 default + any inherently identifiable Unicode encoding), but if the 'any encoding' approach is to be taken, I think it has to be described as Herbert proposes, with any schemes for identifying the encoding left out of the 'specification' (let the specification reflect the uncertainty that is the encoding of a CIF :-)

Cheers

Simon

________________________________
From: James Hester
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Thursday, 23 September, 2010 1:37:48
Subject: [Cif2-encoding] How we wrap this up

Dear CIF2 encoding participants,

As Herbert has indicated, we are starting to run out of time for resolution of the encoding issue. I believe that we have now explored the various proposals sufficiently to all have a good understanding of the consequences and advantages of each approach. So, after a round of final comments, I propose that we vote on the general scheme that we recommend.
We can then flesh out the details of the particular scheme that we have settled on, and take this completed proposal to the DDLm group for their approval, following which we will present the entire CIF2 syntax document to COMCIFS for a formal vote. The proposals that I believe are still on the table are: 1. Herbert's 'as for CIF1 proposal' recently posted here and to COMCIFS. 2. Herbert's 'as for CIF1 proposal', together with Brian's proposal (if you agree that they are compatible) 2. UTF8-only as in the original draft 3. UTF8 + UTF16 4. UTF8, UTF16 + "local" I have not included the hashcode proposal as I believe it no longer has any supporters. We would need to conduct a preferential vote. I stress that this is purely to determine the recommendation of this working group, and is not in any way binding on COMCIFS. James. -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 _______________________________________________ cif2-encoding mailing list cif2-encoding at iucr.org http://scripts.iucr.org/mailman/listinfo/cif2-encoding -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100923/8d365d08/attachment-0001.html From John.Bollinger at STJUDE.ORG Thu Sep 23 15:02:25 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Thu, 23 Sep 2010 09:02:25 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <63870.31508.qm@web87006.mail.ird.yahoo.com> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> On Thursday, September 23, 2010 5:46 AM, SIMON WESTRIP wrote: >1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently posted here and to COMCIFS. >2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together with Brian's *recommendations* >3. UTF8-only as in the original draft >4. UTF8 + UTF16 >5. 
UTF8, UTF16 + "local"
>
>These can be broken down to:
>
>'any encoding' (1, 2, and 5)
>
>'specified encoding' (3 and 4)
>
>Note I put 5 in the 'any encoding' category as I think 'local' could be interpreted as any encoding.

I agree that 'local' could be interpreted as "any encoding", but I choose to view it as "context-dependent". Thus a file that is CIF-conformant on one computer might not be CIF-conformant on another. Some will find that unsatisfactory. In my view, however, it is the best interpretation of CIF1's provisions; its purpose is thus to ensure that *all* well-formed CIF1 files are also well-formed CIF2 files (a context-dependent question).

Lest I appear to overstate the case, I acknowledge that the UTF8-only and UTF-8 + UTF-16 proposals would have the result that a large majority of well-formed CIF1 files are also well-formed CIF2 files. The variations of Herb's proposal probably also make all well-formed CIF1 files well-formed CIF2 files, but I disfavor them on different grounds (mostly that they are too open to differing interpretations).

[...]

>In either case, a degree of work will be required to accommodate user practice and the legacy of CIF1.

I think the entire question reduces to which accommodations for the CIF1 legacy are assured by CIF2 vs. which will constitute non-standard extensions. I don't think that individual responses, from Chester for example, are likely to depend much on which option is adopted, but I do think the overall consistency of responses will be affected. Thus I favor precision of the specification and coverage of the likely uses, in hope of achieving the greatest consistency of response.

I doubt this has swayed anyone's opinion, so please consider it an advance explanation for my upcoming vote (inasmuch as I rely on James's previous assurance that voting rights in this context are not restricted to COMCIFS members).

Best Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St.
Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From jamesrhester at gmail.com Thu Sep 23 15:07:52 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 24 Sep 2010 00:07:52 +1000 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: In this email I try to pin down what supporting local encoding might imply. I think it is fair to say that John is advocating including "local" encoding in the list of CIF2 encodings because: (i) this will be the default encoding assumed by text editors (ii) there will be a significant tendency among programmers not to specify encoding when reading/writing CIF files I was surprised to read that we were worried about programmers not getting the message, having assumed up until now that we were concerned only about ordinary users not coming to grips with non-local encoding. Anyway, let's put ourselves in the programmer's shoes on a system for which local encoding is not UTF8/16: Programmer A wants to support UTF8, UTF16 and local. When reading a CIF file, she *must* first try UTF8, then UTF16, and only then local, because a UTF8 file will most probably read in without error as a file in local encoding. However, this programmer is not one of those identified in (ii), because she is actively setting UTF8 and 16 as input encodings. 
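Programmer A's read order can be sketched in Python. This is only an illustration under stated assumptions, not something proposed on this list: the function name `decode_cif_bytes` is hypothetical, `latin-1` is an arbitrary stand-in for whatever the "local" encoding happens to be, and UTF-16 is accepted only when a byte-order mark is present, which is a simplification chosen for the sketch.

```python
import codecs


def decode_cif_bytes(data: bytes, local_encoding: str = "latin-1"):
    """Decode raw CIF bytes, trying UTF-8, then UTF-16, then the local encoding.

    Returns a (text, encoding_name) tuple. `local_encoding` is an assumed
    stand-in for the platform default, not a value taken from any standard.
    """
    # 1. Strict UTF-8 first; "utf-8-sig" also strips an optional leading BOM.
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 2. UTF-16, accepted in this sketch only when a byte-order mark announces it.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            return data.decode("utf-16"), "utf-16"
        except UnicodeDecodeError:
            pass
    # 3. Last resort: the local encoding.
    return data.decode(local_encoding), local_encoding


# A line containing one non-ASCII character (the angstrom sign, U+00C5).
sample = "_cell_length_a 5.43  # in \u00c5\n"
for enc in ("utf-8", "utf-16", "latin-1"):
    text, detected = decode_cif_bytes(sample.encode(enc))
    print(enc, "->", detected, text == sample)

# Reading the UTF-8 bytes directly as latin-1 raises no error, but the
# resulting text differs from the original.
misread = sample.encode("utf-8").decode("latin-1")
print(misread == sample)
```

The final two lines illustrate why the order matters: every byte sequence is valid latin-1, so a reader that tried the local encoding first would accept a UTF-8 file without complaint and return altered text.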
Programmer B wants no business with setting encodings, and so supports only reading/writing local encoding. His program will unfortunately also read in UTF8 files assuming local encoding. The program thus behaves correctly only if the user always remembers to either produce or transcode CIF files to local encoding, assuming that the user has read the documentation for the program sufficiently to know that this is even an issue. As an added bonus, this user has to know what the local encoding is, as the programmer is presumably not making any effort to find out and communicate it (as this would actually be more work than just specifying the damn encoding already). I believe that this is an unworkable situation, and not one that we should facilitate. My point being that reason (ii) (lazy programmers) is not a good justification for keeping local encoding on the list of acceptable encodings. I have never seen reason (i) as sufficient justification. In any case, I do not think that there will be many Programmer Bs. Note the following points: (a) Dealing with encoding in most common languages is simple. Note that even modern Fortran can handle UTF8 - see the code snippets at http://coding.derkeiler.com/Archive/Fortran/comp.lang.fortran/2008-08/msg00395.html and http://gcc.gnu.org/onlinedocs/gfortran/SELECTED_005fCHAR_005fKIND.html (b) If UTF8 is to be supported for reading, files have to be opened explicitly in UTF8, so the programmer is already explicitly specifying encoding (c) The audience of programmers for CIF is (unfortunately) rather small. They can all be reached very easily for active education on how and why UTF8 encoding is specified. And the lack of local encoding can be managed simply: UTF8 encoding and "local" will almost always coincide in the ASCII space, so the absence of local encoding in the acceptable encoding list is invisible on day one of CIF2. 
Introduction of non-ASCII characters into CIFs can then be managed from Chester through gradual introduction of non-ASCII dataname values, first in non-critical places. Chester can monitor the proportion of incorrectly encoded files received and calibrate a response.

However you all assess the above points, I think it is clear that John and I will have to agree to disagree on the value of local encodings. The root cause, I think, is differing perceptions of programmer responsiveness to the standard. I appreciate John's efforts to find a compromise, but I believe we have exhausted our avenues in this direction.

Well, I'm ready to vote. Would anybody else like to make any final points before we call for a vote?

James.

On Sat, Sep 18, 2010 at 1:33 AM, Bollinger, John C wrote:
[...]
> Unfortunately, that train has long since left us behind on the platform. New standard notwithstanding, I don't see an opportunity to
> effect an abrupt shift in program and user behavior -- specifically, the behavior of using default text conventions implicitly and
> routinely. If we formally require UTF-8/16, it can only be with the understanding that many users and programs will ignore that
> requirement altogether. I don't find that at all appealing or useful, and I do not support it.
>
> I think we will achieve more consistent CIF2 software, and we will better influence programmers and users, by standardizing the
> use of default text conventions with CIF2. I would be content to deprecate such use. I would favor non-normative commentary in
> the spec that explains the issue and discourages reliance on default text encoding. I would also favor publicizing resources
> describing how to convert local text to UTF-8 (or -16), and creating such resources if necessary. I want to see people using
> UTF-8/16 for their CIFs, but I don't want to cut them off, standards-wise, when they don't.
>
> [...]
>>In fact, it is rather difficult to
>>find any instructions as to how to determine the platform's "local"
>>encoding.
>
> The point of default conventions is that you don't have to determine what they are, you just use them. In fact, in some
> programming environments, there is no easy way to do otherwise. For example, to the best of my knowledge, there is no way to
> write a standard-conformant Fortran 95 program that portably reads text from a file in anything but the default encoding.

If you don't know what your input encoding is, how do you transcode to UTF8?

[...]
> The mechanism for reliable transmission is to transcode, if necessary, to UTF-8/16, and transmit the result. This is exactly the
> same mechanism that would be available for reliable transmission if UTF-8 were the only standardized encoding (under which case
> I include transmission of non-UTF-8 almost-CIFs). The mechanism is the same for reliably sharing CIFs among environments
> where compatibility of default conventions is uncertain. I see no reason to believe that users' decisions whether to employ that
> mechanism will be driven by anything other than practical considerations, the standard's position notwithstanding. I would expect
> some programmers to be more influenced by the standard, but in the end they are faced with the same practical considerations.
>
>> And so on. Frankly, I still see no
>>merit in including local encodings in CIF2 at all.
>
> I value standardizing behavior that we all (I think) expect will be common, even though that behavior isn't ideal. In that way I expect
> to support well-defined and consistent responses to that behavior (mainly in software). Given that I have said so before without
> persuading you, we will have to agree to disagree here.
> >> but instead will attempt >>to mitigate the damage by supporting the following moves: >> >>(i) compliant CIF processors are *not* required to accept files in >>local encoding; > > It is inconsistent to allow local text conventions in the file format definition, but to permit conformant processors to reject them. >?Additionally, I oppose inclusion of any explicit requirements on CIF processors, preferring instead to rely on the format > specification to define what conformant processors must do. ?I could, however, accept defining separate flavors of CIF > distinguished by these encoding distinctions, so that programs could conform to one, the other, or both. ?I'm not sure I like that, > but I think I could agree to it if it helps us wrap this up. > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From simonwestrip at btinternet.com Thu Sep 23 21:13:53 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Thu, 23 Sep 2010 20:13:53 +0000 (GMT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <80062.82001.qm@web87012.mail.ird.yahoo.com> Faced with the options: 1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently posted here and to COMCIFS. 2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together with Brian's *recommendations* 3. UTF8-only as in the original draft 4. UTF8 + UTF16 5. UTF8, UTF16 + "local" I have to vote for (4). When it comes down to it, I believe that the specification of a 'standard' should not be based on uncertainty, and as 'any encoding' presents uncertainty, it should not be in the standard. 
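The uncertainty referred to here can be sketched in a few lines of Python. This is an illustration only and not part of the original message: the sample strings, the helper name `decodes_as`, and the choice of Latin-1 as the "local" encoding are all assumptions made for the example. It shows that pure-ASCII content is byte-for-byte identical in UTF-8, that a strict UTF-8 reader rejects typical Latin-1 bytes, and that a Latin-1 reader silently accepts UTF-8 bytes and produces the wrong text.

```python
# Illustrative sketch: sample values and encodings are assumptions,
# not taken from the mailing-list thread.

def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)  # error handling defaults to 'strict'
        return True
    except UnicodeDecodeError:
        return False

# Pure-ASCII content has the same bytes in ASCII and in UTF-8.
line = "_cell_length_a 5.431\n"
ascii_bytes = line.encode("ascii")
same_in_utf8 = ascii_bytes == line.encode("utf-8")

# One non-ASCII value written under two different conventions.
name = "M\u00fcller"                     # u with diaeresis
latin1_bytes = name.encode("latin-1")    # single byte 0xFC for the u-umlaut
utf8_bytes = name.encode("utf-8")        # two bytes 0xC3 0xBC

# A strict UTF-8 decoder rejects the Latin-1 bytes outright ...
latin1_is_utf8 = decodes_as(latin1_bytes, "utf-8")

# ... whereas a Latin-1 decoder accepts the UTF-8 bytes without complaint
# and yields different (garbled) text.
utf8_read_as_latin1 = utf8_bytes.decode("latin-1")

print(same_in_utf8, latin1_is_utf8, utf8_read_as_latin1 == name)
```

The asymmetry is the practical point: a mislabelled file is usually detectable when the expected encoding is UTF-8, but generally not when the expected encoding is an arbitrary single-byte local one.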
I might be accused of changing my position (I have recently expressed support for flexibility and even a qualified acceptance of the 'as for CIF1 proposal with UTF8 in place of ASCII'), but part of the value of these discussions is to question your own views in the light of others' perspectives. Indeed, I have found these discussions extremely informative and am now in a far better position to handle the realities of introducing non-ASCII CIFs, whatever the final COMCIFS decision. Cheers Simon ________________________________ From: "Bollinger, John C" To: Group for discussing encoding and content validation schemes for CIF2 Sent: Thursday, 23 September, 2010 15:02:25 Subject: Re: [Cif2-encoding] How we wrap this up On Thursday, September 23, 2010 5:46 AM, SIMON WESTRIP wrote: >1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently posted >here and to COMCIFS. >2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together with >Brian's *recommendations* >3. UTF8-only as in the original draft >4. UTF8 + UTF16 >5. UTF8, UTF16 + "local" > >These can be broken down to: > >'any encoding' (1, 2, and 5) > >'specified encoding' (3 and 4) > >Note I put 5 in the 'any encoding' category as I think 'local' could be >interpreted as any encoding. I agree that 'local' could be interpreted as "any encoding", but I choose to view it as "context-dependent". Thus a file that is CIF-conformant on one computer might not be CIF-conformant on another. Some will find that unsatisfactory. In my view, however, it is the best interpretation of CIF1's provisions; its purpose is thus to ensure that *all* well-formed CIF1 files are also well-formed CIF2 files (a context-dependent question). Lest I appear to overstate the case, I acknowledge that the UTF8-only and UTF-8 + UTF-16 proposals would have the result that a large majority of well-formed CIF1 files are also well-formed CIF2 files.
The variations of Herb's proposal probably also make all well-formed CIF1 files well-formed CIF2 files, but I disfavor them on different grounds (mostly that they are too open to differing interpretations). [...] >In either case, a degree of work will be required to accommodate user practice >and the legacy of CIF1. I think the entire question reduces to which accommodations for the CIF1 legacy are assured by CIF2 vs. which will constitute non-standard extensions. I don't think that individual responses, from Chester for example, are likely to depend much on which option is adopted, but I do think the overall consistency of responses will be affected. Thus I favor precision of the specification and coverage of the likely uses, in hope of achieving the greatest consistency of response. I doubt this has swayed anyone's opinion, so please consider it an advance explanation for my upcoming vote (inasmuch as I rely on James's previous assurance that voting rights in this context are not restricted to COMCIFS members). Best Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer _______________________________________________ cif2-encoding mailing list cif2-encoding at iucr.org http://scripts.iucr.org/mailman/listinfo/cif2-encoding -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100923/39f5e2e2/attachment-0001.html From yaya at bernstein-plus-sons.com Thu Sep 23 21:31:24 2010 From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Thu, 23 Sep 2010 16:31:24 -0400 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <80062.82001.qm@web87012.mail.ird.yahoo.com> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> Message-ID: Votes: In terms of the requested preference voting, I vote in declining order of preference 1, then 2, then (big gap) 5, then 4, then 3. On absolute voting up or down in COMCIFS, I will accept 1 or 2, but will lobby against and vote strongly against 3, 4, and 5. Explanation: I am not opposed to Brian's recommendations. The only reason I would vote for 1 over 2 is that I fear Brian's recommendation would generate yet more debate over the precise details and I believe we have more than run out of time to get something concrete ready for the IUCr meeting. I am very strongly opposed to 3, 4 and 5 because I believe they will cause confusion and delay in adoption of CIF2, while choices 1 and 2 keep the practices the community and the IUCr have lived with successfully for many years, simply applying them to UTF8 instead of ASCII. People may not understand what they are doing in that mode, but they manage to successfully submit CIFs to the IUCr that way, and we don't have software ready to support anything else. -- Herbert -- ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From simonwestrip at btinternet.com Thu Sep 23 22:03:58 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Thu, 23 Sep 2010 21:03:58 +0000 (GMT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> Message-ID: <162941.37460.qm@web87004.mail.ird.yahoo.com> Just because I'm still at my desk - and despite the fact that I told myself I would not contribute further beyond my vote - it might be worth mentioning that the IUCr are already experiencing problems related to encoding issues (in their web services), and the occurrence of such problems is most likely to increase when CIFs can contain non-ASCII text. Cheers Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100923/f55fadb5/attachment.html From yaya at bernstein-plus-sons.com Thu Sep 23 22:39:24 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Thu, 23 Sep 2010 17:39:24 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <162941.37460.qm@web87004.mail.ird.yahoo.com> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> Message-ID: Dear Simon, That is precisely the point -- there is a serious and growing problem with encodings.
The strict UTF8 proposal then makes it a universal problem for everybody using CIF, and we do _not_ have a coherent means set up to deal with it. The substitution of UTF8 for ASCII in the CIF1 spec does not, in and of itself, make anything worse for anybody currently receiving 128 character ASCII -- it is identical, and it does not force users working in other systems that the IUCr journals are currently coping with to jump into the boiling water; they can keep doing whatever they are currently doing that is currently working for them and the IUCr. All the journals have to do until something that actually supports not-lower-128-ASCII is ready is to tell people that for the journals they will still have to use Brian's reverse solidus escape codes for anything else -- nothing major changes for most people. If and when there really is a coherent scheme to support more native Unicode code points for journal submission with tested software, then we can do something more. Right now, proposals 3, 4 and 5 will make things worse for large numbers of users and not really make anything better for the IUCr. It is too early in the UTF8 conversion process. ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From simonwestrip at btinternet.com Thu Sep 23 23:30:32 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Thu, 23 Sep 2010 15:30:32 -0700 (PDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> Message-ID: <780727.99055.qm@web87010.mail.ird.yahoo.com> I agree to some extent with what you say, Herbert, but I'm a bit more optimistic (for once) that the IUCr at least will be able to adapt to a 'specified encoding' system relatively quickly, and in the interim certainly not reject non-UTFx CIFs. I'm not convinced that whatever appears in the specification will have any influence on user practice, especially in the non-IUCr world; rather I think the success (or otherwise) of CIF2 will result from the software that implements it (as you suggest). I don't share your pessimism about the potential confusion of specifying UTF8 etc., and certainly don't think that a restricted encoding will be any more confusing than 'any encoding', given, as you say, "people may not understand what they are doing..."
I suppose much of the difference in our views lies in our perception of user interest - I suspect there may even be overlap in this respect - but I'm perhaps less inclined to think that the final specification will have a marked influence on users: "they can keep doing whatever they are currently doing that is currently working for them" Anyway, it's not me you have to convince :-), and it's time I went to bed! Cheers Simon -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100923/e55d3754/attachment.html From John.Bollinger at STJUDE.ORG Thu Sep 23 23:43:43 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Thu, 23 Sep 2010 17:43:43 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDDD@SJMEMXMBS11.stjude.sjcrh.local> Votes: I omit numbers to avoid any possibility of confusion between option numbers and ranks. In order from most preferred to least preferred, UTF-8 + UTF-16 + local UTF-8 + UTF-16 UTF-8 Herbert's text without Brian's recommendations Herbert's text with Brian's recommendations John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Fri Sep 24 02:03:50 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Thu, 23 Sep 2010 21:03:50 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <780727.99055.qm@web87010.mail.ird.yahoo.com> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> Message-ID: I see not point in a final specification that users will ignore, and that will actually punish users who pay attention to it. That is not a useful standard, and very damaging to the CIF brand. We should be promolgating reasonable standards that we expect will in fact be adhered to, not ignored. 
In the present state of lack of software support and clear guidance, all the prescriptive UTF8 recommendations are unhelpful to users who read and pay attention to what the standard says.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

On Thu, 23 Sep 2010, SIMON WESTRIP wrote:

> I agree to some extent with what you say, Herbert, but I'm
> a bit more optimistic (for once) that the IUCr at least will be able to
> adapt to a 'specified encoding' system relatively quickly, and in the
> interim certainly not reject non-UTFx CIFs. I'm not convinced that whatever
> appears in the specification will have any influence on user practice,
> especially in the non-IUCr world; rather I think the success (or otherwise)
> of CIF2 will result from the software that implements it (as you suggest).
> I don't share your pessimism about the potential confusion of specifying
> UTF8 etc., and certainly don't think that a restricted encoding will be
> any more confusing than 'any encoding', given, as you say, "people may not
> understand what they are doing..."
>
> I suppose much of the difference in our views lies in our perception of
> user interest - I suspect there may even be overlap in this respect - but
> I'm perhaps less inclined to think that the final specification will have
> a marked influence on users:
> "they can keep doing whatever they are currently doing that is currently
> working for them"
>
> Anyway, it's not me you have to convince :-), and it's time I went to bed!
>
> Cheers
>
> Simon
>
> ____________________________________________________________________________
> From: Herbert J. Bernstein
> To: Group for discussing encoding and content validation schemes for CIF2
> Sent: Thursday, 23 September, 2010 22:39:24
> Subject: Re: [Cif2-encoding] How we wrap this up
>
> Dear Simon,
>
> That is precisely the point -- there is a serious and growing
> problem with encodings. The strict UTF8 proposal then makes
> it a universal problem for everybody using CIF, and we do _not_
> have a coherent means set up to deal with it. The substitution
> of UTF8 for ASCII in the CIF1 spec does not, in and of itself,
> make anything worse for anybody currently receiving 128 character
> ASCII -- it is identical, and it does not force users working
> in other systems that the IUCr journals are currently coping
> with to jump into the boiling water; they can keep doing whatever
> they are currently doing that is currently working for them
> and the IUCr. All the journals have to do until something that
> actually supports not-lower-128-ASCII is ready is to tell people that for
> the journals they will still have to use Brian's reverse solidus
> escape codes for anything else -- nothing major changes for most
> people. If and when there really is a coherent scheme to support
> more native Unicode code points for journal submission with tested
> software, then we can do something more. Right now, proposals
> 3, 4 and 5 will make things worse for large numbers of users
> and not really make anything better for the IUCr. It is too
> early in the UTF8 conversion process.
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya at dowling.edu
> =====================================================

From jamesrhester at gmail.com Fri Sep 24 07:53:29 2010
From: jamesrhester at gmail.com (James Hester)
Date: Fri, 24 Sep 2010 16:53:29 +1000
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <80062.82001.qm@web87012.mail.ird.yahoo.com>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com>
Message-ID:

Hi Simon: could you please give a list of options in order of preference? Thanks, James.
On Fri, Sep 24, 2010 at 6:13 AM, SIMON WESTRIP wrote:
> Faced with the options:
>
> 1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently
> posted here and to COMCIFS.
> 2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together
> with Brian's *recommendations*
> 3. UTF8-only as in the original draft
> 4. UTF8 + UTF16
> 5. UTF8, UTF16 + "local"
>
> I have to vote for (4).
>
> When it comes down to it, I believe that the specification of a 'standard'
> should not be based on uncertainty,
> and as 'any encoding' presents uncertainty, it should not be in the
> standard.
>
> I might be accused of changing my position (I have recently expressed
> support for flexibility and even a qualified
> acceptance of the 'as for CIF1 proposal with UTF8 in place of ASCII'), but
> part of the value of these discussions
> is to question your own views in the light of others' perspectives. Indeed,
> I have found these discussions
> extremely informative and am now in a far better position to handle the
> realities of introducing non-ASCII CIFs,
> whatever the final COMCIFS decision.
>
> Cheers
>
> Simon

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From simonwestrip at btinternet.com Fri Sep 24 08:05:54 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Fri, 24 Sep 2010 07:05:54 +0000 (GMT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com>
Message-ID: <526633.3484.qm@web87004.mail.ird.yahoo.com>

I do not understand why a user who adheres to a CIF2 standard that specifies an encoding will be 'punished'?

What worries me about a specification that allows any encoding is that users who ignore any recommendations regarding a preferred encoding might experience difficulties when e.g. submitting their CIF to a journal/archive, even though they have adhered to the standard (unjustly punished).

With regard to the lack of CIF2 software support, surely CIF2 in general is of little use to users, not just its encoding requirements. But perhaps you already have CIF2 software that can be dropped into existing workflows save for the fact that it would require modification to work with 'specified encodings'?

________________________________
From: Herbert J. Bernstein
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Friday, 24 September, 2010 2:03:50
Subject: Re: [Cif2-encoding] How we wrap this up

I see no point in a final specification that users will ignore, and that will actually punish users who pay attention to it. That is not a useful standard, and very damaging to the CIF brand. We should be promulgating reasonable standards that we expect will in fact be adhered to, not ignored. In the present state of lack of software support and clear guidance, all the prescriptive UTF8 recommendations are unhelpful to users who read and pay attention to what the standard says.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

From simonwestrip at btinternet.com Fri Sep 24 08:07:29 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Fri, 24 Sep 2010 07:07:29 +0000 (GMT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com>
Message-ID: <528730.52855.qm@web87014.mail.ird.yahoo.com>

4, 3, 2, 1, 5

Cheers

Simon

________________________________
From: James Hester
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Friday, 24 September, 2010 7:53:29
Subject: Re: [Cif2-encoding] How we wrap this up

Hi Simon: could you please give a list of options in order of preference? Thanks, James.

On Fri, Sep 24, 2010 at 6:13 AM, SIMON WESTRIP wrote:
> Faced with the options:
>
> 1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently
> posted here and to COMCIFS.
> 2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together
> with Brian's *recommendations*
> 3. UTF8-only as in the original draft
> 4. UTF8 + UTF16
> 5. UTF8, UTF16 + "local"
>
> I have to vote for (4).
>
> When it comes down to it, I believe that the specification of a 'standard'
> should not be based on uncertainty,
> and as 'any encoding' presents uncertainty, it should not be in the
> standard.
>
> I might be accused of changing my position (I have recently expressed
> support for flexibility and even a qualified
> acceptance of the 'as for CIF1 proposal with UTF8 in place of ASCII'), but
> part of the value of these discussions
> is to question your own views in the light of others' perspectives.
Indeed, > I have found these discussions > extremely informative and am now in a far better position to handle the > realities of introducing non-ASCII CIFs, > whatever the final COMCIFS decision. > > Cheers > > Simon > > > > ________________________________ > From: "Bollinger, John C" > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Thursday, 23 September, 2010 15:02:25 > Subject: Re: [Cif2-encoding] How we wrap this up > > On Thursday, September 23, 2010 5:46 AM, SIMON WESTRIP wrote: > >>1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently >> posted here and to COMCIFS. >>2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together >> with Brian's *recommendations* >>3. UTF8-only as in the original draft >>4. UTF8 + UTF16 >>5. UTF8, UTF16 + "local" >> >>These can be broken down to: >> >>'any encoding' (1, 2, and 5) >> >>'specified encoding' (3 and 4) >> >>Note I put 5 in the 'any encoding' category as I think 'local' could be >> interpreted as any encoding. > > I agree that 'local' could be interpreted as "any encoding", but I choose to > view it as "context-dependent". Thus a file that is CIF-conformant on one > computer might not be CIF-conformant on another. Some will find that > unsatisfactory. In my view, however, it is the best interpretation of > CIF1's provisions; its purpose is thus to ensure that *all* well-formed CIF1 > files are also well-formed CIF2 files (a context-dependent question). Lest > I appear to overstate the case, I acknowledge that the UTF8-only and UTF-8 + > UTF-16 proposals would have the result that a large majority of well-formed > CIF1 files are also well-formed CIF2 files. The variations of Herb's > proposal probably also make all well-formed CIF1 files well-formed CIF2 > files, but I disfavor them on different grounds (mostly that they are too > open to differing interpretations). > > [...]
> >>In either case, a degree of work will be required to accommodate user >> practice and the legacy of CIF1. > > I think the entire question reduces to which accommodations for the CIF1 > legacy are assured by CIF2 vs. which will constitute non-standard > extensions. I don't think that individual responses, from Chester for > example, are likely to depend much on which option is adopted, but I do > think the overall consistency of responses will be affected. Thus I favor > precision of the specification and coverage of the likely uses, in hope of > achieving the greatest consistency of response. > > I doubt this has swayed anyone's opinion, so please consider it an advance > explanation for my upcoming vote (inasmuch as I rely on James's previous > assurance that voting rights in this context are not restricted to COMCIFS > members). > > > Best Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > Email Disclaimer: www.stjude.org/emaildisclaimer > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 _______________________________________________ cif2-encoding mailing list cif2-encoding at iucr.org http://scripts.iucr.org/mailman/listinfo/cif2-encoding -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100924/27a40b6b/attachment.html From bm at iucr.org Fri Sep 24 10:23:59 2010 From: bm at iucr.org (Brian McMahon) Date: Fri, 24 Sep 2010 10:23:59 +0100 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: Message-ID: <20100924092359.GA21694@emerald.iucr.org>

My vote:

Preference   Option
    1        2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together with Brian's *recommendations*
    2        1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently posted here and to COMCIFS.
    3        4. UTF8 + UTF16
    4        3. UTF8-only as in the original draft
    5        5. UTF8, UTF16 + "local"

Rationale: I still feel this argument is at heart a "binary/text" dichotomy, where "binary" implies that one can prescribe specific byte-level representations of every distinct character; "text" implies that you're at the mercy of external libraries and mappings between encoding conventions - and those mappings are not always explicit or easy to identify. I sympathise greatly with James's desire for a prescriptive, "binary" approach, but its corollary is that a CIF application must take full responsibility for expressing any supported extended character set (I mean accented Latin letters, Greek characters, Cyrillic or Chinese alphabets). First off, I don't know how difficult that is technically. I would guess that rather than trying to handle arbitrary keyboard mappings, the natural approach would be to pick from a graphical character grid. (What are the implications for this of glyph rendering - does a CIF editor have to be compiled with its own large font library?) But that's a laborious method of authoring if relatively large amounts of "non-standard" text are involved, and the way that authors would prefer to work, surely, is by copying and pasting text from Word or some other tool of choice. Permitting that necessarily pollutes the "binary" approach with byte streams delivered by text-oriented applications.
If I could be sure that publCIF, say, can be compiled with libraries that reliably transcode byte streams imported from clipboards and file import (across the mess of SMB/NFS mounts etc. that exist in the real world) - and equally reliably transcode its UTF8 encoded text to the author's locale-based clipboard, then I'd be more willing to promote option 3 to the top as the starting point at least for CIF 2.0 (but its "enforcement" does depend on the availability of such a robust CIF-editing tool). I prefer the UTF8 + UTF16 option over UTF8-only because of the real-world use case that Herbert has described before; and in existing imgCIF applications the UTF16 encoding is being done rather carefully and for a specific purpose. I put option 5 at the bottom because of the non-portability of a "local" encoding. Note, though, that whatever the outcome I would still favour the discussion of character set encodings to be presented as a Part 3 to the complete CIF2 spec. Best wishes Brian _________________________________________________________________________ Brian McMahon tel: +44 1244 342878 Research and Development Officer fax: +44 1244 314888 International Union of Crystallography e-mail: bm at iucr.org 5 Abbey Square, Chester CH1 2HU, England On Thu, Sep 23, 2010 at 10:37:48AM +1000, James Hester wrote: > Dear CIF2 encoding participants, > > As Herbert has indicated, we are starting to run out of time for > resolution of the encoding issue. I believe that we have now explored > the various proposals sufficiently to all have a good understanding of > the consequences and advantages of each approach. So, after a round > of final comments, I propose that we vote on the general scheme that > we recommend. We can then flesh out the details of the particular > scheme that we have settled on, and take this completed proposal to > the DDLm group for their approval, following which we will present the > entire CIF2 syntax document to COMCIFS for a formal vote. 
> > The proposals that I believe are still on the table are: > > 1. Herbert's 'as for CIF1 proposal' recently posted here and to COMCIFS. > 2. Herbert's 'as for CIF1 proposal', together with Brian's proposal > (if you agree that they are compatible) > 3. UTF8-only as in the original draft > 4. UTF8 + UTF16 > 5. UTF8, UTF16 + "local" > > I have not included the hashcode proposal as I believe it no longer > has any supporters. > > We would need to conduct a preferential vote. I stress that this is > purely to determine the recommendation of this working group, and is > not in any way binding on COMCIFS. > > James. > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Fri Sep 24 13:17:14 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Fri, 24 Sep 2010 08:17:14 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <526633.3484.qm@web87004.mail.ird.yahoo.com> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> Message-ID: If he ignores the standard, in most cases all he has to do to comply with CIF2 is to run whatever applications he currently runs to produce CIF1 and, perhaps, in some cases, run a minor edit pass at the end, to convert for the minor syntactic differences and/or changed tags required to comply with CIF2 and the new dictionaries, but he is unlikely to have to do anything to deal with the messy business of whether his encoding is really a proper UTF8 encoding or not.
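The question of "whether his encoding is really a proper UTF8 encoding or not" can be made concrete with a small check. What follows is only a sketch in Python, a language the thread itself does not use; the helper name `is_valid_utf8` and the sample word are hypothetical and chosen purely for illustration. It shows that bytes written by a tool using a legacy 8-bit encoding fail a strict UTF-8 decode, and that the same bytes read back as different accented letters depending on which legacy encoding the reader assumes, while the plain ASCII part is unchanged.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The same word as written by a UTF-8 tool and by a Latin-1 (ISO-8859-1) tool.
utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'
legacy_bytes = "café".encode("latin-1")   # b'caf\xe9'

print(is_valid_utf8(utf8_bytes))     # the UTF-8 form passes the strict check
print(is_valid_utf8(legacy_bytes))   # the Latin-1 form does not

# Reading the legacy bytes under two different 8-bit assumptions:
# the ASCII letters survive, only the accented letter changes.
print(legacy_bytes.decode("latin-1"))
print(legacy_bytes.decode("mac_roman"))
```

A strict decode of this kind only tells you that a file is well-formed UTF-8; it cannot tell you which legacy encoding a failing file was written in, which is the part that still has to be settled by convention or by the user.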
The punishment, if he tries to comply, is that he has to totally uproot and reconfigure the environment in which he produces CIFs from whatever he is currently doing to create an environment in which he can reliably create and, more importantly, transmit compliant UTF8 files. This can be very tricky if he does only a partial job, say fudging in one special application (yet to be written), because if he stays with his old system, all kinds of tools will keep trying to transcode whatever he has produced back to whatever his system considers a standard. Those of us who have files, applications and tools that have lived through several generations of Macs are living proof of the problem. Macs now have excellent UTF8/16 Unicode support, but every once in a while in working with a Unicode file I find it has been strangely and unexpectedly converted to something else, and it can be really tricky to spot when the unaccented roman text part has been left untouched but just a few accented letters have gotten different accents. Mandating UTF8 is simply trying to shift a serious software problem from the central handlers of CIF (IUCr, PDB, etc.) to the external users. Most users will probably have the good sense to simply ignore the demand and leave the burden just where it is now. A few sophisticated users will probably adapt with no trouble, but the punishment for those users who blindly follow orders before we have a complete multiplatform supporting infrastructure in place by mandating UTF8 is severe, expensive and undeserved. Until and unless we have developed solid support, we will just be alienating people from CIF. I will continue to oppose such a move. Simon, I beg you to change your vote.
Once we have the rest of CIF2 in place and supported, I will be happy to cooperate in trying to develop the software support we would need to make UTF8/UTF16 work well for users on Mac, Linux and Windows, but it is a big job that I do not believe can be done soon enough and well enough for options 3 through 5 to make sense right now. Please do not "make the perfect the enemy of the good". ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Fri, 24 Sep 2010, SIMON WESTRIP wrote: > I do not understand why a user who adheres to a CIF2 standard > that specifies an encoding will be 'punished'? > What worries me about a specification that allows any encoding > is that users who ignore any recommendations regarding > a preferred encoding might experience difficulties when e.g. > submitting their CIF to a journal/archive, even though they > have adhered to the standard (unjustly punished). > > With regard to the lack of CIF2 software support, surely CIF2 > in general is of little use to users, not just its encoding requirements. > But perhaps you already have CIF2 software that can be dropped into existing > workflows save for the fact that it would require modification to work > with 'specified encodings'? > > > > ____________________________________________________________________________ > From: Herbert J. Bernstein > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Friday, 24 September, 2010 2:03:50 > Subject: Re: [Cif2-encoding] How we wrap this up > > I see no point in a final specification that users will > ignore, and that will actually punish users who > pay attention to it. That is not a useful standard, > and very damaging to the CIF brand.
We should be > promulgating reasonable standards that we expect will > in fact be adhered to, not ignored. In the present > state of lack of software support and clear guidance, > all the prescriptive UTF8 recommendations are unhelpful > to users who read and pay attention to what the standard > says. > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Thu, 23 Sep 2010, SIMON WESTRIP wrote: > > > I agree to some extent with what you say, Herbert, but I'm > > a bit more optimistic (for once) that the IUCr at least will be able to > > adapt to > > a 'specified encoding' system relatively quickly, and in the interim > > certainly not reject non-UTFx CIFs. I'm not convinced that whatever > > appears in the specification will have any influence on user practice, > > especially in the non-IUCr world; rather I think the success (or > > otherwise) > > of CIF2 will result from the software that implements it (as you suggest). > > I don't share your pessimism about the potential confusion of specifying > > UTF8 etc., > > and certainly don't think that a restricted encoding will be any more > > confusing than > > 'any encoding', given, as you say, "people may not understand what they > > are > > doing..." > > > > I suppose much of the difference in our views lies in our perception of > > user > > interest - > > I suspect there may even be overlap in this respect - but I'm perhaps less > > inclined to > > think that the final specification will have a marked influence on users: > > "they can keep doing whatever they are currently doing that is currently > > working for them" > > > > Anyway, it's not me you have to convince :-), and it's time I went to bed!
> > > > Cheers > > > > Simon > > > > ____________________________________________________________________________ > > From: Herbert J. Bernstein > > To: Group for discussing encoding and content validation schemes for CIF2 > > > > Sent: Thursday, 23 September, 2010 22:39:24 > > Subject: Re: [Cif2-encoding] How we wrap this up > > > > Dear Simon, > > > > That is precisely the point -- there is a serious and growing > > problem with encodings. The strict UTF8 proposal then makes > > it a universal problem for everybody using CIF, and we do _not_ > > have a coherent means set up to deal with it. The substitution > > of UTF8 for ASCII in the CIF1 spec does not, in and of itself > > make anything worse for anybody currently receiving 128 character > > ASCII -- it is identical, and it does not force users working > > in other systems that the IUCr journals are currently coping > > with to jump into the boiling water, they can keep doing whatever > > they are currently doing that is currently working for them > > and the IUCr. All the journals have to do until something that > > actually supports not-lower-128-ASCII is ready is to tell people that > > for > > the journals they will still have to use Brian's reverse solidus > > escape codes for anything else -- nothing major changes for most > > people. If and when there really is a coherent scheme to support > > more native Unicode code points for journal submission with tested > > software, then we can do something more. Right now, proposals > > 3, 4 and 5 will make things worse for large numbers of users > > and not really make anything better for the IUCr. It is too > > early in the UTF8 conversion process. > > > > ===================================================== > > Herbert J. Bernstein, Professor of Computer Science > > Dowling College, Kramer Science Center, KSC 121 > > Idle Hour Blvd, Oakdale, NY, 11769 > > > > +1-631-244-3035 > >
yaya at dowling.edu > > ===================================================== > > > > On Thu, 23 Sep 2010, SIMON WESTRIP wrote: > > > > > Just because I'm still at my desk - and despite the fact that I told > > > myself > > > I would not > > > contribute further beyond my vote - it might be worth mentioning that > > > the > > > IUCr are already > > > experiencing problems related to encoding issues (in their web > > > services), > > > and the occurrence > > > of such problems is most likely to increase when CIFs can contain > > > non-ASCII > > > text. > > > > > > Cheers > > > > > > Simon > > > > > > ____________________________________________________________________________ > > > From: Herbert J. Bernstein > > > To: Group for discussing encoding and content validation schemes for > > > CIF2 > > > > > > Sent: Thursday, 23 September, 2010 21:31:24 > > > Subject: Re: [Cif2-encoding] How we wrap this up > > > > > > Votes: > > > > > > In terms of the requested preference voting, I vote in declining order > > > of > > > preference > > > > > > 1, then 2, then (big gap) 5, then 4, then 3. > > > > > > On absolute voting up or down in COMCIFS, I will accept 1 or 2, but will > > > lobby against and vote strongly against 3, 4, and 5. > > > > > > Explanation: > > > > > > I am not opposed to Brian's recommendations. The only reason I would vote > > > for 1 over 2 is that I fear Brian's recommendation would generate yet > > > more debate over the precise details and I believe we have more than > > > run out of time to get something concrete ready for the IUCr meeting. > > > > > > I am very strongly opposed to 3, 4 and 5 because I believe they will > > > cause confusion and delay in adoption of CIF2, while choices > > > 1 and 2 keep the practices the community and the IUCr have lived > > > with successfully for many years, simply applying them to UTF8 > > > instead of ASCII.
People may not understand what they are doing > > > in that mode, but they manage to successfully submit CIFs to the > > > IUCr that way, and we don't have software ready to support anything > > > else. > > > > > > -- Herbert > > > > > > At 8:13 PM +0000 9/23/10, SIMON WESTRIP wrote: > > > >Faced with the options: > > > > > > > >1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' > > > >recently posted here and to COMCIFS. > > > >2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', > > > >together with Brian's *recommendations* > > > >3. UTF8-only as in the original draft > > > >4. UTF8 + UTF16 > > > >5. UTF8, UTF16 + "local" > > > > > > > >I have to vote for (4). > > > > > > > >When it comes down to it, I believe that the specification of a > > > >'standard' should not be based on uncertainty, > > > >and as 'any encoding' presents uncertainty, it should not be in the > > > >standard. > > > > > > > >I might be accused of changing my position (I have recently > > > >expressed support for flexibility and even a qualified > > > >acceptance of the 'as for CIF1 proposal with UTF8 in place of > > > >ASCII'), but part of the value of these discussions > > > >is to question your own views in the light of others' perspectives. > > > >Indeed, I have found these discussions > > > >extremely informative and am now in a far better position to handle > > > >the realities of introducing non-ASCII CIFs, > > > >whatever the final COMCIFS decision. > > > > > > > >Cheers > > > > > > > >Simon > > > > > > > > > > > > > > > >From: "Bollinger, John C" > > > >To: Group for discussing encoding and content validation schemes for > > > >CIF2 > > > >Sent: Thursday, 23 September, 2010 15:02:25 > > > >Subject: Re: [Cif2-encoding] How we wrap this up > > > > > > > >On Thursday, September 23, 2010 5:46 AM, SIMON WESTRIP wrote: > > > > > > > >>1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' > > > >>recently posted here and to COMCIFS. > > > >>2.
Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', > > > >>together with Brian's *recommendations* > > > >>3. UTF8-only as in the original draft > > > >>4. UTF8 + UTF16 > > > >>5. UTF8, UTF16 + "local" > > > >> > > > >>These can be broken down to: > > > >> > > > >>'any encoding' (1, 2, and 5) > > > >> > > > >>'specified encoding' (3 and 4) > > > >> > > > >>Note I put 5 in the 'any encoding' category as I think 'local' > > > >>could be interpreted as any encoding. > > > > > > > >I agree that 'local' could be interpreted as "any encoding", but I > > > >choose to view it as "context-dependent". Thus a file that is > > > >CIF-conformant on one computer might not be CIF-conformant on > > > >another. Some will find that unsatisfactory. In my view, however, > > > >it is the best interpretation of CIF1's provisions; its purpose is > > > >thus to ensure that *all* well-formed CIF1 files are also > > > >well-formed CIF2 files (a context-dependent question). Lest I > > > >appear to overstate the case, I acknowledge that the UTF8-only and > > > >UTF-8 + UTF-16 proposals would have the result that a large majority > > > >of well-formed CIF1 files are also well-formed CIF2 files. The > > > >variations of Herb's proposal probably also make all well-formed > > > >CIF1 files well-formed CIF2 files, but I disfavor them on different > > > >grounds (mostly that they are too open to differing interpretations). > > > > > > > >[...] > > > > > > > >>In either case, a degree of work will be required to accommodate > > > >>user practice and the legacy of CIF1. > > > > > > > >I think the entire question reduces to which accommodations for the > > > >CIF1 legacy are assured by CIF2 vs. which will constitute > > > >non-standard extensions. I don't think that individual responses, > > > >from Chester for example, are likely to depend much on which option > > > >is adopted, but I do think the overall consistency of responses will > > > >be affected.
Thus I favor precision of the specification and > > > >coverage of the likely uses, in hope of achieving the greatest > > > >consistency of response. > > > > > > > >I doubt this has swayed anyone's opinion, so please consider it an > > > >advance explanation for my upcoming vote (inasmuch as I rely on > > > >James's previous assurance that voting rights in this context are > > > >not restricted to COMCIFS members). > > > > > > > > > > > >Best Regards, > > > > > > > >John > > > >-- > > > >John C. Bollinger, Ph.D. > > > >Department of Structural Biology > > > >St. Jude Children's Research Hospital > > > > > > > > > > > >Email Disclaimer: > > > >www.stjude.org/emaildisclaimer > > > >_______________________________________________ > > > >cif2-encoding mailing list > > > >cif2-encoding at iucr.org > > > >http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > > > > > > > > > >_______________________________________________ > > > >cif2-encoding mailing list > > > >cif2-encoding at iucr.org > > > >http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > > > > > > > -- > > > ===================================================== > > > Herbert J. Bernstein, Professor of Computer Science > > > Dowling College, Kramer Science Center, KSC 121 > > > Idle Hour Blvd, Oakdale, NY, 11769 > > > > > > +1-631-244-3035 > > >
yaya at dowling.edu > > > ===================================================== > > > _______________________________________________ > > > cif2-encoding mailing list > > > cif2-encoding at iucr.org > > > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > > > > > > > > > > From simonwestrip at btinternet.com Fri Sep 24 13:53:10 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Fri, 24 Sep 2010 12:53:10 +0000 (GMT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> Message-ID: <613218.81205.qm@web87011.mail.ird.yahoo.com> Dear Herbert Not for the first time, I find your argument persuasive. Brian's vote and explanation have also raised some questions that I would like to look into. I will confirm or otherwise my vote as soon as possible, assuming that is OK with James and assuming that this round of votes might wrap this up. Cheers Simon ________________________________ From: Herbert J. Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Friday, 24 September, 2010 13:17:14 Subject: Re: [Cif2-encoding] How we wrap this up If he ignores the standard, in most cases all he has to do to comply with CIF2 is to run whatever applications he currently runs to produce CIF1 and, perhaps, in some cases, run a minor edit pass at the end, to convert for the minor syntactic differences and/or changed tags required to comply with CIF2 and the new dictionaries, but he is unlikely to have to do anything to deal with the messy business of whether his encoding is really a proper UTF8 encoding or not.
The punishment, if he tries to comply, is that he has to totally uproot and reconfigure the environment in which he produces CIFs from whatever he is currently doing to create an environment in which he can reliably create and, more importantly, transmit compliant UTF8 files. This can be very tricky if he does only a partial job, say fudging in one special application (yet to be written), because if he stays with his old system, all kinds of tools will keep trying to transcode whatever he has produced back to whatever his system considers a standard. Those of us who have files, applications and tools that have lived through several generations of Macs are living proof of the problem. Macs now have excellent UTF8/16 Unicode support, but every once in a while in working with a Unicode file I find it has been strangely and unexpectedly converted to something else, and it can be really tricky to spot when the unaccented roman text part has been left untouched but just a few accented letters have gotten different accents. Mandating UTF8 is simply trying to shift a serious software problem from the central handlers of CIF (IUCr, PDB, etc.) to the external users. Most users will probably have the good sense to simply ignore the demand and leave the burden just where it is now. A few sophisticated users will probably adapt with no trouble, but the punishment for those users who blindly follow orders before we have a complete multiplatform supporting infrastructure in place by mandating UTF8 is severe, expensive and undeserved. Until and unless we have developed solid support, we will just be alienating people from CIF. I will continue to oppose such a move. Simon, I beg you to change your vote.
Once we have the rest of CIF2 in place and supported, I will be happy to cooperate in trying to develop the software support we would need to make UTF8/UTF16 work well for users on Mac, Linux and Windows, but it is a big job that I do not believe can be done soon enough and well enough for options 3 through 5 to make sense right now. Please do not "make the perfect the enemy of the good". ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Fri, 24 Sep 2010, SIMON WESTRIP wrote: > I do not understand why a user who adhere's to a CIF2 standard > that specifies an encoding will be 'punished'? > What worries me about a specification that allows any encodng > is that users who ignore any recommendations regarding > a preferred encoding might experience difficulties when e.g. > submitting their CIF to a journal/archive, even though they > have adhered to the standard (unjustly punished). > > With regard to the lack of CIF2 software support, surely CIF2 > in general is of little use to users, not just its encoding requirements. > But perhaps you already have CIF2 software that can be dropped into existing > workflows save for the fact that it would require modification to work > with 'specified encodings'? > > > > ____________________________________________________________________________ > From: Herbert J. Bernstein > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Friday, 24 September, 2010 2:03:50 > Subject: Re: [Cif2-encoding] How we wrap this up > > I see not point in a final specification that users will > ignore, and that will actually punish users who > pay attention to it. That is not a useful standard, > and very damaging to the CIF brand. 
We should be > promolgating reasonable standards that we expect will > in fact be adhered to, not ignored. In the present > state of lack of software support and clear guidance, > all the prescriptive UTF8 recommendations are unhelpful > to users who read and pay attention to what the standard > says. > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Thu, 23 Sep 2010, SIMON WESTRIP wrote: > > > I agree to some extent with what you say, Herbert, but I'm > > a bit more optomistic (for once) that the IUCr at least will be able to > > adapt to > > a 'specified encoding' system relatively quickly, and in the interim > > certainly not reject non-UTFx CIFs. I'm not convinced that whatever > > appears in the specification will have any influence on user practice, > > especially in the non-IUCr world; rather I think the success (or > otherwise) > > of CIF2 will result from the software that implements it (as you suggest). > > I don't share your pessimism about the potential confusion of specifying > > UTF8 etc., > > and certainly don't think that a restricted encoding will be any more > > confusing than > > 'any encoding', given, as you say, "people may not understand what they > are > > doing..." > > > > I suppose much of the difference in our views lies in our perception of > user > > interest - > > I suspect there may even be overlap in this respect - but I'm perhaps less > > inclined to > > think that the final specification will have a marked influence on users: > > "they can keep doing whatever they are currently doing that is currently > > working for them" > > > > Anyway, its not me you have to convince :-), and its time I went to bed! 
> >
> > Cheers
> >
> > Simon
> >
> > ____________________________________________________________________________
> > From: Herbert J. Bernstein
> > To: Group for discussing encoding and content validation schemes for CIF2
> > Sent: Thursday, 23 September, 2010 22:39:24
> > Subject: Re: [Cif2-encoding] How we wrap this up
> >
> > Dear Simon,
> >
> > That is precisely the point -- there is a serious and growing
> > problem with encodings. The strict UTF8 proposal then makes
> > it a universal problem for everybody using CIF, and we do _not_
> > have a coherent means set up to deal with it. The substitution
> > of UTF8 for ASCII in the CIF1 spec does not, in and of itself,
> > make anything worse for anybody currently receiving 128 character
> > ASCII -- it is identical, and it does not force users working
> > in other systems that the IUCr journals are currently coping
> > with to jump into the boiling water; they can keep doing whatever
> > they are currently doing that is currently working for them
> > and the IUCr. All the journals have to do until something that
> > actually supports not-lower-128-ASCII is ready is to tell people that
> > for the journals they will still have to use Brian's reverse solidus
> > escape codes for anything else -- nothing major changes for most
> > people. If and when there really is a coherent scheme to support
> > more native Unicode code points for journal submission with tested
> > software, then we can do something more. Right now, proposals
> > 3, 4 and 5 will make things worse for large numbers of users
> > and not really make anything better for the IUCr. It is too
> > early in the UTF8 conversion process.
> >
> > =====================================================
> > Herbert J. Bernstein, Professor of Computer Science
> > Dowling College, Kramer Science Center, KSC 121
> > Idle Hour Blvd, Oakdale, NY, 11769
> >
> > +1-631-244-3035
> > yaya at dowling.edu
> > =====================================================
> >
> > On Thu, 23 Sep 2010, SIMON WESTRIP wrote:
> >
> > > Just because I'm still at my desk - and despite the fact that I told
> > > myself I would not contribute further beyond my vote - it might be
> > > worth mentioning that the IUCr are already experiencing problems
> > > related to encoding issues (in their web services), and the occurrence
> > > of such problems is most likely to increase when CIFs can contain
> > > non-ASCII text.
> > >
> > > Cheers
> > >
> > > Simon
> > >
> > > ___________________________________________________________________________
> > > From: Herbert J. Bernstein
> > > To: Group for discussing encoding and content validation schemes for CIF2
> > > Sent: Thursday, 23 September, 2010 21:31:24
> > > Subject: Re: [Cif2-encoding] How we wrap this up
> > >
> > > Votes:
> > >
> > > In terms of the requested preference voting, I vote in declining order
> > > of preference
> > >
> > > 1, then 2, then (big gap) 5, then 4, then 3.
> > >
> > > On absolute voting up or down in COMCIFS, I will accept 1 or 2, but will
> > > lobby against and vote strongly against 3, 4, and 5.
> > >
> > > Explanation:
> > >
> > > I am not opposed to Brian's recommendations. The only reason I would vote
> > > for 1 over 2 is that I fear Brian's recommendation would generate yet
> > > more debate over the precise details and I believe we have more than
> > > run out of time to get something concrete ready for the IUCr meeting.
> > >
> > > I am very strongly opposed to 3, 4 and 5 because I believe they will
> > > cause confusion and delay in adoption of CIF2, while choices
> > > 1 and 2 keep the practices the community and the IUCr have lived
> > > with successfully for many years, simply applying them to UTF8
> > > instead of ASCII. People may not understand what they are doing
> > > in that mode, but they manage to successfully submit CIFs to the
> > > IUCr that way, and we don't have software ready to support anything
> > > else.
> > >
> > > -- Herbert
> > >
> > > At 8:13 PM +0000 9/23/10, SIMON WESTRIP wrote:
> > > >Faced with the options:
> > > >
> > > >1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII'
> > > >recently posted here and to COMCIFS.
> > > >2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII',
> > > >together with Brian's *recommendations*
> > > >3. UTF8-only as in the original draft
> > > >4. UTF8 + UTF16
> > > >5. UTF8, UTF16 + "local"
> > > >
> > > >I have to vote for (4).
> > > >
> > > >When it comes down to it, I believe that the specification of a
> > > >'standard' should not be based on uncertainty,
> > > >and as 'any encoding' presents uncertainty, it should not be in the
> > > >standard.
> > > >
> > > >I might be accused of changing my position (I have recently
> > > >expressed support for flexibility and even a qualified
> > > >acceptance of the 'as for CIF1 proposal with UTF8 in place of
> > > >ASCII'), but part of the value of these discussions
> > > >is to question your own views in the light of others' perspectives.
> > > >Indeed, I have found these discussions
> > > >extremely informative and am now in a far better position to handle
> > > >the realities of introducing non-ASCII CIFs,
> > > >whatever the final COMCIFS decision.
> > > >
> > > >Cheers
> > > >
> > > >Simon
> > > >
> > > >From: "Bollinger, John C"
> > > >To: Group for discussing encoding and content validation schemes for
> > > >CIF2
> > > >Sent: Thursday, 23 September, 2010 15:02:25
> > > >Subject: Re: [Cif2-encoding] How we wrap this up
> > > >
> > > >On Thursday, September 23, 2010 5:46 AM, SIMON WESTRIP wrote:
> > > >
> > > >>1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII'
> > > >>recently posted here and to COMCIFS.
> > > >>2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII',
> > > >>together with Brian's *recommendations*
> > > >>3. UTF8-only as in the original draft
> > > >>4. UTF8 + UTF16
> > > >>5. UTF8, UTF16 + "local"
> > > >>
> > > >>These can be broken down to:
> > > >>
> > > >>'any encoding' (1, 2, and 5)
> > > >>
> > > >>'specified encoding' (3 and 4)
> > > >>
> > > >>Note I put 5 in the 'any encoding' category as I think 'local'
> > > >>could be interpreted as any encoding.
> > > >
> > > >I agree that 'local' could be interpreted as "any encoding", but I
> > > >choose to view it as "context-dependent". Thus a file that is
> > > >CIF-conformant on one computer might not be CIF-conformant on
> > > >another. Some will find that unsatisfactory. In my view, however,
> > > >it is the best interpretation of CIF1's provisions; its purpose is
> > > >thus to ensure that *all* well-formed CIF1 files are also
> > > >well-formed CIF2 files (a context-dependent question). Lest I
> > > >appear to overstate the case, I acknowledge that the UTF8-only and
> > > >UTF-8 + UTF-16 proposals would have the result that a large majority
> > > >of well-formed CIF1 files are also well-formed CIF2 files. The
> > > >variations of Herb's proposal probably also make all well-formed
> > > >CIF1 files well-formed CIF2 files, but I disfavor them on different
> > > >grounds (mostly that they are too open to differing interpretations).
> > > >
> > > >[...]
> > > >
> > > >>In either case, a degree of work will be required to accommodate
> > > >>user practice and the legacy of CIF1.
> > > >
> > > >I think the entire question reduces to which accommodations for the
> > > >CIF1 legacy are assured by CIF2 vs. which will constitute
> > > >non-standard extensions. I don't think that individual responses,
> > > >from Chester for example, are likely to depend much on which option
> > > >is adopted, but I do think the overall consistency of responses will
> > > >be affected. Thus I favor precision of the specification and
> > > >coverage of the likely uses, in hope of achieving the greatest
> > > >consistency of response.
> > > >
> > > >I doubt this has swayed anyone's opinion, so please consider it an
> > > >advance explanation for my upcoming vote (inasmuch as I rely on
> > > >James's previous assurance that voting rights in this context are
> > > >not restricted to COMCIFS members).
> > > >
> > > >Best Regards,
> > > >
> > > >John
> > > >--
> > > >John C. Bollinger, Ph.D.
> > > >Department of Structural Biology
> > > >St. Jude Children's Research Hospital
> > > >
> > > >Email Disclaimer:
> > > >www.stjude.org/emaildisclaimer
> > > >_______________________________________________
> > > >cif2-encoding mailing list
> > > >cif2-encoding at iucr.org
> > > >http://scripts.iucr.org/mailman/listinfo/cif2-encoding
> > >
> > > --
> > > =====================================================
> > > Herbert J. Bernstein, Professor of Computer Science
> > > Dowling College, Kramer Science Center, KSC 121
> > > Idle Hour Blvd, Oakdale, NY, 11769
> > >
> > > +1-631-244-3035
> > > yaya at dowling.edu
> > > =====================================================

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100924/8d44fc52/attachment-0001.html

From yaya at bernstein-plus-sons.com Fri Sep 24 14:22:01 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Fri, 24 Sep 2010 09:22:01 -0400 (EDT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <613218.81205.qm@web87011.mail.ird.yahoo.com>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com>
Message-ID:

Dear Simon,

Thank you very much for being willing to reconsider. I will be praying.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

On Fri, 24 Sep 2010, SIMON WESTRIP wrote:

> Dear Herbert
>
> Not for the first time, I find your argument persuasive. Brian's vote and
> explanation have also raised some questions that I would like to look into.
>
> [...]

From John.Bollinger at STJUDE.ORG Fri Sep 24 14:50:57 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Fri, 24 Sep 2010 08:50:57 -0500
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <613218.81205.qm@web87011.mail.ird.yahoo.com>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local>

Dear Simon,

It is exactly this sort of issue that drove me to support more permissive
encoding rules and ultimately to devise the UTF-8 + UTF-16 + local proposal.
Do please think about the considerations Herb raised.

As you reconsider your votes, I urge you also to ask yourself what,
*precisely*, a "text file" is, and to consider whether your answer is
functionally different from my "local". If you decide not, then please
consider what that answer implies about CIF2 support of UTF-8 and UTF-16
(which evidently you favor) under each option on the table, especially for
CIFs containing non-ASCII characters. Whatever you decide about the meaning
of "text file", please consider whether reasonable people might reach a
different conclusion, as I assert they might do, and to what extent the
standard needs to address that.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

>From: cif2-encoding-bounces at iucr.org [mailto:cif2-encoding-bounces at iucr.org] On Behalf Of SIMON WESTRIP
>Sent: Friday, September 24, 2010 7:53 AM
>To: Group for discussing encoding and content validation schemes for CIF2
>Subject: Re: [Cif2-encoding] How we wrap this up
>
>Dear Herbert
>
>Not for the first time, I find your argument persuasive. Brian's vote and
>explanation have also raised some questions that I would like to look into.
>
>I will confirm or otherwise my vote as soon as possible, assuming that is OK
>with James and assuming that this round of votes might wrap this up.
>
>Cheers
>
>Simon
>
>________________________________________
>From: Herbert J. Bernstein
>To: Group for discussing encoding and content validation schemes for CIF2
>Sent: Friday, 24 September, 2010 13:17:14
>Subject: Re: [Cif2-encoding] How we wrap this up
>
>If he ignores the standard, in most cases all he has to do to comply with
>CIF2 is to run whatever applications he currently runs to produce CIF1 and,
>perhaps, in some cases, run a minor edit pass at the end, to convert for the
>minor syntactic differences and/or changed tags required to comply with CIF2
>and the new dictionaries, but he is unlikely to have to do anything to deal
>with the messy business of whether his encoding is really a proper UTF8
>encoding or not.
>
>The punishment if he tries to comply is that he has to totally uproot and
>reconfigure the environment in which he produces CIFs from whatever he is
>currently doing to create an environment in which he can reliably create and,
>more importantly, transmit compliant UTF8 files. This can be very tricky if
>he does only a partial job, say fudging in one special application (yet to
>be written), because if he stays with his old system, all kinds of tools
>will keep trying to transcode whatever he has produced back to whatever his
>system considers a standard.
>Those of us who have files, applications and tools that have lived through
>several generations of Macs are living proof of the problem. Macs now have
>excellent UTF8/16 Unicode support, but every once in a while in working with
>a Unicode file I find it has been strangely and unexpectedly converted to
>something else, and it can be really tricky to spot when the unaccented
>roman text part has been left untouched but just a few accented letters have
>gotten different accents.
>
>Mandating UTF8 is simply trying to shift a serious software problem from the
>central handlers of CIF (IUCr, PDB, etc.) to the external users. Most users
>will probably have the good sense to simply ignore the demand and leave the
>burden just where it is now. A few sophisticated users will probably adapt
>with no trouble, but the punishment for those users who blindly follow
>orders before we have a complete multiplatform supporting infrastructure in
>place by mandating UTF8 is severe, expensive and undeserved. Until and
>unless we have developed solid support, we will just be alienating people
>from CIF. I will continue to oppose such a move.

[...]

Email Disclaimer: www.stjude.org/emaildisclaimer

From simonwestrip at btinternet.com Fri Sep 24 15:05:30 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Fri, 24 Sep 2010 14:05:30 +0000 (GMT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <20100924092359.GA21694@emerald.iucr.org>
References: <20100924092359.GA21694@emerald.iucr.org>
Message-ID: <502393.2327.qm@web87016.mail.ird.yahoo.com>

Hi Brian

Just for info (not argument), regarding your concerns over fonts and the
clipboard:

Fonts: this can be an issue. The libraries I use access the system fonts to
locate the appropriate glyphs. publCIF is also distributed with an
open-source font that contains a few modified glyphs to represent e.g.
delocalized double bonds.
Clipboard: publCIF reads the clipboard looking for html or plain text and attempts to convert the data to CIF format when pasting into the CIF (i.e. converting symbols to their ASCII CIF codes and discarding unwanted html tags). publBio employs a similar mechanism to intercept pasting into some of its html input boxes. So although this is an issue that we should be aware of, and should review, I think communication with the clipboard should not be too much cause for concern. Writing to the clipboard is less of a problem - the writer controls what is written.

File import across mounts etc: this would require research.

So import (and maybe clipboard communication) might indeed require further work beyond what has already been done in publCIF to prevent inclusion of non-CIF text.

Cheers

Simon

________________________________
From: Brian McMahon
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Friday, 24 September, 2010 10:23:59
Subject: Re: [Cif2-encoding] How we wrap this up

My vote:

Preference  Option
    1       2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together with Brian's *recommendations*
    2       1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently posted here and to COMCIFS.
    3       4. UTF8 + UTF16
    4       3. UTF8-only as in the original draft
    5       5. UTF8, UTF16 + "local"

Rationale:

I still feel this argument is at heart a "binary/text" dichotomy, where "binary" implies that one can prescribe specific byte-level representations of every distinct character; "text" implies that you're at the mercy of external libraries and mappings between encoding conventions - and those mappings are not always explicit or easy to identify. I sympathise greatly with James's desire for a prescriptive, "binary" approach, but its corollary is that a CIF application must take full responsibility for expressing any supported extended character set (I mean accented Latin letters, Greek characters, Cyrillic or Chinese alphabets).
First off, I don't know how difficult that is technically. I would guess that rather than trying to handle arbitrary keyboard mappings, the natural approach would be to pick from a graphical character grid. (What are the implications for this of glyph rendering - does a CIF editor have to be compiled with its own large font library?) But that's a laborious method of authoring if relatively large amounts of "non-standard" text are involved, and the way that authors would prefer to work, surely, is by copying and pasting text from Word or some other tool of choice. Permitting that necessarily pollutes the "binary" approach with byte streams delivered by text-oriented applications.

If I could be sure that publCIF, say, can be compiled with libraries that reliably transcode byte streams imported from clipboards and file import (across the mess of SMB/NFS mounts etc. that exist in the real world) - and equally reliably transcode its UTF8 encoded text to the author's locale-based clipboard, then I'd be more willing to promote option 3 to the top as the starting point at least for CIF 2.0 (but its "enforcement" does depend on the availability of such a robust CIF-editing tool).

I prefer the UTF8 + UTF16 option over UTF8-only because of the real-world use case that Herbert has described before; and in existing imgCIF applications the UTF16 encoding is being done rather carefully and for a specific purpose.

I put option 5 at the bottom because of the non-portability of a "local" encoding.

Note, though, that whatever the outcome I would still favour the discussion of character set encodings to be presented as a Part 3 to the complete CIF2 spec.
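As a hedged illustration of the transcoding step discussed above (importing locale-encoded bytes and re-encoding them as UTF8), the following Python sketch uses only the standard bytes.decode / str.encode machinery. The function names and the sample value are hypothetical and are not taken from publCIF or any CIF library. It assumes the source encoding is already known; the last lines show that a wrong guess can pass silently, which is the failure mode raised earlier in this thread.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes with a declared encoding and re-encode as UTF-8.

    Strict error handling makes undecodable input raise an exception
    instead of being silently replaced.
    """
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (pure ASCII qualifies)."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative CIF fragment; the author name is made up for this example.
sample = "_publ_author_name 'José Müller'\n"

# Bytes as a Latin-1 locale might deliver them from a clipboard or file import.
latin1_bytes = sample.encode("latin-1")

# Correct transcoding when the source encoding is known.
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")

# A wrong guess does not raise: Latin-1 accepts every byte value, so UTF-8
# bytes read as Latin-1 yield different accented characters with no error.
garbled = utf8_bytes.decode("latin-1")

print(looks_like_utf8(latin1_bytes))
print(looks_like_utf8(utf8_bytes))
print(garbled)
```

The well-formedness check can reject many non-UTF8 byte streams, but it cannot confirm which legacy encoding produced them, so any such tool would still depend on the platform or the user to supply the source encoding.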
Best wishes
Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail: bm at iucr.org
5 Abbey Square, Chester CH1 2HU, England

On Thu, Sep 23, 2010 at 10:37:48AM +1000, James Hester wrote:
> Dear CIF2 encoding participants,
>
> As Herbert has indicated, we are starting to run out of time for
> resolution of the encoding issue. I believe that we have now explored
> the various proposals sufficiently to all have a good understanding of
> the consequences and advantages of each approach. So, after a round
> of final comments, I propose that we vote on the general scheme that
> we recommend. We can then flesh out the details of the particular
> scheme that we have settled on, and take this completed proposal to
> the DDLm group for their approval, following which we will present the
> entire CIF2 syntax document to COMCIFS for a formal vote.
>
> The proposals that I believe are still on the table are:
>
> 1. Herbert's 'as for CIF1 proposal' recently posted here and to COMCIFS.
> 2. Herbert's 'as for CIF1 proposal', together with Brian's proposal
> (if you agree that they are compatible)
> 3. UTF8-only as in the original draft
> 4. UTF8 + UTF16
> 5. UTF8, UTF16 + "local"
>
> I have not included the hashcode proposal as I believe it no longer
> has any supporters.
>
> We would need to conduct a preferential vote. I stress that this is
> purely to determine the recommendation of this working group, and is
> not in any way binding on COMCIFS.
>
> James.
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
_______________________________________________
cif2-encoding mailing list
cif2-encoding at iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding

From simonwestrip at btinternet.com Fri Sep 24 20:10:13 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Fri, 24 Sep 2010 19:10:13 +0000 (GMT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <281388.90819.qm@web87012.mail.ird.yahoo.com>

Dear James

As you may have gathered I have been reconsidering my position on this issue. Please forgive me, but I would like to change my vote if that is OK, in favour of the 'any encoding' camp. This apparent U-turn is not a response to recent contributions; rather it is the outcome of a meeting I had this morning where I demonstrated some new software to the Managing Editor of IUCr journals.

By way of explanation:

I have been developing a new docx template which the IUCr editorial office is shortly to release for use by authors. The template will be packaged with some tools to extract data from CIFs and tabulate them in the Word document, e.g. open an mmCIF, click a button, and standard tables populated with data from the CIF will be included in the document, acting as table templates for the author to edit as appropriate for their manuscript.
Inclusion of the mmCIF tools is part of an unofficial policy to 'coax' biologists to start using/accepting mmCIF as a useful medium, rather than as a product of their deposition to the PDB, and to encourage them to become comfortable with passing mmCIFs between applications, and even to edit the things (in the same way as the core-CIF community treats CIFs). For example, our perception is that there is no reason why an author should not feel free to take an mmCIF that has been created by e.g. pdb_extract and populate it using third-party software before uploading to the PDB for deposition.

This cause would not be furthered by effectively invalidating an mmCIF if it were not to be encoded in one of the specified encodings.

So although I am uneasy about a specification that propagates uncertainty, I'm also uneasy about alienating users, especially when we are struggling to change their mindset as in the case of the biological community (my perception of the biological community's attitude to mmCIF is based on feedback from authors/coeditors to IUCr journals).

Granted this may not be the most compelling argument in favour of 'any encoding', but recognizing the hurdles that may have to be overcome once we move beyond ASCII whatever the CIF2 specification, I support 'any encoding' as 'a means to an end'.

I will not provide my preferences in terms of the numbered options until you say so; after all, I have already voted and all this has to be signed off by COMCIFS in any case.

Cheers

Simon

________________________________
From: "Bollinger, John C"
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Friday, 24 September, 2010 14:50:57
Subject: Re: [Cif2-encoding] How we wrap this up

Dear Simon,

It is exactly this sort of issue that drove me to support more permissive encoding rules and ultimately to devise the UTF-8 + UTF-16 + local proposal.

Do please think about the considerations Herb raised.
As you reconsider your votes, I urge you also to ask yourself what, *precisely*, a "text file" is, and to consider whether your answer is functionally different from my "local". If you decide not, then please consider what that answer implies about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) under each option on the table, especially for CIFs containing non-ASCII characters. Whatever you decide about the meaning of "text file", please consider whether reasonable people might reach a different conclusion, as I assert they might do, and to what extent the standard needs to address that.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

From yaya at bernstein-plus-sons.com Fri Sep 24 22:25:49 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Fri, 24 Sep 2010 17:25:49 -0400 (EDT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <281388.90819.qm@web87012.mail.ird.yahoo.com>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com>
Message-ID:

Dear Simon,

Thank you, very much. You have done the right thing.
-- Herbert

Regards,
Herbert

=====================================================
Herbert J.
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From simonwestrip at btinternet.com Sat Sep 25 17:09:12 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Sat, 25 Sep 2010 16:09:12 +0000 (GMT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <281388.90819.qm@web87012.mail.ird.yahoo.com>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com>
Message-ID: <463665.7127.qm@web87004.mail.ird.yahoo.com>

Dear all

In the event that CIF2 adopts the 'any encoding' approach, would there be any objections to explicitly defining a default encoding in the specification, to be used when there are no indications to the contrary? At worst this would give CIF2 service providers an excuse to interpret CIFs as e.g. UTF8 if they couldn't determine the encoding by other means - but such intolerant service providers would soon find that their service is not successful - while at best this might raise awareness of the issues regarding encoding once non-ASCII is used in a CIF. Essentially, it does not require users to change their working practices, which is one of the main arguments for 'any encoding'.

So, CIF2 would remain 'any encoding', and specifications in terms of e.g. "Herbert's as for CIF1..." might only require a single sentence to define the default after stating what the 'preferred' encoding was; the proposal might be phrased as "Herbert's as for CIF1..."
+ "explicit default encoding"?

I do not wish to prolong this debate - if there are objections I will not launch into an endless round of exchanges that cover the same ground that has led us this far.

Cheers

Simon

From yaya at bernstein-plus-sons.com Sat Sep 25 19:18:54 2010
From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Sat, 25 Sep 2010 14:18:54 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <463665.7127.qm@web87004.mail.ird.yahoo.com> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> Message-ID: Dear Simon, Unfortunately, that is likely to take us back into our infinite loop or into a diverging spiral. Right now, we would have UTF8 as no more or less a default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad first guess as the likely default encoding for any given CIF, but not a formal constraint. I would suggest we leave the wording in that imprecise state, get CIF2 out and accepted and then work further on the encoding issue. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > Dear all > > In the event that CIF2 adopts the 'any encoding' approach, would there be > any objections to > explicitly defining a default encoding in the specification, to be defaulted > to when there were no indications > to the contrary. At worst this would give CIF2 service providers an excuse > to interpret CIFs as e.g. 
UTF8 if they couldn't > determine the encoding by other means - but such intolerant service > providers would soon find that their service is > not successful - while at best this might raise awareness of the issues > regarding encoding once non-ASCII is used in > a CIF. Essentially, it does not require users to change their working > practices, which is one of the main arguments for > 'any encoding'. > > So, CIF2 would remain 'any encoding', and specifications in terms of e.g. > "Herbert's as for CIF1..." > might only require a single sentence to define the default after stating > what the 'preferred' encoding was; > the proposal might be phrased as "Herbert's as for CIF1..." + "explicit > default encoding"? > > I do not wish to prolong this debate - if there are objections I will not > launch into an endless round of exchanges > that cover the same ground that has led us this far. > > Cheers > > Simon > > > > > > > ____________________________________________________________________________ > From: SIMON WESTRIP > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Friday, 24 September, 2010 20:10:13 > Subject: Re: [Cif2-encoding] How we wrap this up > > Dear James > > As you may have gathered I have been reconsidering my position on this > issue. > Please forgive me, but I would like to change my vote if that is OK, in > favour of the 'any encoding' camp. > This apparent U-turn is not a response to recent contributions; rather it is > the outcome of a meeting I had this morning > where I demonstrated some new software to the Managing Editor of IUCr > journals. > > By way of explanation: > > I have been developing a new docx template which the IUCr editorial office > is shortly to release for use by > authors. The template will be packaged with some tools to extract data from > CIFs > and tabulate them in the Word document, e.g.
open an mmCIF, click a button, > and standard > tables populated with data from the CIF will be included in the document, > acting as > table templates for the author to edit as appropriate for their manuscript. > > Inclusion of the mmCIF tools is part of an unofficial policy to 'coax' > biologists to start using/accepting mmCIF > as a useful medium, rather than as a product of their deposition to the PDB, > and to encourage them to become comfortable > with passing mmCIFs between applications, and even to edit the things (in > the same way as the core-CIF community > treats CIFs). For example, our perception is that there is no reason why an > author should not feel free to take an mmCIF > that has been created by e.g. pdb_extract and populate it using third-party > software before uploading to the PDB for > deposition. > > This cause would not be furthered by effectively invalidating an mmCIF if it > were not to be encoded in one of > the specified encodings. > > So although I am uneasy about a specification that propagates uncertainty, > I'm also uneasy about alienating users, > especially when we are struggling to change their mindset as in the case of > the biological community > (my perception of the biological community's attitude to mmCIF is based on > feedback from authors/coeditors to > IUCr journals). > > Granted this may not be the most compelling argument in favour of 'any > encoding', but recognizing the hurdles that > may have to be overcome once we move beyond ASCII whatever the CIF2 > specification, I support 'any encoding' > as 'a means to an end'. > > I will not provide my preferences in terms of the numbered options until you > say so; after all, I have already voted and > all this has to be signed off by COMCIFs in any case.
> > Cheers > > Simon > > > > > ____________________________________________________________________________ > From: "Bollinger, John C" > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Friday, 24 September, 2010 14:50:57 > Subject: Re: [Cif2-encoding] How we wrap this up > > Dear Simon, > > It is exactly this sort of issue that drove me to support more permissive > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local proposal. > > Do please think about the considerations Herb raised. As you reconsider > your votes, I urge you also to ask yourself what, *precisely*, a "text file" > is, and to consider whether your answer is functionally different from my > "local". If you decide not, then please consider what that answer implies > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) under > each option on the table, especially for CIFs containing non-ASCII > characters. Whatever you decide about the meaning of "text file", please > consider whether reasonable people might reach a different conclusion, as I > assert they might do, and to what extent the standard needs to address that. > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > >From: cif2-encoding-bounces at iucr.org > [mailto:cif2-encoding-bounces at iucr.org] On Behalf Of SIMON WESTRIP > >Sent: Friday, September 24, 2010 7:53 AM > >To: Group for discussing encoding and content validation schemes for CIF2 > >Subject: Re: [Cif2-encoding] How we wrap this up. . > > > >Dear Herbert > > > >Not for the first time, I find your argument persuasive. Brian's vote and > explanation have also raised some > >questions that I would like to look into. > > > >I will confirm or otherwise my vote as soon as possible, assuming that is > OK with James and assuming that > >this round of votes might wrap this up.
> > > >Cheers > > > >Simon > > > >________________________________________ > >From: Herbert J. Bernstein > >To: Group for discussing encoding and content validation schemes for CIF2 > > >Sent: Friday, 24 September, 2010 13:17:14 > >Subject: Re: [Cif2-encoding] How we wrap this up > > > >If he ignores the standard, in most cases all he has to do to comply with > CIF2 is to run whatever applications he currently runs to produce CIF1 and, > perhaps, in some cases, run a minor edit pass at the end, to convert for the > minor syntactic differences and/or changed tags required to comply with > CIF2 and the new dictionaries, but he is unlikely to have to do anything to > deal with the messy business of whether his encoding is really a proper UTF8 > encoding or not. > > >The punishment if he tries to comply, is that he has to totally uproot and > reconfigure the environment in which he produces CIFs from whatever he is > currently doing to create an environment in which he can reliably create and, > more importantly, transmit compliant UTF8 files. This can be very tricky if > he does only a partial job, say fudging in one special application (yet to > be written), because if he stays with his old system, all kinds of tools > will keep trying to transcode whatever he has produced back to whatever his > system considers a standard. Those of us who have files, applications and > tools that have lived through several generations of macs are living proof > of the problem. Macs now have excellent UTF8/16 unicode support, but every > once in a while in working with a unicode file I find it has been strangely > and unexpectedly converted to something else, and it can be really tricky to > spot when the unaccented roman text part has been left untouched but just a > few accented letters have gotten different accents. > > >Mandating UTF8 is simply trying to shift a serious software problem from > the central handlers of CIF (IUCr, PDB, etc.) to the external users.
Most > users will probably have the good sense to simply ignore the demand and > leave the burden just where it is now. A few sophisticated users will > probably adapt with no trouble, but the punishment for those users who > blindly follow orders before we have a complete multiplatform supporting > infrastructure in place by mandating UTF8 is severe, expensive and > undeserved. Until and unless we have developed solid support, we will just > be alienating people from CIF. I will continue to oppose such a move. > > [...] > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > From simonwestrip at btinternet.com Sat Sep 25 20:34:14 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Sat, 25 Sep 2010 19:34:14 +0000 (GMT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> Message-ID: <262880.46378.qm@web87002.mail.ird.yahoo.com> OK - as promised, I won't pursue the matter :-) ________________________________ From: Herbert J. Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Saturday, 25 September, 2010 19:18:54 Subject: Re: [Cif2-encoding] How we wrap this up Dear Simon, Unfortunately, that is likely to take us back into our infinite loop or into a diverging spiral.
[...]
From yaya at bernstein-plus-sons.com Sat Sep 25 20:37:46 2010 From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Sat, 25 Sep 2010 15:37:46 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <262880.46378.qm@web87002.mail.ird.yahoo.com> References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> Message-ID: Thank you for your cooperation. -- Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > OK - as promised, I won't pursue the matter :-) > > > ____________________________________________________________________________ > From: Herbert J. Bernstein > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Saturday, 25 September, 2010 19:18:54 > Subject: Re: [Cif2-encoding] How we wrap this up > > Dear Simon, > > Unfortunately, that is likely to take us back into our infinite loop or > into a diverging spiral. Right now, we would have UTF8 as no more or less a > default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad first guess as > the likely default encoding for any given CIF, but not a formal constraint. > I would suggest we leave the wording in that imprecise state, get CIF2 out > and accepted and then work further on the encoding issue. > > Regards, >
Herbert [...]
From simonwestrip at btinternet.com Sun Sep 26 21:40:54 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Sun, 26 Sep 2010 13:40:54 -0700 (PDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> Message-ID: <206078.51827.qm@web87010.mail.ird.yahoo.com> Dear all While reviewing my hypothetical 'to do' list for implementing CIF2 in current software, I realized that the issue of current support for elided character codes hasn't really been addressed in the context of CIF2.
My 'to do' list contains notes that software could treat them as keyboard shortcuts, and their use could be defined in the dictionary. However, that was based on a distinct difference between CIF1 and CIF2, while the current arguments for 'as for CIF1...' suggest that the distinction between CIF1 and CIF2 should almost be imperceptible. How is this issue to be addressed in the specification? Cheers Simon [...]
> > > > This cause would not be furthered by effectively invalidating an mmCIF if > it > > were not to be encoded in one of > > the specified encodings. > > > > So although I am uneasy about a specification that propogates uncertainty, > > I'm also uneasy about alienating users, > > especially when we are struggling to change their mindset as in the case > of > > the biological community > > (my perception of the biological community's attitude to mmCIF is based on > > feedback from authors/coeditors to > > IUCr journals). > > > > Granted this may not be the most compelling argument in favour of 'any > > encoding', but recognizing the hurdles that > > may have to be overcome once we move beyond ASCII whatever the CIF2 > > specification, I support 'any encoding' > > as 'a means to an end'. > > > > I will not provide my preferences in terms of the numbered options until > you > > say so; afterall, I have already voted and > > all this has to be signed off by COMCIFs in any case. > > > > Cheers > > > > Simon > > > > > > > > > >___________________________________________________________________________ > _ > > From: "Bollinger, John C" > > To: Group for discussing encoding and content validation schemes for CIF2 > > > > Sent: Friday, 24 September, 2010 14:50:57 > > Subject: Re: [Cif2-encoding] How we wrap this up > > > > Dear Simon, > > > > It is exactly this sort of issue that drove me to support more permissive > > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local > proposal. > > > > Do please think about the considerations Herb raised. As you reconsider > > your votes, I urge you also to ask yourself what, *precisely*, a "text > file" > > is, and to consider whether your answer is functionally different from my > > "local". If you decide not, then please consider what that answer implies > > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) under > > each option on the table, especially for CIFs containing non-ASCII > > characters. 
Whatever you decide about the meaning of "text file", please consider whether reasonable people might reach a different conclusion, as I assert they might do, and to what extent the standard needs to address that.
> >
> > Regards,
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> >
> > >From: cif2-encoding-bounces at iucr.org [mailto:cif2-encoding-bounces at iucr.org] On Behalf Of SIMON WESTRIP
> > >Sent: Friday, September 24, 2010 7:53 AM
> > >To: Group for discussing encoding and content validation schemes for CIF2
> > >Subject: Re: [Cif2-encoding] How we wrap this up. .
> > >
> > >Dear Herbert
> > >
> > >Not for the first time, I find your argument persuasive. Brian's vote and explanation have also raised some questions that I would like to look into.
> > >
> > >I will confirm or otherwise my vote as soon as possible, assuming that is OK with James and assuming that this round of votes might wrap this up.
> > >
> > >Cheers
> > >
> > >Simon
> > >
> > >________________________________________
> > >From: Herbert J. Bernstein
> > >To: Group for discussing encoding and content validation schemes for CIF2
> > >Sent: Friday, 24 September, 2010 13:17:14
> > >Subject: Re: [Cif2-encoding] How we wrap this up
> > >
> > >If he ignores the standard, in most cases all he has to do to comply with CIF2 is to run whatever applications he currently runs to produce CIF1 and, perhaps, in some cases, run a minor edit pass at the end, to convert for the minor syntactic differences and/or changed tags required to comply with CIF2 and the new dictionaries, but he is unlikely to have to do anything to deal with the messy business of whether his encoding is really a proper UTF8 encoding or not.
> > >
> > >The punishment, if he tries to comply, is that he has to totally uproot and reconfigure the environment in which he produces CIFs from whatever he is currently doing to create an environment in which he can reliably create and, more importantly, transmit compliant UTF8 files. This can be very tricky if he does only a partial job, say fudging in one special application (yet to be written), because if he stays with his old system, all kinds of tools will keep trying to transcode whatever he has produced back to whatever his system considers a standard. Those of us who have files, applications and tools that have lived through several generations of macs are living proof of the problem. Macs now have excellent UTF8/16 unicode support, but every once in a while in working with a unicode file I find it has been strangely and unexpectedly converted to something else, and it can be really tricky to spot when the unaccented roman text part has been left untouched but just a few accented letters have gotten different accents.
> > >
> > >Mandating UTF8 is simply trying to shift a serious software problem from the central handlers of CIF (IUCr, PDB, etc.) to the external users. Most users will probably have the good sense to simply ignore the demand and leave the burden just where it is now. A few sophisticated users will probably adapt with no trouble, but the punishment for those users who blindly follow orders before we have a complete multiplatform supporting infrastructure in place by mandating UTF8 is severe, expensive and undeserved. Until and unless we have developed solid support, we will just be alienating people from CIF. I will continue to oppose such a move.
> >
> > [...]

From yaya at bernstein-plus-sons.com Sun Sep 26 23:14:55 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Sun, 26 Sep 2010 18:14:55 -0400
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <206078.51827.qm@web87010.mail.ird.yahoo.com>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com>
Message-ID:

Dear Simon,

The current CIF2 spec, with or without the changes I have suggested to temporarily resolve the encoding issue, is at best vague and confusing on the elide character issue. The interacting issue on which the CIF2 spec is clear is that we are changing the handling of quoted strings so that they end on the first occurrence of the quoting character, leaving the handling of elides to the calling application. This will be a problem -- the change from CIF1 in the termination of quoted strings, along with the absence of a way of eliding the quotes, will invalidate a significant number of existing CIFs without any simple mechanism to recover.
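The quoted-string problem described above can be sketched in a few lines of Python. This is only an illustration added alongside the message, not text from either specification: it assumes the CIF1 convention that a closing quote terminates the string only when it is followed by white space or the end of input, and contrasts that with the "first occurrence" rule. The function names and the sample value are hypothetical.

```python
WHITESPACE = " \t\r\n"


def read_quoted_cif1(text):
    """Read a quoted value starting at text[0] using the CIF1-style rule:
    the closing quote must be followed by white space or end of input."""
    quote = text[0]
    for i in range(1, len(text)):
        at_end = i + 1 == len(text)
        if text[i] == quote and (at_end or text[i + 1] in WHITESPACE):
            return text[1:i]
    raise ValueError("unterminated quoted string")


def read_quoted_first_occurrence(text):
    """Read a quoted value starting at text[0], ending at the first
    occurrence of the quoting character. Returns (value, leftover)."""
    quote = text[0]
    end = text.find(quote, 1)
    if end == -1:
        raise ValueError("unterminated quoted string")
    return text[1:end], text[end + 1:]


sample = "'a dog's life'"
print(read_quoted_cif1(sample))              # a dog's life
print(read_quoted_first_occurrence(sample))  # ('a dog', "s life'")
```

Under the first-occurrence rule the same bytes no longer yield a single value, which is the kind of existing file that would need some recovery mechanism.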
Rather than reopen another endless discussion, I would suggest we simply add the python string concatenation character "+" to ensure we can map all current CIF1 files and use Brian's common semantic features for the moment. We can then deal with the full elides discussion at a future date.

Regards,
Herbert

At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote:
>[...]

--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From jamesrhester at gmail.com Mon Sep 27 00:46:49 2010
From: jamesrhester at gmail.com (James Hester)
Date: Mon, 27 Sep 2010 09:46:49 +1000
Subject: [Cif2-encoding] Let's all take a deep breath...
Message-ID:

Well, I didn't even manage to properly call a vote and everybody has piled in; Simon even managed to vote twice (and that's quite OK Simon, we are trying to determine what the will of the group is and so I think it only reasonable that if somebody's assessment of the situation changes they can 'update' their vote). I am however unhappy that both Brian and Simon introduced new concerns and nobody has had a chance to comment on how the various proposals under consideration might affect those concerns. I would therefore like to suggest that the voting period continues until the end of this week, and that we all endeavour to express any concerns or comments that we need to make in a timely fashion. I will be commenting on Brian and Simon's concerns presently, and also on Herbert's proposal, which I have not yet subjected to my hopefully not too long-winded scrutiny.

None of us should feel steamrolled by a certain artificial urgency that has appeared in the dialogue - while we do need to wrap things up in a timely fashion, it has only been 4 days since I even started discussing the vote.

Some initial general comments (I will comment separately on Brian and Simon's issues).

(i) We are *not* in an infinite loop. The last few months have seen several proposals analysed and explored, and it is my perception that these discussions have led at least some participants (including myself) to a better understanding of the consequences of what they are proposing. So nobody should feel that throwing out a new criticism of an old or new proposal is somehow hindering progress by looping over old ground. Quite the reverse, it is making progress. What *is* important is to get your comments into the mix in a timely fashion, because time is indeed short.

(ii) It is not correct to assume that we can figure out the encoding issues later. Maybe we can, but maybe we can't.
Once CIF2 files are produced and software is distributed, you can't put the genie back in the bottle, by which I mean you can't easily change the way that distributed software behaves, and how files are interpreted. We therefore have to be confident that the standard we promulgate does not close off an avenue we need for solving encoding issues.

(iii) It is extremely misleading to think that simply substituting UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the same results as we had for CIF1. The 'any encoding' clause in the CIF1 standard was essentially irrelevant - encodings used in the overwhelming majority of systems producing CIF1 files coincided with ASCII for CIF text, as I have said many times before, so software had no trouble in turning a stream of CIF bytes from any unknown source into the same text that the CIF writer was working from. If I repeat this point endlessly, it is only because the CIF1 approach continues to be invoked like magic fairy dust that will make everything OK, when in fact the magic fairy dust was the dominance of ASCII encoding for ASCII codepoints. There is *no such uniformity* in encoding of Unicode codepoints. We have a new problem for CIF, and whatever we do will have *new* consequences, and that very much includes the 'as for CIF1' proposal. So please, enough with the 'CIF1 has served us well for 15 years' line.

(iv) The majority are currently in favour of the 'as for CIF1' approach, which, if nobody changes their vote by the end of the week, is what we will be taking to the DDLm group and COMCIFS. This means we will have a pure text standard, and I mean really pure, because there is no predictable link between this beautiful textual castle in the sky and the solid ground of bytes on disk.

I am a cross-platform CIF programmer. Looking forward to the halcyon 'as for CIF1' days that await us, a small question occupies my mind.
As my program does not operate in that glorious abstract space occupied by pure text standards that are most certainly not anybody's laughing stock, my program will be forced to (as briefly as possible) deal with humble plebeian bytes according to some encoding to obtain the exalted CIF text. Under the 'as for CIF1' proposal, how does my program turn these bytes into text in the way that the writer of the bytes intended? If that is not yet resolved, how can anybody even write a CIF2 program?

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From yaya at bernstein-plus-sons.com Mon Sep 27 02:36:58 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Sun, 26 Sep 2010 21:36:58 -0400
Subject: [Cif2-encoding] Let's all take a deep breath...
In-Reply-To:
References:
Message-ID:

Dear Colleagues,

As one might expect, I respectfully disagree with almost everything James has said, but the really critical point of disagreement is

>(iii) It is extremely misleading to think that simply substituting
>UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the
>same results as we had for CIF1. The 'any encoding' clause in the
>CIF1 standard was essentially irrelevant - encodings used in the
>overwhelming majority of systems producing CIF1 files coincided with
>ASCII for CIF text, as I have said many times before, so software had
>no trouble in turning a stream of CIF bytes from any unknown source
>into the same text that the CIF writer was working from. If I repeat
>this point endlessly, it is only because the CIF1 approach continues
>to be invoked like magic fairy dust that will make everything OK, when
>in fact the magic fairy dust was the dominance of ASCII encoding for
>ASCII codepoints. There is *no such uniformity* in encoding of
>Unicode codepoints. We have a new problem for CIF, and whatever we do
>will have *new* consequences, and that very much includes the 'as for
>CIF1' proposal.
>So please, enough with the 'CIF1 has served us well
>for 15 years' line.

I vigorously disagree on this point. If the only change we were to make in going to CIF2 were to be that we were inserting UTF8 in place of ASCII, there would be absolutely _no_ impact on any existing CIF application or CIF data file, because for the characters that are formally legal under CIF1, UTF8 and ASCII are identical encodings.

The relevant portion of the CIF1 syntax specification is:

"22. Characters within a CIF are restricted to certain printable or white-space characters. Specifically, these are the ones located in the ASCII character set at decimal positions 09 (HT or horizontal tab), 10 (LF or line feed), 13 (CR or carriage return) and the letters, numerals and punctuation marks at positions 32-126."

Any existing data file or application that conforms to that restriction in _any_ encoding will be indistinguishable with the "UTF8 in place of ASCII" change. For those applications and data files, this is not a change.

As James implies, the only new problems that arise are in introducing characters into CIFs drawn from codepoints 128 ff, but we already have that problem under CIF1. The use of "UTF8 in place of ASCII" simply allows us to coherently consider how to handle those characters in the future. If we don't use them in any IUCr-sanctioned dictionary tags for the moment, we are in no worse shape going forward under my proposal with Brian's recommendations added than we are staying with CIF1 and ASCII, and, I believe, in much better shape.

This is a serious matter, not appropriate for sarcastic "fairy dust" comments. It really is true that "CIF1 has served us well for 15 years," and we should take our time on the encoding issue and be certain we are really improving things, not making them worse by what we propose.

I agree that we need to discuss and resolve the encoding issue, but it is not a new problem suddenly introduced by using UTF8.
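Both sides of this exchange turn on byte-level facts that are easy to check. The following minimal Python sketch is an illustration only, not part of the original message: it builds the character set from paragraph 22 as quoted above and compares its encoded bytes across a handful of encodings. The helper names and the particular list of encodings are arbitrary choices made for the example.

```python
# Code points permitted by paragraph 22 of the CIF1 syntax specification,
# as quoted in the message above.
CIF1_CODEPOINTS = [9, 10, 13] + list(range(32, 127))
CIF1_TEXT = "".join(chr(cp) for cp in CIF1_CODEPOINTS)


def encoded_forms(text, encodings):
    """Map each encoding name to the bytes it produces for `text`."""
    return {name: text.encode(name) for name in encodings}


def encodes_identically(text, encodings):
    """True if every listed encoding yields exactly the same bytes."""
    return len(set(encoded_forms(text, encodings).values())) == 1


# An arbitrary selection of ASCII-compatible encodings (an assumption of
# this example, not a list taken from the discussion).
ASCII_COMPATIBLE = ["ascii", "utf-8", "latin-1", "cp1252", "mac_roman"]

# CIF1-legal text: the bytes are the same whichever of these is used.
print(encodes_identically(CIF1_TEXT, ASCII_COMPATIBLE))   # True

# One accented letter: the bytes now depend on the encoding chosen.
accented = "\u00e9"  # e with acute accent
print(encodes_identically(accented, ["utf-8", "latin-1", "mac_roman"]))  # False
for name, data in encoded_forms(accented, ["utf-8", "latin-1", "mac_roman"]).items():
    print(name, data)
```

The first result corresponds to the claim that nothing changes for CIF1-legal characters; the second shows where a reader of unknown bytes has to know, or guess, the encoding.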
In my opinion, however, a hasty, ill-considered resolution to that serious problem would be a very bad idea, but delaying all of CIF2 in order to wait until we work our way out of a thicket that has no clear exit yet also seems to me to be a very bad idea. I expect if we ever manage to meet face to face, or even in a series of Skype meetings, we could come to closure fairly quickly, but as things now stand it seems unlikely that we will have a chance to do that before the IUCr meeting.

I use CIF as a text format and I use it as a binary format. I use both DDL1 CIF and DDL2 CIF. I am also a cross-platform and cross-version CIF programmer. I do not fool myself into thinking CIF1 to be perfect. It is not. But it is a very useful tool, and I would like to be certain that what we propose as CIF2 is at least as useful as CIF1 and hopefully more so. I do not believe that options 3, 4 or 5 are far enough along to provide such utility, and, by being too prescriptive at this stage, they may well do harm.

I urge all concerned to support either options 1 or 2 or both, so we can get CIF2 out for the IUCr meeting this coming summer, and to let the encoding issue take its own time. If by some chance we come up with a solution before summer 2011, so much the better, but please don't make the perfect (CIF2 with all issues including the encoding issue resolved) the enemy of the good (the CIF2 we have now with the encoding issue left open).

Regards,
Herbert

At 9:46 AM +1000 9/27/10, James Hester wrote:
>Well, I didn't even manage to properly call a vote and everybody has
>piled in, Simon even managed to vote twice (and that's quite OK Simon,
>we are trying to determine what the will of the group is and so I
>think it only reasonable that if somebody's assessment of the
>situation changes that they can 'update' their vote).
I am however
>unhappy that both Brian and Simon introduced new concerns and nobody
>has had a chance to comment on how the various proposals under
>consideration might affect those concerns. I would therefore like to
>suggest that the voting period continues until the end of this week,
>and that we all endeavour to express any concerns or comments that we
>need to make in a timely fashion. I will be commenting on Brian and
>Simon's concerns presently, and also on Herbert's proposal, which I
>have not subjected to my hopefully not too long-winded scrutiny.
>
>None of us should feel steamrolled by a certain artificial urgency that
>has appeared in the dialogue - while we do need to wrap things up in a
>timely fashion, it has only been 4 days since I even started
>discussing the vote.
>
>Some initial general comments (I will comment separately on Brian and
>Simon's issues).
>
>(i) We are *not* in an infinite loop. The last few months have seen
>several proposals analysed and explored, and it is my perception that
>these discussions have led at least some participants (including
>myself) to a better understanding of the consequences of what they are
>proposing. So nobody should feel that throwing out a new criticism of
>an old or new proposal is somehow hindering progress by looping over
>old ground. Quite the reverse, it is making progress. What *is*
>important is to get your comments into the mix in a timely fashion,
>because time is indeed short.
>
>(ii) It is not correct to assume that we can figure out the encoding
>issues later. Maybe we can, but maybe we can't. Once CIF2 files are
>produced and software is distributed, you can't put the genie back in
>the bottle, by which I mean you can't easily change the way that
>distributed software behaves, and how files are interpreted. We have
>to therefore be confident that the standard we promulgate does not
>close off an avenue we need for solving encoding issues.
>
>(iii) It is extremely misleading to think that simply substituting
>UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the
>same results as we had for CIF1. The 'any encoding' clause in the
>CIF1 standard was essentially irrelevant - encodings used in the
>overwhelming majority of systems producing CIF1 files coincided with
>ASCII for CIF text, as I have said many times before, so software had
>no trouble in turning a stream of CIF bytes from any unknown source
>into the same text that the CIF writer was working from. If I repeat
>this point endlessly, it is only because the CIF1 approach continues
>to be invoked like magic fairy dust that will make everything OK, when
>in fact the magic fairy dust was the dominance of ASCII encoding for
>ASCII codepoints. There is *no such uniformity* in encoding of
>Unicode codepoints. We have a new problem for CIF, and whatever we do
>will have *new* consequences, and that very much includes the 'as for
>CIF1' proposal. So please, enough with the 'CIF1 has served us well
>for 15 years' line.
>
>(iv) The majority are currently in favour of the 'as for CIF1'
>approach, which if nobody changes their vote by the end of the week,
>is what we will be taking to the DDLm group and COMCIFS. This means
>we will have a pure text standard, and I mean really pure, because
>there is no predictable link between this beautiful textual castle in
>the sky and the solid ground of bytes on disk.
>
>I am a cross-platform CIF programmer. Looking forward to the halcyon
>'as for CIF1' days that await us, a small question occupies my mind.
>As my program does not operate in that glorious abstract space
>occupied by pure text standards that are most certainly not anybody's
>laughing stock, my program will be forced to (as briefly as possible)
>deal with humble plebeian bytes according to some encoding to obtain
>the exalted CIF text.
Under the 'as for CIF1' proposal, how does my >program turn these bytes into text in the way that the writer of the >bytes intended? If that is not yet resolved, how can anybody even >write a CIF2 program? > >-- >T +61 (02) 9717 9907 >F +61 (02) 9717 3145 >M +61 (04) 0249 4148 >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From simonwestrip at btinternet.com Mon Sep 27 10:34:04 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Mon, 27 Sep 2010 09:34:04 +0000 (GMT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local > <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yah oo.com> Message-ID: <476110.27334.qm@web87005.mail.ird.yahoo.com> I was not so concerned about invalidating existing CIFs, or even the likelihood that users will continue to write e.g. 'f\'oo' - this is a syntax error in CIF2 that is readily recoverable. Rather there is a large group of CIF1 users that are in the habit of using elided ASCII sequences to represent non-ASCII characters. 
With CIF2 these users will be able to use the Unicode character itself. So we might end up with a mixture of escaped sequences and Unicode characters (e.g. a user may have a keyboard shortcut for an accented character that forms part of their name, but might still resort to \a for alpha, under the assumption that \a is still valid because CIF2 is basically the same as CIF1, and, rightly or wrongly, they perceive the eliding mechanism as part of CIF syntax).

I think this is an issue where we can't afford to take an 'as for CIF1...' approach, especially as the CIF1 specification isn't entirely satisfactory (e.g. there's an example in the line-folding protocol that uses elides in a file path to make a point, but actually these elides may easily be interpreted as escape sequences), and as the encoding issue is very much concerned with user practice, the large group of users that currently use elided character codes need to be aware of what the situation is in CIF2.

I'm not convinced this issue should be left for discussion later; it is relevant when considering how the move beyond ASCII is specified.

Cheers

Simon

________________________________
From: Herbert J. Bernstein
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Sunday, 26 September, 2010 23:14:55
Subject: Re: [Cif2-encoding] How we wrap this up

Dear Simon,

The current CIF2 spec, with or without the changes I have suggested to temporarily resolve the encoding issue, is at best vague and confusing on the elide character issue. The interacting issue on which the CIF2 spec is clear is that we are changing the handling of quoted strings so that they end on the first occurrence of the quoting character, leaving the handling of elides to the calling application.
This will be a problem -- the change from CIF1 in the termination of quoted strings along with the absence of a way of eliding the quotes will invalidate a significant number of existing CIFS without any simple mechanism to recover. Rather than reopen another endless discussion, I would suggest we simply add the python string concatenation character "+" to ensure we can map all current CIF1 files and use Brian's common semantic features for the moment. We can then deal with the full elides discussion at a future date. Regards, Herbert At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote: >Dear all > >While reviewing my hypothetical 'to do' list for implementing CIF2 >in current software, I realized that >the issue of current support for elided character codes hasnt really >been addressed in the context of CIF2. >My 'to do' list contains notes that software could treat them as >keyboard shortcuts, and their use could be >defined in the dictionary. However, that was based on a distinct >difference between CIF1 and CIF2, >while the current arguments for 'as for CIF1...' suggest that the >distinction between CIF1 and CIF2 >should almost be imperceptible. > >How is this issue to be addressed in the specification? > >Cheers > >Simon > > > >From: Herbert J. Bernstein >To: Group for discussing encoding and content validation schemes for >CIF2 >Sent: Saturday, 25 September, 2010 20:37:46 >Subject: Re: [Cif2-encoding] How we wrap this up > >Thank you for your cooperation. -- Herbert > >===================================================== >Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu >===================================================== > >On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > >> OK - as promised, I wont pursue the matter :-) >> >> >> ____________________________________________________________________________ >> From: Herbert J. 
Bernstein >><yaya at bernstein-plus-sons.com> >> To: Group for discussing encoding and content validation schemes for CIF2 >> <cif2-encoding at iucr.org> >> Sent: Saturday, 25 September, 2010 19:18:54 >> Subject: Re: [Cif2-encoding] How we wrap this up >> >> Dear Simon, >> >> Unfortunately, that is likely to take us back into our infinite loop or >> into a diverging spiral. Right now, we would have UTF8 as no more or less a >> default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad first guess as >> the likely default encoding for any given CIF, but not a formal constraint. >> I would suggest we leave the wording in that imprecise state, get CIF2 out >> and accepted and then work further on the encoding issue. >> >> Regards, >> Herbert >> >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> >> On Sat, 25 Sep 2010, SIMON WESTRIP wrote: >> >> > Dear all >> > >> > In the event that CIF2 adopts the 'any encoding' approach, would there be > > > any objections to >> > explicitly defining a default encoding in the specification, to be >> defaulted >> > to when there were no indications >> > to the contrary. At worst this would give CIF2 service providers an excuse >> > to interpret CIFs as e.g. UTF8 if they couldnt >> > determine the encoding by other means - but such intollerant service >> > providers would soon find that their service is >> > not successful - while at best this might raise awareness of the issues >> > regarding encoding once non-ASCII is used in >> > a CIF. Essentially, it does not require users to change there working >> > practices, which is one of the main arguments for >> > 'any encoding'. >> > >> > So, CIF2 would remain 'any encoding', and specifications in terms of e.g. 
>> > "Herbert's as for CIF1..." >> > might only require a single sentence to define the default after stating >> > what the 'preferred' encoding was; >> > the proposal might be phrased as "Herbert's as for CIF1..." + "explicit >> > default encoding"? >> > >> > I do not wish to prolong this debate - if there are objections I will not >> > launch into an endless round of exchanges >> > that cover the same ground that has led us this far. >> > >> > Cheers >> > >> > Simon >> > >> > >> > >> > >> > >> > >> >___________________________________________________________________________ >> _ >> > From: SIMON WESTRIP >><simonwestrip at btinternet.com> >> > To: Group for discussing encoding and content validation schemes for CIF2 >> > <cif2-encoding at iucr.org> >> > Sent: Friday, 24 September, 2010 20:10:13 >> > Subject: Re: [Cif2-encoding] How we wrap this up >> > >> > Dear James >> > >> > As you may have gathered I have been reconsidering my position on this >> > issue. >> > Please forgive me, but I would like to change my vote if that is OK, in >> > favour of the 'any encoding' camp. >> > This apparent U-turn is not a response to recent contributions; rather it >> is >> > the outcome of a meeting I had this morning >> > where I demonstrated some new software to the Managing Editor of IUCr >> > journals. >> > >> > By way of explanation: >> > >> > I have been developing a new docx template which the IUCr editorial office >> > is shortly to release for use by >> > authors. The template will be packaged with some tools to extract data >> from >> > CIFs >> > and tabulate them in the Word document, e.g. open an mmCIF, click a >> button, >> > and standard >> > tables populated with data from the CIF will be included in the document, >> > acting as >> > table templates for the author to edit as appropriate for their >> manuscript. 
>> > >> > Inclusion of the mmCIF tools is part of an unofficial policy to 'coax' >> > biologists to start using/accepting mmCIF >> > as a useful medium, rather than as a product of their deposition to the >> PDB, >> > and to encourage them to become comfortable >> > with passing mmCIFs between applications, and even to edit the things (in >> > the same way as the core-CIF community >> > treats CIFs). For example, our perception is that there is no reason why >> an >> > author should not feel free to take an mmCIF >> > that has been created by e.g. pdb_extract and populate it using >> third-party >> > software before uploading to the PDB for >> > deposition. >> > >> > This cause would not be furthered by effectively invalidating an mmCIF if >> it >> > were not to be encoded in one of >> > the specified encodings. >> > >> > So although I am uneasy about a specification that propogates uncertainty, >> > I'm also uneasy about alienating users, >> > especially when we are struggling to change their mindset as in the case >> of >> > the biological community >> > (my perception of the biological community's attitude to mmCIF is based on >> > feedback from authors/coeditors to >> > IUCr journals). >> > >> > Granted this may not be the most compelling argument in favour of 'any >> > encoding', but recognizing the hurdles that >> > may have to be overcome once we move beyond ASCII whatever the CIF2 >> > specification, I support 'any encoding' >> > as 'a means to an end'. >> > >> > I will not provide my preferences in terms of the numbered options until > > you >> > say so; afterall, I have already voted and >> > all this has to be signed off by COMCIFs in any case. 
>> > >> > Cheers >> > >> > Simon >> > >> > >> > >> > >> >___________________________________________________________________________ >> _ >> > From: "Bollinger, John C" >><John.Bollinger at STJUDE.ORG> >> > To: Group for discussing encoding and content validation schemes for CIF2 >> > <cif2-encoding at iucr.org> >> > Sent: Friday, 24 September, 2010 14:50:57 >> > Subject: Re: [Cif2-encoding] How we wrap this up >> > >> > Dear Simon, >> > >> > It is exactly this sort of issue that drove me to support more permissive >> > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local >> proposal. >> > >> > Do please think about the considerations Herb raised. As you reconsider >> > your votes, I urge you also to ask yourself what, *precisely*, a "text >> file" >> > is, and to consider whether your answer is functionally different from my >> > "local". If you decide not, then please consider what that answer implies >> > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) under >> > each option on the table, especially for CIFs containing non-ASCII >> > characters. Whatever you decide about the meaning of "text file", please >> > consider whether reasonable people might reach a different conclusion, as >> I >> > assert they might do, and to what extent the standard needs to address >> that. >> > >> > >> > Regards, >> > >> > John >> > -- >> > John C. Bollinger, Ph.D. >> > Department of Structural Biology >> > St. Jude Children's Research Hospital >> > >> > >> > >From: >>cif2-encoding-bounces at iucr.org >> > >>[mailto:cif2-encoding-bounces at iucr.org] >>On Behalf Of SIMON WESTRIP >> > >Sent: Friday, September 24, 2010 7:53 AM >> > >To: Group for discussing encoding and content validation schemes for CIF2 >> > >Subject: Re: [Cif2-encoding] How we wrap this up. . >> > > >> > >Dear Herbert >> > > >> > >Not for the first time, I find your arguement persuasive. 
Brian's vote >> and >> > explanation have also raised some >> > >questions that I would like to look into. >> > > >> > >I will confirm or otherwise my vote as soon as possible, assuming that is >> > OK with James and assuming that >> > >this round of votes might wrap this up. >> > > >> > >Cheers >> > > >> > >Simon >> > > >> > >________________________________________ >> > >From: Herbert J. Bernstein >><yaya at bernstein-plus-sons.com> >> > >To: Group for discussing encoding and content validation schemes for CIF2 >> > <cif2-encoding at iucr.org> >> > >Sent: Friday, 24 September, 2010 13:17:14 >> > >Subject: Re: [Cif2-encoding] How we wrap this up >> > > >> > >If he ignores the standard, in most cases all he has to do to comply with >> > CIF2 is to run whatever applications he currently runs to produce CIF1 >> and, >> > perhaps, in some cases, run a minor edit pass at the end, to convert for >> the >> > minor syntactive differences and/or changed tags required to comply with >> > CIF2 and the new dictionaries, but he is unlikely to have to do anything >> to >> > deal with the messy business of whether his encoding is really a proper >> UTF8 >> > encoding or not. >> > >> > >The punishment if he tries to comply, is that he has to totally uproot >> and >> > reconfigure the environment in which he produces CIFs from whatever he is >> > currently doing to create an enviroment in which he can reliably create >> and, >> > more importantly, transmit compliant UTF8 files. This can be very tricky >> if >> > he does only a partial job, say fudging in one special application (yet to >> > be written), because if he stays with his old system, all kinds of tools >> > will keep trying to transcode whatever he has produced back to whatever >> his >> > system considers a standard. Those of us who have files, applications and >> > tools that have lived through several generations of macs are living proof >> > of the problem. 
Macs now have excellent UTF8/16 unicode support, but every > > > once in a while in working with a unicode file I find it has been >> strangely >> > and unexpectedly converted to something else, and it can be really tricky >> to >> > spot when the unaccented roman text part has been left untouched but just >> a >> > few accen >> > ted letters have gotten different accents. >> > >> > >Mandating UTF8 is simply trying to shift a serious software problem from >> > the central handlers of CIF (IUCr, PDB, etc.) to the external users. Most >> > users will probably have the good sense to simply ignore the demand and >> > leave the burden just where it is now. A few sophisticated users will >> > probably adapt with no trouble, but the punishment for those users who >> > blindly follow orders before we have a complete multiplatform supporting >> > infrastructure in place by mandating UTF8 is severe, expensive and >> > undeserved. Until and unless we have developed solid support, we will >> just >> > be alienating people from CIF. I will continue to oppose such a move. >> > >> > [...] >> > >> > >> > Email Disclaimer: >>www.stjude.org/emaildisclaimer >> > _______________________________________________ >> > cif2-encoding mailing list >> > cif2-encoding at iucr.org >> > >>http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> > >> > >> >> > >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. 
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================
_______________________________________________
cif2-encoding mailing list
cif2-encoding at iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding

From simonwestrip at btinternet.com Mon Sep 27 10:49:54 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Mon, 27 Sep 2010 09:49:54 +0000 (GMT)
Subject: [Cif2-encoding] Let's all take a deep breath...
In-Reply-To:
References:
Message-ID: <771212.46051.qm@web87016.mail.ird.yahoo.com>

Dear James

I welcome your call 'to take a breath'. Although I now favour the 'any encoding' compromise, I've been struggling to accept the 'As for CIF1...' descriptions of 'any encoding' - to the extent that I would not be able to complete the process and order the five proposals according to preference.

Cheers

Simon

________________________________
From: James Hester
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Monday, 27 September, 2010 0:46:49
Subject: [Cif2-encoding] Let's all take a deep breath...

Well, I didn't even manage to properly call a vote and everybody has piled in, Simon even managed to vote twice (and that's quite OK Simon, we are trying to determine what the will of the group is and so I think it only reasonable that if somebody's assessment of the situation changes that they can 'update' their vote). I am however unhappy that both Brian and Simon introduced new concerns and nobody has had a chance to comment on how the various proposals under consideration might affect those concerns.
I would therefore like to suggest that the voting period continues until the end of this week, and that we all endeavour to express any concerns or comments that we need to make in a timely fashion. I will be commenting on Brian and Simon's concerns presently, and also on Herbert's proposal, which I have not subjected to my hopefully not too long-winded scrutiny.

None of us should feel steamrolled by a certain artificial urgency that has appeared in the dialogue - while we do need to wrap things up in a timely fashion, it has only been 4 days since I even started discussing the vote.

Some initial general comments (I will comment separately on Brian and Simon's issues).

(i) We are *not* in an infinite loop. The last few months have seen several proposals analysed and explored, and it is my perception that these discussions have led at least some participants (including myself) to a better understanding of the consequences of what they are proposing. So nobody should feel that throwing out a new criticism of an old or new proposal is somehow hindering progress by looping over old ground. Quite the reverse, it is making progress. What *is* important is to get your comments into the mix in a timely fashion, because time is indeed short.

(ii) It is not correct to assume that we can figure out the encoding issues later. Maybe we can, but maybe we can't. Once CIF2 files are produced and software is distributed, you can't put the genie back in the bottle, by which I mean you can't easily change the way that distributed software behaves, and how files are interpreted. We have to therefore be confident that the standard we promulgate does not close off an avenue we need for solving encoding issues.

(iii) It is extremely misleading to think that simply substituting UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the same results as we had for CIF1.
The 'any encoding' clause in the CIF1 standard was essentially irrelevant - encodings used in the overwhelming majority of systems producing CIF1 files coincided with ASCII for CIF text, as I have said many times before, so software had no trouble in turning a stream of CIF bytes from any unknown source into the same text that the CIF writer was working from. If I repeat this point endlessly, it is only because the CIF1 approach continues to be invoked like magic fairy dust that will make everything OK, when in fact the magic fairy dust was the dominance of ASCII encoding for ASCII codepoints. There is *no such uniformity* in encoding of Unicode codepoints. We have a new problem for CIF, and whatever we do will have *new* consequences, and that very much includes the 'as for CIF1' proposal. So please, enough with the 'CIF1 has served us well for 15 years' line.

(iv) The majority are currently in favour of the 'as for CIF1' approach, which if nobody changes their vote by the end of the week, is what we will be taking to the DDLm group and COMCIFS. This means we will have a pure text standard, and I mean really pure, because there is no predictable link between this beautiful textual castle in the sky and the solid ground of bytes on disk.

I am a cross-platform CIF programmer. Looking forward to the halcyon 'as for CIF1' days that await us, a small question occupies my mind. As my program does not operate in that glorious abstract space occupied by pure text standards that are most certainly not anybody's laughing stock, my program will be forced to (as briefly as possible) deal with humble plebeian bytes according to some encoding to obtain the exalted CIF text. Under the 'as for CIF1' proposal, how does my program turn these bytes into text in the way that the writer of the bytes intended? If that is not yet resolved, how can anybody even write a CIF2 program?
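
The bytes-to-text question raised above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not part of any proposal in this thread: the sample data item and the helper decode_strict_utf8 are hypothetical, and Latin-1 merely stands in for "some local encoding". It shows that the CIF1 character repertoire yields identical bytes in ASCII, UTF-8 and Latin-1, that a non-ASCII character does not, and that a wrong guess on reading can go undetected.

```python
# CIF1 restricts characters to ASCII positions 9, 10, 13 and 32-126.
CIF1_CODEPOINTS = [9, 10, 13] + list(range(32, 127))
cif1_text = "".join(chr(cp) for cp in CIF1_CODEPOINTS)

# For that repertoire, ASCII, UTF-8 and Latin-1 produce the same bytes.
same_bytes = (
    cif1_text.encode("ascii")
    == cif1_text.encode("utf-8")
    == cif1_text.encode("latin-1")
)

# Hypothetical data item containing one non-ASCII character (u-umlaut).
cif2_text = "_publ_author_name 'M\u00fcller'"
utf8_bytes = cif2_text.encode("utf-8")
latin1_bytes = cif2_text.encode("latin-1")

# A reader that guesses Latin-1 for UTF-8 bytes gets no error,
# only different text from what the writer intended.
misread = utf8_bytes.decode("latin-1")


def decode_strict_utf8(data: bytes):
    """Return the decoded text if data is valid UTF-8, otherwise None."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(same_bytes)                        # identical bytes for CIF1 characters
print(utf8_bytes != latin1_bytes)        # different bytes beyond ASCII
print(misread != cif2_text)              # silent corruption on a wrong guess
print(decode_strict_utf8(latin1_bytes))  # strict UTF-8 rejects the Latin-1 bytes
```

The strict decoder detects the mismatch only in one direction: Latin-1 bytes are rejected as UTF-8, but UTF-8 bytes are always accepted as Latin-1, which is why a reader cannot in general recover the writer's intent from the bytes alone.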
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
cif2-encoding mailing list
cif2-encoding at iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding

From yaya at bernstein-plus-sons.com Mon Sep 27 11:48:49 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Mon, 27 Sep 2010 06:48:49 -0400
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <476110.27334.qm@web87005.mail.ird.yahoo.com>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local>
 <80062.82001.qm@web87012.mail.ird.yahoo.com>
 <162941.37460.qm@web87004.mail.ird.yahoo.com>
 <780727.99055.qm@web87010.mail.ird.yahoo.com>
 <526633.3484.qm@web87004.mail.ird.yahoo.com>
 <613218.81205.qm@web87011.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local>
 <281388.90819.qm@web87012.mail.ird.yahoo.com>
 <463665.7127.qm@web87004.mail.ird.yahoo.com>
 <262880.46378.qm@web87002.mail.ird.yahoo.com>
 <206078.51827.qm@web87010.mail.ird.yahoo.com>
 <476110.27334.qm@web87005.mail.ird.yahoo.com>
Message-ID:

Dear Simon,

Under the CIF2 specification with UTF8 in place of ASCII there is _no_ change in the use of elided ASCII sequences to represent non-ASCII characters until and unless the IUCr publications office decides that, for that particular application, they are ready to accept something new. It is _only_ if you go forward with options 3, 4 or 5 that you are giving the green light to users to do precisely what you are concerned about -- using the Unicode characters instead, in possibly strange admixtures that nobody is ready to process.
Remember, under the CIF2 specification as now written, it is _not_ part of the CIF2 specification to determine the handling of the characters in quoted strings other than to ensure that those strings do not contain illegal characters from the point of view of CIF2. Dealing with the validity of particular character sequences in strings users provide is, just as in CIF1, the responsibility of the application (i.e. the IUCr journal flows or the PDB archiving flows).

My apologies to James, who I know is trying to do what he believes to be right, but I believe James has things backwards -- the "deep breath" is provided by my proposal -- taking the time to properly engineer the use of the extra characters UTF8 allows us to discuss clearly, while James' push for an immediate prescriptive use of UTF8 with prescriptions that differ drastically from what has been adopted by all other frameworks (HTML, XML, Python, etc.) in ways that are untested and unsupported by most existing software is the untimely rush to judgement.

I beg you to support options 1 and/or 2 to allow CIF2 to go forward in all other respects while we all take a deep breath and deal with the tricky issue you raised slowly and carefully without the pressure of trying to have CIF2 itself ready for next summer.

Regards,
Herbert

At 9:34 AM +0000 9/27/10, SIMON WESTRIP wrote:
>I was not so concerned about invalidating existing CIFs, or even the
>likelihood
>that users will continue to write e.g. 'f\'oo' - this is a syntax
>error in CIF2 that is readily recoverable.
>
>Rather there is a large group of CIF1 users that are in the habit of
>using elided ASCII sequences to
>represent non-ASCII characters. With CIF2 these users will be able
>to use the Unicode character itself.
>So we might end up with a mixture of escaped sequences and Unicode
>characters (e.g.
a user may have a keyboard shortcut
>for an accented character that forms part of their name, but might
>still resort to \a for alpha, under the assumption that \a is still
>valid because CIF2 is basically the same as CIF1, and, rightly or
>wrongly, they perceive the eliding mechanism as part of
>CIF syntax).
>
>I think this is an issue where we can't afford to take an 'as for
>CIF1...' approach, especially as the CIF1 specification
>isn't entirely satisfactory (e.g. there's an example in the
>line-folding protocol that uses elides in a file path to make a
>point,
>but actually these elides may easily be interpreted as escape
>sequences), and as the encoding issue is very much concerned with
>user practice, the large group of users that currently use elided
>character codes need to be aware of what the situation is in
>CIF2.
>
>I'm not convinced this issue should be left for discussion later;
>it is relevant when considering how the move beyond ASCII is specified.
>
>Cheers
>
>Simon
>
>From: Herbert J. Bernstein
>To: Group for discussing encoding and content validation schemes for
>CIF2
>Sent: Sunday, 26 September, 2010 23:14:55
>Subject: Re: [Cif2-encoding] How we wrap this up
>
>Dear Simon,
>
> The current CIF2 spec, with or without the changes I have suggested
>to temporarily resolve the encoding issue, is at best vague and
>confusing on the elide character issue. The interacting issue on
>which the CIF2 spec
>is clear is that we are changing the handling of quoted strings so
>that they end on the first occurrence of the quoting character, leaving
>the handling of elides to the calling application.
>
> This will be a problem -- the change from CIF1 in the termination of
>quoted strings along with the absence of a way of eliding the quotes
>will invalidate a significant number of existing CIFS without any simple
>mechanism to recover.
Rather than reopen another endless discussion, >I would suggest we simply add the python string concatenation character >"+" to ensure we can map all current CIF1 files and use Brian's common >semantic features for the moment. We can then deal with the full elides >discussion at a future date. > > Regards, > Herbert > > > > > >At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote: >>Dear all >> >>While reviewing my hypothetical 'to do' list for implementing CIF2 >>in current software, I realized that >>the issue of current support for elided character codes hasnt really >>been addressed in the context of CIF2. >>My 'to do' list contains notes that software could treat them as >>keyboard shortcuts, and their use could be >>defined in the dictionary. However, that was based on a distinct >>difference between CIF1 and CIF2, >>while the current arguments for 'as for CIF1...' suggest that the >>distinction between CIF1 and CIF2 >>should almost be imperceptible. >> >>How is this issue to be addressed in the specification? >> >>Cheers >> >>Simon >> >> >> >>From: Herbert J. Bernstein >><yaya at bernstein-plus-sons.com> >>To: Group for discussing encoding and content validation schemes for >>CIF2 <cif2-encoding at iucr.org> >>Sent: Saturday, 25 September, 2010 20:37:46 >>Subject: Re: [Cif2-encoding] How we wrap this up >> >>Thank you for your cooperation. -- Herbert >> >>===================================================== >>Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> >>yaya at dowling.edu>yaya at dowling.edu >>===================================================== >> >>On Sat, 25 Sep 2010, SIMON WESTRIP wrote: >> >>> OK - as promised, I wont pursue the matter :-) >>> >>> >>> >>>____________________________________________________________________________ >>> From: Herbert J. 
Bernstein >>><yaya at bernstein-plus-sons.com> >>> To: Group for discussing encoding and content validation schemes for CIF2 >>><cif2-encoding at iucr.org> >>> Sent: Saturday, 25 September, 2010 19:18:54 >>> Subject: Re: [Cif2-encoding] How we wrap this up >>> >>> Dear Simon, >>> >>> Unfortunately, that is likely to take us back into our infinite loop or >>> into a diverging spiral. Right now, we would have UTF8 as no more or less a >>> default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad first guess as >>> the likely default encoding for any given CIF, but not a formal constraint. >>> I would suggest we leave the wording in that imprecise state, get CIF2 out >>> and accepted and then work further on the encoding issue. >>> >>> Regards, >>> Herbert >>> >>> ===================================================== >>> Herbert J. Bernstein, Professor of Computer Science >>> Dowling College, Kramer Science Center, KSC 121 >>> Idle Hour Blvd, Oakdale, NY, 11769 >>> >>> +1-631-244-3035 >>> yaya at dowling.edu >>> ===================================================== >>> >>> On Sat, 25 Sep 2010, SIMON WESTRIP wrote: >>> >>> > Dear all >>> > >>> > In the event that CIF2 adopts the 'any encoding' approach, would there be >>> > any objections to >>> > explicitly defining a default encoding in the specification, to be defaulted >>> > to when there were no indications >>> > to the contrary. At worst this would give CIF2 service providers an excuse >>> > to interpret CIFs as e.g. UTF8 if they couldn't >>> > determine the encoding by other means - but such intolerant service >>> > providers would soon find that their service is >>> > not successful - while at best this might raise awareness of the issues >>> > regarding encoding once non-ASCII is used in >>> > a CIF.
Essentially, it does not require users to change their working >>> > practices, which is one of the main arguments for >>> > 'any encoding'. >>> > >>> > So, CIF2 would remain 'any encoding', and specifications in terms of e.g. >>> > "Herbert's as for CIF1..." >>> > might only require a single sentence to define the default after stating >>> > what the 'preferred' encoding was; >>> > the proposal might be phrased as "Herbert's as for CIF1..." + "explicit >>> > default encoding"? >>> > >>> > I do not wish to prolong this debate - if there are objections I will not >>> > launch into an endless round of exchanges >>> > that cover the same ground that has led us this far. >>> > >>> > Cheers >>> > >>> > Simon >>> > >>> > >>> > ____________________________________________________________________________ >>> > From: SIMON WESTRIP <simonwestrip at btinternet.com> >>> > To: Group for discussing encoding and content validation schemes for CIF2 >>> > <cif2-encoding at iucr.org> >>> > Sent: Friday, 24 September, 2010 20:10:13 >>> > Subject: Re: [Cif2-encoding] How we wrap this up >>> > >>> > Dear James >>> > >>> > As you may have gathered I have been reconsidering my position on this >>> > issue. >>> > Please forgive me, but I would like to change my vote if that is OK, in >>> > favour of the 'any encoding' camp. >>> > This apparent U-turn is not a response to recent contributions; rather it is >>> > the outcome of a meeting I had this morning >>> > where I demonstrated some new software to the Managing Editor of IUCr >>> > journals. >>> > >>> > By way of explanation: >>> > >>> > I have been developing a new docx template which the IUCr editorial office >>> > is shortly to release for use by >>> > authors. The template will be packaged with some tools to extract data from >>> > CIFs >>> > and tabulate them in the Word document, e.g.
open an mmCIF, click a button, >>> > and standard >>> > tables populated with data from the CIF will be included in the document, >>> > acting as >>> > table templates for the author to edit as appropriate for their manuscript. >>> > >>> > Inclusion of the mmCIF tools is part of an unofficial policy to 'coax' >>> > biologists to start using/accepting mmCIF >>> > as a useful medium, rather than as a product of their deposition to the PDB, >>> > and to encourage them to become comfortable >>> > with passing mmCIFs between applications, and even to edit the things (in >>> > the same way as the core-CIF community >>> > treats CIFs). For example, our perception is that there is no reason why an >>> > author should not feel free to take an mmCIF >>> > that has been created by e.g. pdb_extract and populate it using third-party >>> > software before uploading to the PDB for >>> > deposition. >>> > >>> > This cause would not be furthered by effectively invalidating an mmCIF if it >>> > were not to be encoded in one of >>> > the specified encodings. >>> > >>> > So although I am uneasy about a specification that propagates uncertainty, >>> > I'm also uneasy about alienating users, >>> > especially when we are struggling to change their mindset as in the case of >>> > the biological community >>> > (my perception of the biological community's attitude to mmCIF is based on >>> > feedback from authors/coeditors to >>> > IUCr journals). >>> > >>> > Granted this may not be the most compelling argument in favour of 'any >>> > encoding', but recognizing the hurdles that >>> > may have to be overcome once we move beyond ASCII whatever the CIF2 >>> > specification, I support 'any encoding' >>> > as 'a means to an end'. >>> > >>> > I will not provide my preferences in terms of the numbered options until you >>> > say so; after all, I have already voted and >>> > all this has to be signed off by COMCIFs in any case.
>>> > >>> > Cheers >>> > >>> > Simon >>> > >>> > >>> > ____________________________________________________________________________ >>> > From: "Bollinger, John C" <John.Bollinger at STJUDE.ORG> >>> > To: Group for discussing encoding and content validation schemes for CIF2 >>> > <cif2-encoding at iucr.org> >>> > Sent: Friday, 24 September, 2010 14:50:57 >>> > Subject: Re: [Cif2-encoding] How we wrap this up >>> > >>> > Dear Simon, >>> > >>> > It is exactly this sort of issue that drove me to support more permissive >>> > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local proposal. >>> > >>> > Do please think about the considerations Herb raised. As you reconsider >>> > your votes, I urge you also to ask yourself what, *precisely*, a "text file" >>> > is, and to consider whether your answer is functionally different from my >>> > "local". If you decide not, then please consider what that answer implies >>> > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) under >>> > each option on the table, especially for CIFs containing non-ASCII >>> > characters. Whatever you decide about the meaning of "text file", please >>> > consider whether reasonable people might reach a different conclusion, as I >>> > assert they might do, and to what extent the standard needs to address that. >>> > >>> > >>> > Regards, >>> > >>> > John >>> > -- >>> > John C. Bollinger, Ph.D. >>> > Department of Structural Biology >>> > St.
Jude Children's Research Hospital >>> > >>> > >>> > >From: cif2-encoding-bounces at iucr.org [mailto:cif2-encoding-bounces at iucr.org] On Behalf Of SIMON WESTRIP >>> > >Sent: Friday, September 24, 2010 7:53 AM >>> > >To: Group for discussing encoding and content validation schemes for CIF2 >>> > >Subject: Re: [Cif2-encoding] How we wrap this up >>> > > >>> > >Dear Herbert >>> > > >>> > >Not for the first time, I find your argument persuasive. Brian's vote and >>> > explanation have also raised some >>> > >questions that I would like to look into. >>> > > >>> > >I will confirm or otherwise my vote as soon as possible, assuming that is >>> > OK with James and assuming that >>> > >this round of votes might wrap this up. >>> > > >>> > >Cheers >>> > > >>> > >Simon >>> > > >>> > >________________________________________ >>> > >From: Herbert J. Bernstein <yaya at bernstein-plus-sons.com> >>> > >To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding at iucr.org> >>> > >Sent: Friday, 24 September, 2010 13:17:14 >>> > >Subject: Re: [Cif2-encoding] How we wrap this up >>> > > >>> > >If he ignores the standard, in most cases all he has to do to comply with >>> > CIF2 is to run whatever applications he currently runs to produce CIF1 and, >>> > perhaps, in some cases, run a minor edit pass at the end, to convert for the >>> > minor syntactic differences and/or changed tags required to comply with >>> > CIF2 and the new dictionaries, but he is unlikely to have to do anything to >>> > deal with the messy business of whether his encoding is really a proper UTF8 >>> > encoding or not.
>>> > >>> > >The punishment, if he tries to comply, is that he has to totally uproot and >>> > reconfigure the environment in which he produces CIFs from whatever he is >>> > currently doing to create an environment in which he can reliably create and, >>> > more importantly, transmit compliant UTF8 files. This can be very tricky if >>> > he does only a partial job, say fudging in one special application (yet to >>> > be written), because if he stays with his old system, all kinds of tools >>> > will keep trying to transcode whatever he has produced back to whatever his >>> > system considers a standard. Those of us who have files, applications and >>> > tools that have lived through several generations of macs are living proof >>> > of the problem. Macs now have excellent UTF8/16 unicode support, but every >>> > once in a while in working with a unicode file I find it has been strangely >>> > and unexpectedly converted to something else, and it can be really tricky to >>> > spot when the unaccented roman text part has been left untouched but just a >>> > few accented letters have gotten different accents. >>> > >>> > >Mandating UTF8 is simply trying to shift a serious software problem from >>> > the central handlers of CIF (IUCr, PDB, etc.) to the external users. Most >>> > users will probably have the good sense to simply ignore the demand and >>> > leave the burden just where it is now. A few sophisticated users will >>> > probably adapt with no trouble, but the punishment for those users who >>> > blindly follow orders before we have a complete multiplatform supporting >>> > infrastructure in place by mandating UTF8 is severe, expensive and >>> > undeserved. Until and unless we have developed solid support, we will just >>> > be alienating people from CIF. I will continue to oppose such a move. >>> > >>> > [...]
>>> > >>> > >>> > Email Disclaimer: www.stjude.org/emaildisclaimer >>> > _______________________________________________ >>> > cif2-encoding mailing list >>> > cif2-encoding at iucr.org >>> > http://scripts.iucr.org/mailman/listinfo/cif2-encoding >>> > >>> > >>> >>> >> >>_______________________________________________ >>cif2-encoding mailing list >>cif2-encoding at iucr.org >>http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > >-- >===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu >===================================================== >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From simonwestrip at btinternet.com Mon Sep 27 13:28:44 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Mon, 27 Sep 2010 12:28:44 +0000 (GMT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> Message-ID: <93847.2110.qm@web87014.mail.ird.yahoo.com> Dear Herbert I do not understand why it is *only* options 3, 4 or 5 that allow users to start using unicode characters? More generally, are you suggesting that the use of anything but ASCII in a data value is only allowed if e.g. the dictionary definition of the data item permits, or even only if the IUCr says that's OK? Fundamentally, I'm starting to infer that the purpose of the 'as for CIF1...' approach to encoding is to open the door to full unicode support, but not actually let anyone cross the threshold? Cheers Simon ________________________________ From: Herbert J.
Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Monday, 27 September, 2010 11:48:49 Subject: Re: [Cif2-encoding] How we wrap this up Dear Simon, Under the CIF2 specification with UTF8 in place of ASCII there is _no_ change in the use of elided ASCII sequences to represent non-ASCII characters until and unless the IUCr publications office decides that, for that particular application, they are ready to accept something new. It is _only_ if you go forward with options 3, 4 or 5 that you are giving the green light to users to do precisely what you are concerned about -- using the unicode characters instead, in possibly strange admixtures that nobody is ready to process. Remember, under the CIF2 specification as now written, it is _not_ part of the CIF2 specification to determine the handling of the characters in quoted strings other than to ensure that those strings do not contain illegal characters from the point of view of CIF2. Dealing with the validity of particular character sequences in strings users provide is, just as in CIF1, the responsibility of the application (i.e. the IUCr journal flows or the PDB archiving flows). My apologies to James, who I know is trying to do what he believes to be right, but I believe James has things backwards -- the "deep breath" is provided by my proposal -- taking the time to properly engineer the use of the extra characters UTF8 allows us to discuss clearly, while James' push for an immediate prescriptive use of UTF8 with prescriptions that differ drastically from what has been adopted by all other frameworks (HTML, XML, python, etc.) in ways that are untested and unsupported by most existing software is the untimely rush to judgement.
I beg you to support options 1 and/or 2 to allow CIF2 to go forward in all other respects while we all take a deep breath and deal with the tricky issue you raised slowly and carefully without the pressure of trying to have CIF2 itself ready for next summer. Regards, Herbert -- ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== _______________________________________________ cif2-encoding mailing list cif2-encoding at iucr.org http://scripts.iucr.org/mailman/listinfo/cif2-encoding -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100927/96426c88/attachment-0001.html From jamesrhester at gmail.com Mon Sep 27 14:40:18 2010 From: jamesrhester at gmail.com (James Hester) Date: Mon, 27 Sep 2010 23:40:18 +1000 Subject: [Cif2-encoding] Let's all take a deep breath... In-Reply-To: References: Message-ID: See interpolated comments. On Mon, Sep 27, 2010 at 11:36 AM, Herbert J. Bernstein < yaya at bernstein-plus-sons.com> wrote: > Dear Colleagues, > > As one might expect, I respectfully disagree with almost everything > James has said, but the really critical point of disagreement is > > >(iii) It is extremely misleading to think that simply substituting > >UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the > >same results as we had for CIF1. The 'any encoding' clause in the > >CIF1 standard was essentially irrelevant - encodings used in the > >overwhelming majority of systems producing CIF1 files coincided with > >ASCII for CIF text, as I have said many times before, so software had > >no trouble in turning a stream of CIF bytes from any unknown source > >into the same text that the CIF writer was working from. If I repeat > >this point endlessly, it is only because the CIF1 approach continues > >to be invoked like magic fairy dust that will make everything OK, when > >in fact the magic fairy dust was the dominance of ASCII encoding for > >ASCII codepoints. There is *no such uniformity* in encoding of > >Unicode codepoints. We have a new problem for CIF, and whatever we do > >will have *new* consequences, and that very much includes the 'as for > >CIF1' proposal. So please, enough with the 'CIF1 has served us well > >for 15 years' line. > > I vigorously disagree on this point. 
If the only change we were to > make in going to CIF2 were to be that we were inserting UTF8 in place > of ASCII there would be absolutely _no_ impact on any existing > CIF application or CIF data file, because for the characters that > are formally legal under CIF1, UTF8 and ASCII are identical encodings. > The relevant portion of the CIF1 syntax specification is: > > "22. Characters within a CIF are restricted to certain printable or > white-space characters. Specifically, these are the ones located in > the ASCII character set at decimal positions 09 (HT or horizontal > tab), 10 (LF or line feed), 13 (CR or carriage return) and the > letters, numerals and punctuation marks at positions 32-126." > > Any existing data file or application that conforms to that restriction > in _any_ encoding, will be indistinguishable with the "UTF8 in place of > ASCII" > change. For those applications and data file, this is not a change. > Up to here, I absolutely agree with you. > > As James implies, the only new problems that arise are in introducing > characters into CIFs drawn from codepoints 128 ff, but we already have > that problem under CIF1. No we don't have that problem at all in CIF1, because we don't accept characters from above codepoint 128 in CIF1. While it is indeed the only new problem, it is a huge one. > The use of "UTF8 in place of ASCII" simply > allows us to coherently consider how to handle those characters in the > future. How does it create this breathing space? Because as far as I can see, 'as for CIF1' is allowing any encoding to be used for Unicode codepoints. This is where I was forced to invoke magic fairy dust, because the 'UTF8 in place of ASCII' approach advocates a formulation that was largely irrelevant in CIF1 to solve a problem that was not present in CIF1 in the first place, with no justification beyond "it worked for CIF1", when in fact it worked about as well as my most excellent elephant repellent does (look, no elephants, it must work!). 
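The one point both correspondents accept, that ASCII and UTF8 coincide for the characters CIF1 allows, can be checked mechanically. The following is a minimal Python sketch, assuming the CIF1 character set is exactly the one in the quoted paragraph 22 and that the bytes come from an ASCII-compatible system; the helper names `is_cif1_text` and `same_under_ascii_and_utf8` are hypothetical and not part of any CIF library.

```python
# Code points allowed by the quoted paragraph 22 of the CIF1 syntax
# specification: HT (9), LF (10), CR (13) and positions 32-126.
CIF1_CODEPOINTS = frozenset({9, 10, 13}) | frozenset(range(32, 127))


def is_cif1_text(data: bytes) -> bool:
    """Return True if every byte is a character permitted in CIF1 (hypothetical helper)."""
    return all(b in CIF1_CODEPOINTS for b in data)


def same_under_ascii_and_utf8(data: bytes) -> bool:
    """For CIF1-legal bytes, ASCII and UTF-8 decoding give identical text."""
    return data.decode("ascii") == data.decode("utf-8")


sample = b"data_example\n_cell_length_a  5.431\n"
print(is_cif1_text(sample), same_under_ascii_and_utf8(sample))

# A byte above 127 (here Latin-1 e-acute) is outside the CIF1 set,
# whichever encoding produced it.
print(is_cif1_text("\u00e9".encode("latin-1")))
```

The sketch says nothing about code points above 127, which is exactly where the two positions in this exchange diverge.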
Sorry for being obtuse, but I don't follow your logic at all. > If we don't use them in any IUCr-sanctioned dictionary tags > for the moment, we are in no worse shape going forward under my proposal > with Brian's recommendations added than we are staying with CIF1 and > ASCII, and, I believe, in much better shape. > Is this also part of your proposal - that the IUCr hold off on using non-ASCII characters in tags (and, I assume, values) until we sort this out? > This is a serious matter, not appropriate from sarcastic "fairy dust" > comments. You will note that I am taking it extremely seriously (my wife despairs), to the point that I am forced to write about castles in the sky. Which is why I really need to understand the thinking behind your statements and those of others. > It really is true that "CIF1 has served us well > for 15 years," Yes it has. This however does not justify adopting the same approach to encoding. See above for why. > and we should take our time on the encoding issue > and be certain we are really improving things, not making them > worse by what we propose. By adopting Unicode we immediately create big new problems, to which we need new solutions to even get close to the current situation in CIF1. > I agree that we need to discuss and > resolve the encoding issue, but it is not a new problem suddenly > introduced by using UTF8. (I assume you mean Unicode, not UTF8). No, the opposite is true. This is a new problem arising entirely from adoption of Unicode, because in the CIF1 ASCII range most encodings are identical. This is not so for Unicode. > In my opinion, however, a hasty, ill-considered > resolution to that serious problem would be a very bad idea, but > delaying all of CIF2 in order to wait until we work our way out of > a thicket that has no clear exit yet also seems to me to be a very bad > idea. You are right, although hasty is a bit of an exaggeration for the pointy end of a several-month-long discussion. 
Unfortunately, your proposal does not in my opinion create breathing room but simply ducks the new problem of multiple encodings. If you want breathing room, you allow only self-identifying encodings now (UTF8 and UTF16) and then work on how you would allow other encodings. Something like: "Only self-identifying encodings are currently supported for use in CIF2, but other encodings may become available at a later date". Then you can spend all the time in the world working on systems for incorporating other encodings. > I expect if we ever manage to meet face to face, or even in > a series of Skype meetings, we could come to closure fairly quickly, > but as things now stand it seems unlikely that we will have a chance > to do that before the IUCr meeting. > I have bought a web cam and when I spend less time researching encoding issues I should be able to get set up and running. > > I use CIF as a text format and I use it as a binary format. I use > both DDL1 CIF and DDL2 CIF. I am also a cross-platform and cross > version CIF programmer. I do not fool myself into thinking CIF1 to be > perfect. It is not. But it is a very useful tool, and I would like > to be certain that what we propose as CIF2 is a least as useful as > CIF1 and hopefully more so. I do not believe that options 3, 4 or > 5 are far enough along to provide such utility, and by being too > prescriptive at this stage, may well do harm. > Why can't we be open-ended, and say that "UTF8 and UTF16 are acceptable in the whole Unicode range. All other encodings are acceptable in the ASCII range only. We are investigating ways of extending the range of applicability of non UTF encodings". Given that UTF8 and UTF16 are definitely encodings that CIF2 will use, and require no extra CIF2 machinery (hashcodes etc.) to identify them, they are ready for use now. I am more than happy for us to continue to investigate ways of allowing a wider or infinite variety of encodings. 
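The open-ended rule suggested above (UTF8 and UTF16 for the whole Unicode range, anything else for the ASCII range only) could look roughly like this in a reader. This is only a sketch under stated assumptions: the function name `decode_cif2` is hypothetical, UTF16 is assumed to carry a byte-order mark, and "other encodings" are assumed to be ASCII-compatible, so that their ASCII-only output is already valid UTF8.

```python
import codecs


def decode_cif2(data: bytes) -> str:
    """Decode bytes under the open-ended rule sketched above (hypothetical helper)."""
    # Assumption: a UTF-16 file announces itself with a byte-order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # the codec reads and removes the BOM
    # 'utf-8-sig' strips an optional UTF-8 BOM; decoding is strict, so
    # byte sequences that are not well-formed UTF-8 are rejected.
    # Pure-ASCII input from any ASCII-compatible encoding passes here too.
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        pass
    # Non-ASCII bytes in some other encoding: not identifiable, so refuse.
    raise ValueError("non-ASCII content that is neither UTF-8 nor UTF-16")


print(decode_cif2(b"data_x\n"))
print(decode_cif2(codecs.BOM_UTF16_LE + "\u03b1".encode("utf-16-le")))
```

Strict UTF8 decoding is a very strong check rather than a proof: a file in another encoding could, in rare cases, happen to be well-formed UTF8.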
> > I urge all concerned to support either options 1 or 2 or both, so we > have get CIF2 out for the IUCr meeting this coming summer, and > to let the encoding issue take its own time. If by some chance > we come up with a solution before summer 2011, so much the better, > but please don't make the perfect (CIF2 with all issues including > the encoding issue resolved) the enemy of the good (the CIF2 we have > now with the encoding issue left open). > Both options 3 and 4 can also be used as the basis of a proposal that allows further work on the encoding issue (see previous paragraph). Indeed, proposals based on 3 and 4 allow Unicode code points to be used in a controlled way and do not encourage proliferation of encodings before we are able to manage that proliferation. I invite you to answer the question at the end of my previous email (reproduced below). Note that under proposals 3, 4 and 5 it has a simple answer. >Under the 'as for CIF1' proposal, how does my >program turn these bytes into text in the way that the writer of the >bytes intended? If that is not yet resolved, how can anybody even >write a CIF2 program? all the best, James. > > Regards, > Herbert > > > At 9:46 AM +1000 9/27/10, James Hester wrote: > >Well, I didn't even manage to properly call a vote and everybody has > >piled in, Simon even managed to vote twice (and that's quite OK Simon, > >we are trying to determine what the will of the group is and so I > >think it only reasonable that if somebody's assessment of the > >situation changes that they can 'update' their vote). I am however > >unhappy that both Brian and Simon introduced new concerns and nobody > >has had a chance to comment on how the various proposals under > >consideration might affect those concerns. I would therefore like to > >suggest that the voting period continues until the end of this week, > >and that we all endeavour to express any concerns or comments that we > >need to make in a timely fashion. 
I will be commenting on Brian and > >Simon's concerns presently, and also on Herbert's proposal, which I > >have not subjected to my hopefully not too long-winded scrutiny. > > > >None of us should feel steamrolled by a certain artifical urgency that > >has appeared in the dialogue - while we do need to wrap things up in a > >timely fashion, it has only been 4 days since I even started > >discussing the vote. > > > >Some initial general comments (I will comment separately on Brian and > >Simon's issues). > > > >(i) We are *not* in an infinite loop. The last few months have seen > >several proposals analysed and explored, and it is my perception that > >these discussions have led at least some participants (including > >myself) to a better understanding of the consequences of what they are > >proposing. So nobody should feel that throwing out a new criticism of > >an old or new proposal is somehow hindering progress by looping over > >old ground. Quite the reverse, it is making progress. What *is* > >important is to get your comments into the mix in a timely fashion, > >because time is indeed short. > > > >(ii) It is not correct to assume that we can figure out the encoding > >issues later. Maybe we can, but maybe we can't. Once CIF2 files are > >produced and software is distributed, you can't put the genie back in > >the bottle, by which I mean you can't easily change the way that > >distributed software behaves, and how files are interpreted. We have > >to therefore be confident that the standard we promulgate does not > >close off an avenue we need for solving encoding issues. > > > >(iii) It is extremely misleading to think that simply substituting > >UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the > >same results as we had for CIF1. 
The 'any encoding' clause in the > >CIF1 standard was essentially irrelevant - encodings used in the > >overwhelming majority of systems producing CIF1 files coincided with > >ASCII for CIF text, as I have said many times before, so software had > >no trouble in turning a stream of CIF bytes from any unknown source > >into the same text that the CIF writer was working from. If I repeat > >this point endlessly, it is only because the CIF1 approach continues > >to be invoked like magic fairy dust that will make everything OK, when > >in fact the magic fairy dust was the dominance of ASCII encoding for > >ASCII codepoints. There is *no such uniformity* in encoding of > >Unicode codepoints. We have a new problem for CIF, and whatever we do > >will have *new* consequences, and that very much includes the 'as for > >CIF1' proposal. So please, enough with the 'CIF1 has served us well > >for 15 years' line. > > > >(iv) The majority are currently in favour of the 'as for CIF1' > >approach, which if nobody changes their vote by the end of the week, > >is what we will be taking to the DDLm group and COMCIFS. This means > >we will have a pure text standard, and I mean really pure, because > >there is no predictable link between this beautiful textual castle in > >the sky and the solid ground of bytes on disk. > > > >I am a cross-platform CIF programmer. Looking forward to the halcyon > >'as for CIF1' days that await us, a small question occupies my mind. > >As my program does not operate in that glorious abstract space > >occupied by pure text standards that are most certainly not anybody's > >laughing stock, my program will be forced to (as briefly as possible) > >deal with humble plebiean bytes according to some encoding to obtain > >the exalted CIF text. Under the 'as for CIF1' proposal, how does my > >program turn these bytes into text in the way that the writer of the > >bytes intended? If that is not yet resolved, how can anybody even > >write a CIF2 program? 
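The question of how a program turns bytes back into the writer's text can be made concrete with a small Python sketch. It assumes a writer whose local encoding is Latin-1 and readers that each assume a different local encoding; the helper `readings` and the choice of encodings are illustrative only.

```python
def readings(data: bytes, encodings):
    """Decode the same bytes under each assumed encoding (None if undecodable)."""
    out = {}
    for name in encodings:
        try:
            out[name] = data.decode(name)
        except UnicodeDecodeError:
            out[name] = None
    return out


# "Angstrom" with its two accented letters, written on a Latin-1 system.
written = "\u00c5ngstr\u00f6m".encode("latin-1")

result = readings(written, ["latin-1", "mac_roman", "cp437", "utf-8"])
for name, text in result.items():
    print(f"{name:10} {text!r}")
```

The unaccented letters survive under every assumption while the accented ones change or fail to decode, and nothing in the bytes themselves tells the reader which assumption was the writer's.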
> > > >-- > >T +61 (02) 9717 9907 > >F +61 (02) 9717 3145 > >M +61 (04) 0249 4148 > >_______________________________________________ > >cif2-encoding mailing list > >cif2-encoding at iucr.org > >http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > -- > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100927/47e85b90/attachment-0001.html From yaya at bernstein-plus-sons.com Mon Sep 27 14:45:16 2010 From: yaya at bernstein-plus-sons.com (Herbert J. 
Bernstein) Date: Mon, 27 Sep 2010 09:45:16 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <93847.2110.qm@web87014.mail.ird.yahoo.com> References: <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <93847.2110.qm@web87014.mail.ird.yahoo.com> Message-ID:

The problem is that options 3, 4 and 5 specifically prescribe the use of Unicode characters (that is the entire point of those options -- and that is the point in dispute -- whether we should be prescribing UTF8 or using it as we now use ASCII, as a way to be clear what we are talking about, as in CIF1) and we simply are not ready to deal with such a requirement yet.

I take the blame for starting this discussion many years ago when I simply asked for just what my motion says, that we start using UTF8 in the same way we had been using ASCII. Unfortunately this discussion has turned into a strong push to focus CIF on that particular encoding, stop using Brian's elides, etc. With the current weak state of software support for CIF and the large investment at the IUCr and at the PDB in current workflows, I think it would be a very disruptive and expensive change to make right now. God and the Devil are in the details.

Note that I am _not_ basing this argument on imgCIF. At this point it appears, unfortunately, that CIF2 and imgCIF will have to diverge. If we have enough face-to-face discussions, perhaps we can bring them together again, as we did in 1998, but that is an even more difficult discussion than the one we need to have on encodings.
What I hope we will do is to go at this in incremental stages:

1. Make the transition from CIF1 to CIF2 using new dictionaries but allowing most data files to remain unchanged, and providing simple algorithmic transformations for the rest, but keeping most of the current semantic extensions that we have in CIF1, focusing our energy on getting the new dictionaries used and making use of dREL;

2. Work on a CIF2.1 that, by creative and well-supported use of Unicode, allows for a well-organized transition from Brian's elides to use of Unicode characters;

3. Then, working in that context, whatever it turns out to be, work on having imgCIF make the transition to CIF2 in some reasonably compatible way.

I see how to do item 1 for next summer. I don't see how to do 2 and 3 in that time frame, though I am sure we could make a dent in them if we could meet face to face. Email tends to stiffen too many positions.

Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

On Mon, 27 Sep 2010, SIMON WESTRIP wrote:

> Dear Herbert
>
> I do not understand why it is *only* options 3, 4 or 5 that allow users to
> start using unicode characters?
>
> More generally, are you suggesting that the use of anything but ASCII in a
> data value is only allowed if e.g. the dictionary definition of the data
> item permits, or even only if the IUCr says that's OK?
>
> Fundamentally, I'm starting to infer that the purpose of the 'as for
> CIF1...' approach to encoding is to open the door to full unicode support,
> but not actually let anyone cross the threshold?
>
> Cheers
>
> Simon
>
> ____________________________________________________________________________
> From: Herbert J.
Bernstein
> To: Group for discussing encoding and content validation schemes for CIF2
> Sent: Monday, 27 September, 2010 11:48:49
> Subject: Re: [Cif2-encoding] How we wrap this up
>
> Dear Simon,
>
>   Under the CIF2 specification with UTF8 in place of ASCII there is
> _no_ change in the use of elided ASCII sequences to represent non-ASCII
> characters until and unless the IUCr publications office decides that,
> for that particular application, they are ready to accept something
> new.
>
>   It is _only_ if you go forward with options 3, 4 or 5 that you
> are giving the green light to users to do precisely what you are
> concerned about -- using the unicode characters instead
> in possibly strange admixtures that nobody is ready to process.
>
>   Remember, under the CIF2 specification as now written, it is
> _not_ part of the CIF2 specification to determine the handling
> of the characters in quoted strings other than to ensure that
> those strings do not contain illegal characters from the point
> of view of CIF2. Dealing with the validity of particular character
> sequences in strings users provide is, just as in CIF1, the
> responsibility of the application (i.e. the IUCr journal flows
> or the PDB archiving flows).
>
>   My apologies to James, who I know is trying to do what he believes
> to be right, but I believe James has things backwards -- the "deep
> breath" is provided by my proposal -- taking the time to properly engineer
> the use of the extra characters UTF8 allows us to discuss clearly,
> while James' push for an immediate prescriptive use of UTF8 with
> prescriptions that differ drastically from what has been adopted
> by all other frameworks (HTML, XML, Python, etc.) in ways that
> are untested and unsupported by most existing software is
> the untimely rush to judgement.
>
>   I beg you to support options 1 and/or 2 to allow CIF2 to go forward
> in all other respects while we all take a deep breath and deal
> with the tricky issue you raised slowly and carefully without the
> pressure of trying to have CIF2 itself ready for next summer.
>
>   Regards,
>     Herbert
>
> At 9:34 AM +0000 9/27/10, SIMON WESTRIP wrote:
> >I was not so concerned about invalidating existing CIFs, or even the
> >likelihood that users will continue to write e.g. 'f\'oo' - this is a
> >syntax error in CIF2 that is readily recoverable.
> >
> >Rather there is a large group of CIF1 users that are in the habit of
> >using elided ASCII sequences to represent non-ASCII characters. With CIF2
> >these users will be able to use the unicode character itself.
> >So we might end up with a mixture of escaped sequences and unicode
> >characters (e.g. a user may have a keyboard shortcut
> >for an accented character that forms part of their name, but might
> >still resort to \a for alpha, under the assumption that \a is still
> >valid because CIF2 is basically the same as CIF1, and, rightly or
> >wrongly, they perceive the eliding mechanism as part of
> >CIF syntax).
> >
> >I think this is an issue where we can't afford to take an 'as for
> >CIF1...' approach, especially as the CIF1 specification
> >isn't entirely satisfactory (e.g. there's an example in the
> >line-folding protocol that uses elides in a file path to make a point,
> >but actually these elides may easily be interpreted as escape
> >sequences), and as the encoding issue is very much concerned with
> >user practice, the large group of users that currently use elided
> >character codes need to be aware what the situation is in CIF2?
> >
> >I'm not convinced this issue should be left for discussion later;
> >it is relevant when considering how the move beyond ASCII is specified.
> >
> >Cheers
> >
> >Simon
> >
> >
> >From: Herbert J.
Bernstein
> >To: Group for discussing encoding and content validation schemes for CIF2
> >Sent: Sunday, 26 September, 2010 23:14:55
> >Subject: Re: [Cif2-encoding] How we wrap this up
> >
> >Dear Simon,
> >
> >  The current CIF2 spec, with or without the changes I have suggested
> >to temporarily resolve the encoding issue, is at best vague and
> >confusing on the elide character issue. The interacting issue on
> >which the CIF2 spec is clear is that we are changing the handling of
> >quoted strings so that they end on the first occurrence of the quoting
> >character and leaving the handling of elides to the calling application.
> >
> >  This will be a problem -- the change from CIF1 in the termination of
> >quoted strings along with the absence of a way of eliding the quotes
> >will invalidate a significant number of existing CIFs without any simple
> >mechanism to recover. Rather than reopen another endless discussion,
> >I would suggest we simply add the Python string concatenation character
> >"+" to ensure we can map all current CIF1 files and use Brian's common
> >semantic features for the moment. We can then deal with the full elides
> >discussion at a future date.
> >
> >  Regards,
> >    Herbert
> >
> >
> >At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote:
> >>Dear all
> >>
> >>While reviewing my hypothetical 'to do' list for implementing CIF2
> >>in current software, I realized that
> >>the issue of current support for elided character codes hasn't really
> >>been addressed in the context of CIF2.
> >>My 'to do' list contains notes that software could treat them as
> >>keyboard shortcuts, and their use could be
> >>defined in the dictionary. However, that was based on a distinct
> >>difference between CIF1 and CIF2,
> >>while the current arguments for 'as for CIF1...' suggest that the
> >>distinction between CIF1 and CIF2
> >>should almost be imperceptible.
> >>
> >>How is this issue to be addressed in the specification?
> >> > >>Cheers > >> > >>Simon > >> > >> > >> > >>From: Herbert J. Bernstein > >><yaya at bernstein-plus-sons.com> > >>To: Group for discussing encoding and content validation schemes for > >>CIF2 <cif2-encoding at iucr.org> > >>Sent: Saturday, 25 September, 2010 20:37:46 > >>Subject: Re: [Cif2-encoding] How we wrap this up > >> > >>Thank you for your cooperation. -- Herbert > >> > >>===================================================== > >>Herbert J. Bernstein, Professor of Computer Science > >>? Dowling College, Kramer Science Center, KSC 121 > >>? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > >> > >>? ? ? ? ? ? ? ? +1-631-244-3035 > >> > >>yaya at dowling.edu> u>yaya at dowling.edu > >>===================================================== > >> > >>On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > >> > >>>? OK - as promised, I wont pursue the matter :-) > >>> > >>> > >>> > >>>________________________________________________________________________ > ____ > >>>? From: Herbert J. Bernstein > >>><yaya at bernstein-plus-sons.c > om>yaya at bernstein-plus-sons.com> > >>>? To: Group for discussing encoding and content validation schemes for > CIF2 > >>> > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>>? Sent: Saturday, 25 September, 2010 19:18:54 > >>>? Subject: Re: [Cif2-encoding] How we wrap this up > >>> > >>>? Dear Simon, > >>> > >>>? ? Unfortunately, that is likely to take us back into our infinite loop > or > >>>? into a diverging spiral.? Right now, we would have UTF8 as no > >>>more or less a > >>>? default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad > >>>first guess as > >>>? the likely default encoding for any given CIF, but not a formal > >>>constraint. > >>>? I would suggest we leave the wording in that imprecise state, get CIF2 > out > >>>? and accepted and then work further on the encoding issue. > >>> > >>>? ? Regards, > >>>? ? ? Herbert > >>> > >>>? ===================================================== > >>>? 
Herbert J. Bernstein, Professor of Computer Science > >>>? ? Dowling College, Kramer Science Center, KSC 121 > >>>? ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > >>> > >>>? ? ? ? ? ? ? ? ? +1-631-244-3035 > >>> > >>>yaya at dowling.edu> du>yaya at dowling.edu > >>>? ===================================================== > >>> > >>>? On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > >>> > >>>? > Dear all > >>>? > > >>>? > In the event that CIF2 adopts the 'any encoding' approach, > >>>would there be > >>? > > any objections to > >? >>? > explicitly defining a default encoding in the specification, to be > >>>? defaulted > >>>? > to when there were no indications > >>>? > to the contrary. At worst this would give CIF2 service > >>>providers an excuse > >>>? > to interpret CIFs as e.g. UTF8 if they couldnt > >>>? > determine the encoding by other means - but such intollerant service > >>>? > providers would soon find that their service is > >>>? > not successful - while at best this might raise awareness of the > issues > >>>? > regarding encoding once non-ASCII is used in > >>>? > a CIF. Essentially, it does not require users to change there working > >>>? > practices, which is one of the main arguments for > >>>? > 'any encoding'. > >>>? > > >>>? > So, CIF2 would remain 'any encoding', and specifications in > >>>terms of e.g. > >>>? > "Herbert's as for CIF1..." > >>>? > might only require a single sentence to define the default after > stating > >>>? > what the 'preferred' encoding was; > >>>? > the proposal might be phrased as "Herbert's as for CIF1..." + > "explicit > >>>? > default encoding"? > >>>? > > >>>? > I do not wish to prolong this debate - if there are objections > >>>I will not > >>>? > launch into an endless round of exchanges > >>>? > that cover the same ground that has led us this far. > >>>? > > >>>? > Cheers > >>>? > > >>>? > Simon > >>>? > > >>>? > > >>>? > > >>>? > > >>>? > > >>>? 
> > >>> > >>>>_______________________________________________________________________ > ____ > >>>? _ > >>>? > From: SIMON WESTRIP > >>><simonwestrip at btinternet.com > >simonwestrip at btinternet.com> > >>>? > To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>>? > > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>>? > Sent: Friday, 24 September, 2010 20:10:13 > >>>? > Subject: Re: [Cif2-encoding] How we wrap this up > >>>? > > >>>? > Dear James > >>>? > > >>>? > As you may have gathered I have been reconsidering my position on > this > >>>? > issue. > >>>? > Please forgive me, but I would like to change my vote if that is OK, > in > >>>? > favour of the 'any encoding' camp. > >>>? > This apparent U-turn is not a response to recent > >>>contributions; rather it > >>>? is > >>>? > the outcome of a meeting I had this morning > >>>? > where I demonstrated some new software to the Managing Editor of IUCr > >>>? > journals. > >>>? > > >>>? > By way of explanation: > >>>? > > >>>? > I have been developing a new docx template which the IUCr > >>>editorial office > >>>? > is shortly to release for use by > >>>? > authors. The template will be packaged with some tools to extract > data > >>>? from > >>>? > CIFs > >>>? > and tabulate them in the Word document, e.g. open an mmCIF, click a > >>>? button, > >>>? > and standard > >>>? > tables populated with data from the CIF will be included in > >>>the document, > >>>? > acting as > >>>? > table templates for the author to edit as appropriate for their > >>>? manuscript. > >>>? > > >>>? > Inclusion of the mmCIF tools is part of an unofficial policy to > 'coax' > >>>? > biologists to start using/accepting mmCIF > >>>? > as a useful medium, rather than as a product of their deposition to > the > >>>? PDB, > >>>? > and to encourage them to become comfortable > >>>? > with passing mmCIFs between applications, and even to edit the > >>>things (in > >>>? 
> the same way as the core-CIF community > >>>? > treats CIFs). For example, our perception is that there is no reason > why > >>>? an > >>>? > author should not feel free to take an mmCIF > >>>? > that has been created by e.g. pdb_extract and populate it using > >>>? third-party > >>>? > software before uploading to the PDB for > >>>? > deposition. > >>>? > > >>>? > This cause would not be furthered by effectively invalidating > >>>an mmCIF if > >>>? it > >>>? > were not to be encoded in one of > >>>? > the specified encodings. > >>>? > > >>>? > So although I am uneasy about a specification that propogates > >>>uncertainty, > >>>? > I'm also uneasy about alienating users, > >>>? > especially when we are struggling to change their mindset as in the > case > >>>? of > >>>? > the biological community > >>>? > (my perception of the biological community's attitude to mmCIF > >>>is based on > >>>? > feedback from authors/coeditors to > >>>? > IUCr journals). > >>>? > > >? >>? > Granted this may not be the most compelling argument in favour of > 'any > >>>? > encoding', but recognizing the hurdles that > >>>? > may have to be overcome once we move beyond ASCII whatever the CIF2 > >>>? > specification, I support 'any encoding' > >>>? > as 'a means to an end'. > >>>? > > >>>? > I will not provide my preferences in terms of the numbered options > until > >>? > you > >>>? > say so; afterall, I have already voted and > >>>? > all this has to be signed off by COMCIFs in any case. > >>>? > > >>>? > Cheers > >>>? > > >>>? > Simon > >>>? > > >>>? > > >>>? > > >>>? > > >>> > >>>>_______________________________________________________________________ > ____ > >>>? _ > >>>? > From: "Bollinger, John C" > >>><John.Bollinger at STJUDE.ORG> ilto:John.Bollinger at STJUDE.ORG>John.Bollinger at STJUDE.ORG> > >>>? > To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>>? 
> > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>>? > Sent: Friday, 24 September, 2010 14:50:57 > >>>? > Subject: Re: [Cif2-encoding] How we wrap this up > >>>? > > >>>? > Dear Simon, > >>>? > > >>>? > It is exactly this sort of issue that drove me to support more > >>>permissive > >>>? > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local > >>>? proposal. > >>>? > > >>>? > Do please think about the considerations Herb raised.? As you > reconsider > >>>? > your votes, I urge you also to ask yourself what, *precisely*, a > "text > >>>? file" > >>>? > is, and to consider whether your answer is functionally > >>>different from my > >>>? > "local".? If you decide not, then please consider what that > >>>answer implies > >>>? > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) > under > >>>? > each option on the table, especially for CIFs containing non-ASCII > >>>? > characters.? Whatever you decide about the meaning of "text > >>>file", please > >>>? > consider whether reasonable people might reach a different > >>>conclusion, as > >>>? I > >>>? > assert they might do, and to what extent the standard needs to > address > >>>? that. > >>>? > > >>>? > > >>>? > Regards, > >>>? > > >>>? > John > >>>? > -- > >>>? > John C. Bollinger, Ph.D. > >>>? > Department of Structural Biology > >>>? > St. Jude Children's Research Hospital > >>>? > > >>>? > > >>>? > >From: > >>>cif2-encoding-bounces at iuc > r.org>cif2-encoding-bounces at iucr.org > > >>>? > > >>>[mailto:cif2-encoding-bou > nces at iucr.org>cif2-encoding-bounces@ > iucr.org] > >>>On Behalf Of SIMON WESTRIP > >>>? > >Sent: Friday, September 24, 2010 7:53 AM > >>>? > >To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>>? > >Subject: Re: [Cif2-encoding] How we wrap this up. . > >>>? > > > >>>? > >Dear Herbert > >>>? > > > >>>? > >Not for the first time, I find your arguement persuasive. Brian's > vote > >>>? 
and > >>>? > explanation have also raised some > >>>? > >questions that I would like to look into. > >>>? > > > >>>? > >I will confirm or otherwise my vote as soon as possible, > >>>assuming that is > >>>? > OK with James and assuming that > >>>? > >this round of votes might wrap this up. > >>>? > > > >>>? > >Cheers > >>>? > > > >>>? > >Simon > >>>? > > > >>>? > >________________________________________ > >>>? > >From: Herbert J. Bernstein > >>><yaya at bernstein-plus-sons.c > om>yaya at bernstein-plus-sons.com> > >>>? > >To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>>? > > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>>? > >Sent: Friday, 24 September, 2010 13:17:14 > >>>? > >Subject: Re: [Cif2-encoding] How we wrap this up > >>>? > > > >>>? > >If he ignores the standard, in most cases all he has to do to > >>>comply with > >>>? > CIF2 is to run whatever applications he currently runs to produce > CIF1 > >>>? and, > >>>? > perhaps, in some cases, run a minor edit pass at the end, to convert > for > >>>? the > >>>? > minor syntactive differences and/or changed tags required to comply > with > >>>? > CIF2 and the new dictionaries, but he is unlikely to have to do > anything > >? >>? to > >>>? > deal with the messy business of whether his encoding is really a > proper > >>>? UTF8 > >>>? > encoding or not. > >>>? > > >>>? > >The punishment if he tries to comply, is that he has to totally > uproot > >>>? and > >>>? > reconfigure the environment in which he produces CIFs from > >>>whatever he is > >>>? > currently doing to create an enviroment in which he can reliably > create > >>>? and, > >>>? > more importantly, transmit compliant UTF8 files.? This can be > >>>very tricky > >>>? if > >>>? > he does only a partial job, say fudging in one special > >>>application (yet to > >>>? > be written), because if he stays with his old system, all kinds of > tools > >>>? 
> will keep trying to transcode whatever he has produced back to > whatever > >>>? his > >>>? > system considers a standard. Those of us who have files, > >>>applications and > >>>? > tools that have lived through several generations of macs are > >>>living proof > >>>? > of the problem. Macs now have excellent UTF8/16 unicode > >>>support, but every > >>? > > once in a while in working with a unicode file I find it has been > >>>? strangely > >>>? > and unexpectedly converted to something else, and it can be > >>>really tricky > >>>? to > >>>? > spot when the unaccented roman text part has been left > >>>untouched but just > >>>? a > >>>? > few accen > >>>? > ted letters have gotten different accents. > >>>? > > >>>? > >Mandating UTF8 is simply trying to shift a serious software > >>>problem from > >>>? > the central handlers of CIF (IUCr, PDB, etc.) to the external > >>>users. Most > >>>? > users will probably have the good sense to simply ignore the demand > and > >>>? > leave the burden just where it is now.? A few sophisticated users > will > >>>? > probably adapt with no trouble, but the punishment for those users > who > >>>? > blindly follow orders before we have a complete multiplatform > supporting > >>>? > infrastructure in place by mandating UTF8 is severe, expensive and > >>>? > undeserved.? Until and unless we have developed solid support, we > will > >>>? just > >>>? > be alienating people from CIF.? I will continue to oppose such a > move. > >>>? > > >>>? > [...] > >>>? > > >>>? > > >>>? > Email Disclaimer: > >>><http://www.stjude.org/emaildiscl > aimer>www.stjude.org/emaildisclaimer > > >>>? > _______________________________________________ > >>>? > cif2-encoding mailing list > >>>? > > >>>cif2-encoding at iucr.org> f2-encoding at iucr.org>cif2-encoding at iucr.org > >>>? > > >>><http://scripts. > iucr.org/mailman/listinfo/cif2-encoding> stinfo/cif2-encoding>http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > >>>? > > >>>? 
> > >>> > >>> > >> > >>_______________________________________________ > >>cif2-encoding mailing list > >>cif2-encoding at iucr.org > >>http://scripts.iu > cr.org/mailman/listinfo/cif2-encoding > > > > > >-- > >===================================================== > >? Herbert J. Bernstein, Professor of Computer Science > >? ? Dowling College, Kramer Science Center, KSC 121 > >? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > > > >? ? ? ? ? ? ? ? ? +1-631-244-3035 > >? ? ? ? ? ? ? ? ? yaya at dowling.edu > >===================================================== > >_______________________________________________ > >cif2-encoding mailing list > >cif2-encoding at iucr.org > >http://scripts.iuc > r.org/mailman/listinfo/cif2-encoding > > > > > >_______________________________________________ > >cif2-encoding mailing list > >cif2-encoding at iucr.org > >http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > -- > ===================================================== > ? Herbert J. Bernstein, Professor of Computer Science > ? ? Dowling College, Kramer Science Center, KSC 121 > ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > > ? ? ? ? ? ? ? ? ? +1-631-244-3035 > ? ? ? ? ? ? ? ? ? yaya at dowling.edu > ===================================================== > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > From yaya at bernstein-plus-sons.com Mon Sep 27 15:12:58 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Mon, 27 Sep 2010 10:12:58 -0400 (EDT) Subject: [Cif2-encoding] Let's all take a deep breath... 
In-Reply-To: References: Message-ID:

Dear James,

Unless you run filter code in your CIF applications that recognizes and reports characters beyond 126 as errors, people can slip in UTF8 or various code page representations of accented characters right now under CIF1, and I seem to recall an earlier message in this discussion from Simon or Brian reporting exactly that problem already hitting the journal workflows. How does the UTF8 in place of ASCII proposal tell users to start using those characters in their journal submissions? Unless we put UTF8 characters in tags in the new dictionaries, the only issue is for data values that are intended to be free-form text, an area in which the control is _not_ at the CIF level but in the advice to authors and the type-setting programs. How does anything for a journal submission change because of the use of UTF8 in place of ASCII if the advice to authors and the type-setting programs remain based on Brian's current elides, except to get better, in that there is a slightly better chance of figuring out what an uncooperative author who doesn't read instructions meant by the strange characters he chose to introduce?

James, we can go back and forth by email this way for years, each thinking the other is not understanding the obvious. It is an unfortunate effect of email discussions. We _need_ to talk face-to-face to resolve this. If we cannot get together for a meeting, how about using Skype? Right now this really is an infinite loop.

To keep this from growing to infinite size, I'll just respond to one last point -- am I proposing that we not use the UTF8 characters in tags? I am _not_ proposing that as part of the CIF2 specification. In order to get from where we are now to a system with reasonable handling of characters above code point 126, we need to have room in the specification to try approaches out.
What I am proposing is that we be careful in the current crop of dictionaries to not introduce such tags yet for IUCr official dictionaries. I am not proposing to restrict values in the specification for the same reason, but I am suggesting that the IUCr and the PDB be careful not to encourage deposition of CIFs with characters with code points above 126 until there is a clear understanding of how to handle them and software to support them. I am trying to separate problems, to make the transition to CIF2 modular, so that it has a good chance of success.

Please consider a Skype meeting. It might not work, but I don't think it can make things any worse than they are right now.

Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

On Mon, 27 Sep 2010, James Hester wrote:

> See interpolated comments.
>
> On Mon, Sep 27, 2010 at 11:36 AM, Herbert J. Bernstein wrote:
> Dear Colleagues,
>
>   As one might expect, I respectfully disagree with almost everything
> James has said, but the really critical point of disagreement is
>
> >(iii) It is extremely misleading to think that simply substituting
> >UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the
> >same results as we had for CIF1. The 'any encoding' clause in the
> >CIF1 standard was essentially irrelevant - encodings used in the
> >overwhelming majority of systems producing CIF1 files coincided with
> >ASCII for CIF text, as I have said many times before, so software had
> >no trouble in turning a stream of CIF bytes from any unknown source
> >into the same text that the CIF writer was working from.
?If I > repeat > >this point endlessly, it is only because the CIF1 approach > continues > >to be invoked like magic fairy dust that will make everything > OK, when > >in fact the magic fairy dust was the dominance of ASCII > encoding for > >ASCII codepoints. ?There is *no such uniformity* in encoding of > >Unicode codepoints. ?We have a new problem for CIF, and > whatever we do > >will have *new* consequences, and that very much includes the > 'as for > >CIF1' proposal. ?So please, enough with the 'CIF1 has served us > well > >for 15 years' line. > > I vigorously disagree on this point. ?If the only change we were to > make in going to CIF2 were to be that we were inserting UTF8 in place > of ASCII there would be absolutely _no_ impact on any existing > CIF application or CIF data file, because for the characters that > are formally legal under CIF1, UTF8 and ASCII are identical encodings. > The relevant portion of the CIF1 syntax specification is: > > "22. Characters within a CIF are restricted to certain printable or > white-space characters. Specifically, these are the ones located in > the ASCII character set at decimal positions 09 (HT or horizontal > tab), 10 (LF or line feed), 13 (CR or carriage return) and the > letters, numerals and punctuation marks at positions 32-126." > > Any existing data file or application that conforms to that > restriction > in _any_ encoding, will be indistinguishable with the "UTF8 in place > of ASCII" > change. For those applications and data file, this is not a change. > > > Up to here, I absolutely agree with you. > > As James implies, the only new problems that arise are in > introducing > characters into CIFs drawn from codepoints 128 ff, but we > already have > that problem under CIF1. ? > > > No we don't have that problem at all in CIF1, because we don't accept > characters from above codepoint 128 in CIF1.? While it is indeed the only > new problem, it is a huge one. > ? 
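
A minimal sketch of the point both correspondents agree on above, not taken from the thread: for text confined to the character set of the quoted paragraph 22, ASCII and UTF8 produce identical bytes, while a literal accented character yields bytes above 126 that differ by encoding. It also shows the kind of filter Herbert mentions for reporting characters beyond 126. The helper name find_non_cif1_bytes and the sample tags and values are illustrative assumptions only.

```python
# Code points allowed by paragraph 22 of the CIF1 syntax specification:
# HT (9), LF (10), CR (13) and the printable range 32-126.
CIF1_ALLOWED = frozenset({9, 10, 13}) | frozenset(range(32, 127))


def find_non_cif1_bytes(data: bytes):
    """Return (offset, byte value) for every byte outside the CIF1 set.

    Hypothetical helper for illustration; it inspects raw bytes and makes
    no attempt to guess which encoding the writer intended.
    """
    return [(i, b) for i, b in enumerate(data) if b not in CIF1_ALLOWED]


# For text restricted to the CIF1 set, ASCII and UTF-8 give the same bytes.
cif1_text = "data_example\n_cell_length_a 5.43\n"
assert cif1_text.encode("ascii") == cif1_text.encode("utf-8")
assert find_non_cif1_bytes(cif1_text.encode("utf-8")) == []

# A literal accented character shows up as bytes above 126, and the bytes
# differ between encodings, which is the new problem under discussion.
accented = "_publ_author_name 'Müller'\n"
utf8_hits = find_non_cif1_bytes(accented.encode("utf-8"))
latin1_hits = find_non_cif1_bytes(accented.encode("latin-1"))
```

Run against incoming files, a filter of this kind would flag exactly the bytes that fall outside the agreed CIF1 range, without deciding what encoding they were meant to be in.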
> The use of "UTF8 in place of ASCII" simply > allows us to coherently consider how to handle those characters > in the > future. ? > > > How does it create this breathing space?? Because as far as I can see, 'as > for CIF1' is allowing any encoding to be used for Unicode codepoints. This > is where I was forced to invoke magic fairy dust, because the 'UTF8 in place > of ASCII' approach advocates a formulation that was largely irrelevant in > CIF1 to solve a problem that was not present in CIF1 in the first place, > with no justification beyond "it worked for CIF1", when in fact it worked > about as well as my most excellent elephant repellent does (look, no > elephants, it must work!).? Sorry for being obtuse, but I don't follow your > logic at all. > ? > If we don't use them in any IUCr-sanctioned dictionary tags > for the moment, we are in no worse shape going forward under my > proposal > with Brian's recommendations added than we are staying with CIF1 > and > ASCII, and, I believe, in much better shape. > > > Is this also part of your proposal - that the IUCr hold off on using > non-ASCII characters in tags (and, I assume, values) until we sort this > out?? > > > This is a serious matter, not appropriate from sarcastic "fairy > dust" > comments. ? > > > You will note that I am taking it extremely seriously (my wife despairs), to > the point that I am forced to write about castles in the sky.? Which is why > I really need to understand the thinking behind your statements and those of > others. > ? > It really is true that "CIF1 has served us well > for 15 years," > > > Yes it has.? This however does not justify adopting the same approach to > encoding.? See above for why. > ? > and we should take our time on the encoding issue > and be certain we are really improving things, not making them > worse by what we propose. 
> > > By adopting Unicode we immediately create big new problems, to which we need > new solutions to even get close to the current situation in CIF1. > ? > ?I agree that we need to discuss and > resolve the encoding issue, but it is not a new problem suddenly > introduced by using UTF8. ? > > > (I assume you mean Unicode, not UTF8).? No, the opposite is true. This is a > new problem arising entirely from adoption of Unicode, because in the CIF1 > ASCII range most encodings are identical.? This is not so for Unicode. > ? > In my opinion, however, a hasty, ill-considered > resolution to that serious problem would be a very bad idea, but > delaying all of CIF2 in order to wait until we work our way out > of > a thicket that has no clear exit yet also seems to me to be a > very bad > idea. ? > > > You are right, although hasty is a bit of an exaggeration for the pointy end > of a several-month-long discussion.? Unfortunately, your proposal does not > in my opinion create breathing room but simply ducks the new problem of > multiple encodings.? If you want breathing room, you allow only > self-identifying encodings now (UTF8 and UTF16) and then work on how you > would allow other encodings.? Something like: "Only self-identifying > encodings are currently supported for use in CIF2, but other encodings may > become available at a later date".? Then you can spend all the time in the > world working on systems for incorporating other encodings. > ? > I expect if we ever manage to meet face to face, or even in > a series of Skype meetings, we could come to closure fairly > quickly, > but as things now stand it seems unlikely that we will have a > chance > to do that before the IUCr meeting. > > > I have bought a web cam and when I spend less time researching encoding > issues I should be able to get set up and running. > > I use CIF as a text format and I use it as a binary format. ?I > use > both DDL1 CIF and DDL2 CIF. 
I am also a cross-platform and cross > version CIF programmer. ?I do not fool myself into thinking CIF1 > to be > perfect. ?It is not. ?But it is a very useful tool, and I would > like > to be certain that what we propose as CIF2 is a least as useful > as > CIF1 and hopefully more so. ?I do not believe that options 3, 4 > or > 5 are far enough along to provide such utility, and by being too > prescriptive at this stage, may well do harm. > > > Why can't we be open-ended, and say that "UTF8 and UTF16 are acceptable in > the whole Unicode range. All other encodings are acceptable in the ASCII > range only.? We are investigating ways of extending the range of > applicability of non UTF encodings".? Given that UTF8 and UTF16 are > definitely encodings that CIF2 will use, and require no extra CIF2 machinery > (hashcodes etc.) to identify them, they are ready for use now.? I am more > than happy for us to continue to investigate ways of allowing a wider or > infinite variety of encodings. > > I urge all concerned to support either options 1 or 2 or both, > so we > have get CIF2 out for the IUCr meeting this coming summer, and > to let the encoding issue take its own time. ?If by some chance > we come up with a solution before summer 2011, so much the > better, > but please don't make the perfect (CIF2 with all issues > including > the encoding issue resolved) the enemy of the good (the CIF2 we > have > now with the encoding issue left open). > > > Both options 3 and 4 can also be used as the basis of a proposal that allows > further work on the encoding issue (see previous paragraph).? Indeed, > proposals based on 3 and 4 allow Unicode code points to be used in a > controlled way and do not encourage proliferation of encodings before we are > able to manage that proliferation. > > I invite you to answer the question at the end of my previous email > (reproduced below).? Note that under proposals 3, 4 and 5 it has a simple > answer. 
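
The open-ended formulation James suggests above (UTF8 and UTF16 over the whole Unicode range, everything else restricted to the ASCII range) could be approximated in a reader roughly as follows. This is a sketch under stated assumptions, not part of any CIF2 draft: it assumes UTF16 files always begin with a byte-order mark, treats a successful strict UTF8 decode as identification, and the function name decode_cif2_bytes is hypothetical.

```python
import codecs


def decode_cif2_bytes(data: bytes) -> str:
    """Decode bytes as UTF-16 (BOM required) or strict UTF-8, else fail.

    Assumption for illustration only: no other encoding is attempted.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the byte-order mark and removes it.
        return data.decode("utf-16")
    try:
        # "utf-8-sig" accepts an optional UTF-8 signature; errors are strict.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise ValueError(
            "not UTF-8 or BOM-marked UTF-16; other encodings are only "
            "usable when the content stays within the ASCII range"
        ) from exc


text = "data_demo\n_publ_author_name 'Müller'\n"
assert decode_cif2_bytes(text.encode("utf-8")) == text
assert decode_cif2_bytes(text.encode("utf-16")) == text

# The same text written in Latin-1 is refused instead of being misread.
try:
    decode_cif2_bytes(text.encode("latin-1"))
except ValueError:
    rejected = True
else:
    rejected = False
```

A file in some other encoding that contains only ASCII-range characters passes through the UTF8 branch unchanged whenever that encoding coincides with ASCII in that range; one that contains, say, a Latin-1 accented letter is normally rejected rather than silently misread.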
> > >Under the 'as for CIF1' proposal, how does my > >program turn these bytes into text in the way that the writer of the > >bytes intended? ?If that is not yet resolved, how can anybody even > >write a CIF2 program? > > all the best, > James. > > Regards, > ? Herbert > > > At 9:46 AM +1000 9/27/10, James Hester wrote: > >Well, I didn't even manage to properly call a vote and everybody has > >piled in, Simon even managed to vote twice (and that's quite OK > Simon, > >we are trying to determine what the will of the group is and so I > >think it only reasonable that if somebody's assessment of the > >situation changes that they can 'update' their vote). ?I am however > >unhappy that both Brian and Simon introduced new concerns and nobody > >has had a chance to comment on how the various proposals under > >consideration might affect those concerns. ?I would therefore like to > >suggest that the voting period continues until the end of this week, > >and that we all endeavour to express any concerns or comments that we > >need to make in a timely fashion. ?I will be commenting on Brian and > >Simon's concerns presently, and also on Herbert's proposal, which I > >have not subjected to my hopefully not too long-winded scrutiny. > > > >None of us should feel steamrolled by a certain artifical urgency > that > >has appeared in the dialogue - while we do need to wrap things up in > a > >timely fashion, it has only been 4 days since I even started > >discussing the vote. > > > >Some initial general comments (I will comment separately on Brian and > >Simon's issues). > > > >(i) We are *not* in an infinite loop. ?The last few months have seen > >several proposals analysed and explored, and it is my perception that > >these discussions have led at least some participants (including > >myself) to a better understanding of the consequences of what they > are > >proposing. 
?So nobody should feel that throwing out a new criticism > of > >an old or new proposal is somehow hindering progress by looping over > >old ground. ?Quite the reverse, it is making progress. ?What *is* > >important is to get your comments into the mix in a timely fashion, > >because time is indeed short. > > > >(ii) It is not correct to assume that we can figure out the encoding > >issues later. ?Maybe we can, but maybe we can't. Once CIF2 files are > >produced and software is distributed, you can't put the genie back in > >the bottle, by which I mean you can't easily change the way that > >distributed software behaves, and how files are interpreted. ?We have > >to therefore be confident that the standard we promulgate does not > >close off an avenue we need for solving encoding issues. > > > >(iii) It is extremely misleading to think that simply substituting > >UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the > >same results as we had for CIF1. ?The 'any encoding' clause in the > >CIF1 standard was essentially irrelevant - encodings used in the > >overwhelming majority of systems producing CIF1 files coincided with > >ASCII for CIF text, as I have said many times before, so software had > >no trouble in turning a stream of CIF bytes from any unknown source > >into the same text that the CIF writer was working from. ?If I repeat > >this point endlessly, it is only because the CIF1 approach continues > >to be invoked like magic fairy dust that will make everything OK, > when > >in fact the magic fairy dust was the dominance of ASCII encoding for > >ASCII codepoints. ?There is *no such uniformity* in encoding of > >Unicode codepoints. ?We have a new problem for CIF, and whatever we > do > >will have *new* consequences, and that very much includes the 'as for > >CIF1' proposal. ?So please, enough with the 'CIF1 has served us well > >for 15 years' line. 
> > > >(iv) The majority are currently in favour of the 'as for CIF1' > >approach, which if nobody changes their vote by the end of the week, > >is what we will be taking to the DDLm group and COMCIFS. ?This means > >we will have a pure text standard, and I mean really pure, because > >there is no predictable link between this beautiful textual castle in > >the sky and the solid ground of bytes on disk. > > > >I am a cross-platform CIF programmer. Looking forward to the halcyon > >'as for CIF1' days that await us, a small question occupies my mind. > >As my program does not operate in that glorious abstract space > >occupied by pure text standards that are most certainly not anybody's > >laughing stock, my program will be forced to (as briefly as possible) > >deal with humble plebiean bytes according to some encoding to obtain > >the exalted CIF text. ? Under the 'as for CIF1' proposal, how does my > >program turn these bytes into text in the way that the writer of the > >bytes intended? ?If that is not yet resolved, how can anybody even > >write a CIF2 program? > > > >-- > >T +61 (02) 9717 9907 > >F +61 (02) 9717 3145 > >M +61 (04) 0249 4148 > >_______________________________________________ > >cif2-encoding mailing list > >cif2-encoding at iucr.org > >http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > -- > ===================================================== > ?Herbert J. Bernstein, Professor of Computer Science > ? ?Dowling College, Kramer Science Center, KSC 121 > ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > > ? ? ? ? ? ? ? ? ?+1-631-244-3035 > ? ? ? ? ? ? ? ? 
?yaya at dowling.edu > ===================================================== > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > > > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 > > From John.Bollinger at STJUDE.ORG Mon Sep 27 15:45:51 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Mon, 27 Sep 2010 09:45:51 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local > <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local > <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yah oo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local> Dear Colleagues, On Monday, September 27, 2010 5:49 AM, Herbert Bernstein wrote: > Under the CIF2 specification with UTF8 in place of ASCII there is >_no_ change in the use of elided ASCII sequences to represent non-ASCII >characters until and unless the IUCr publications office decides that, >for that particular application, they are ready to accept something >new. Absolutely correct. The character elides of CIF1 are among its "common semantic features", which are expressly *not* part of the CIF1 format standard. CIF2 explicitly omits them as well, leaving them in exactly the same place they are now. None of this is at all affected by which encoding option we choose. 
> It is _only_ if you go forward with options 3, 4 or 5 that you >are giving the green light to users to do precisely what you are >concerned about -- using the unicode characters instead instead >in possibly strange admixtures that nobody is ready to process. The only way I can see that being true is if "text file" or "text" is intended to be interpreted, at least in part, as "containing only ASCII characters." Is that your intended meaning, Herb? Otherwise, CIF2's expansion to the full (more or less) Unicode character set opens the door for users to insert literal characters into their (conformant) CIFs in place of or in addition to elides, and none of the alternatives on the table change that. What the various alternatives *do* affect is which byte-sequence representations of those characters will conform to CIF2, under which circumstances. Independent of this particular issue, my greatest problem with options (1) and (2) is the imprecision of describing CIF2 simply as "text". That this served well enough for CIF1 is irrelevant; CIF2's character set lends much more importance and impact to the interpretation of this aspect of the spec. I see two, maybe even three, viable and functionally distinct possible definitions. Would any of the proponents of that wording care to advance a definition of that term as it is intended to be interpreted in a CIF2 context? This is substantially equivalent to James's open question, so no need to answer both. [...] > My apologies to James, who I know is trying to do what he believes >to be right, but I believe James has things backwards -- the "deep >breath" is provided by my proposal -- taking the time to properly engineer >the use of the extra characters UTF8 allows us to discuss clearly, >while James' push for an immediate prescriptive use of UTF8 with >prescriptions that differ drastically from what has been adopted >by all other frameworks (HTML, XML, python, etc.) 
in ways that >are untested and unsupported by most existing software is >the untimely rush to judgement. [...] I doubt any of us could disagree that there is an engineering challenge here, but I have to agree with James that the only viable opportunity to leave this area for further development is to be explicitly restrictive now, ala option (3) or (4). Not even my most preferred option (5) allows sufficient latitude for future extension without potentially invalidating some CIF2 CIFs and programs. Furthermore, I don't think that "all other frameworks" adopt an entirely uniform approach, nor one that is necessarily equivalent to option (1) or (2). For example, Sun's various implementations of the Java compiler seem to use "local" (in my sense of the term) unless the user passes an option to tell it otherwise. XML and XHTML use UTF-8 unless a different encoding is explicitly named in the file, identified via a byte-order mark, or otherwise communicated at a higher level. HTML tends to rely on a higher-level protocol to communicate encoding, but provides a mechanism for communicating it in-line. ALL of the CIF2 options currently on the table share some characteristics with one or more of those. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. 
Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From simonwestrip at btinternet.com Mon Sep 27 16:27:31 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Mon, 27 Sep 2010 15:27:31 +0000 (GMT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References: <162941.37460.qm@web87004.mail.ird.yahoo.com>
 <780727.99055.qm@web87010.mail.ird.yahoo.com>
 <526633.3484.qm@web87004.mail.ird.yahoo.com>
 <613218.81205.qm@web87011.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local>
 <281388.90819.qm@web87012.mail.ird.yahoo.com>
 <463665.7127.qm@web87004.mail.ird.yahoo.com>
 <262880.46378.qm@web87002.mail.ird.yahoo.com>
 <206078.51827.qm@web87010.mail.ird.yahoo.com>
 <476110.27334.qm@web87005.mail.ird.yahoo.com>
 <93847.2110.qm@web87014.mail.ird.yahoo.com>
Message-ID: <223950.86835.qm@web87002.mail.ird.yahoo.com>

I see nothing wrong with a strategy to introduce CIF2 if necessary.

My initial thoughts are that the current 'as for CIF1...' description is not best suited as a base specification on which to build full unicode support, should such a strategy be pursued. However, I will reflect on this along with recent contributions from James and John...

Cheers

Simon

________________________________
From: Herbert J. Bernstein
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Monday, 27 September, 2010 14:45:16
Subject: Re: [Cif2-encoding] How we wrap this up

The problem is that options 3, 4 and 5 specifically prescribe the use of Unicode characters (that is the entire point of those options -- and that is the point in dispute -- whether we should be prescribing UTF8 or using it as we now use ASCII, as a way to be clear what we are talking about as in CIF1) and we simply are not ready to deal with such a requirement yet.
I take the blame for starting this discussion many years ago when I simply asked for just what my motion says, that we start using UTF8 in the same way we had been using ASCII. Unfortunately this discussion has turned into a strong push to focus CIF on that particular encoding, stop using Brian's elides, etc. With the current weak state of software support for CIF and the large investment at the IUCr and at the PDB in current workflows, I think it would be a very disruptive and expensive change to make right now. God and the Devil are in the details.

Note that I am _not_ basing this argument on imgCIF. At this point it appears, unfortunately, that CIF2 and imgCIF will have to diverge. If we have enough face-to-face discussions, perhaps we can bring them together again, as we did in 1998, but that is an even more difficult discussion than the one we need to have on encodings.

What I hope we will do is to go at this in incremental stages:

1. Make the transition from CIF1 to CIF2 using new dictionaries but allowing most data files to remain unchanged, and providing simple algorithmic transformations for the rest, but keeping most of the current semantic extensions that we have in CIF1, focusing our energy on getting the new dictionaries used and making use of dREL;

2. Work on a CIF2.1 that, by creative and well-supported use of Unicode, allows for a well organized transition from Brian's elides to use of Unicode characters;

3. Then, working in that context, whatever it turns out to be, work on having imgCIF make the transition to CIF2 in some reasonably compatible way.

I see how to do item 1 for next summer. I don't see how to do 2 and 3 in that time frame, though I am sure we could make a dent in them if we could meet face to face. Email tends to stiffen too many positions.

Regards,
Herbert

=====================================================
Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Mon, 27 Sep 2010, SIMON WESTRIP wrote: > Dear Herbert > > I do not understand why it is *only* options 3, 4 or 5 that allow users to > start using > unicode characters? > > More generally, are you suggesting that the use of anything but ASCII in a > data value is only allowed if > e.g. the dictionary definition of the data item permits, or even only if the > IUCr says that's OK? > > Fundamentally, I'm starting to infer that the purpose of the 'as for > CIF1...' approach to encoding is > to open the door to full unicode support, but not actually let anyone cross > the threshold? > > > Cheers > > Simon > > ____________________________________________________________________________ > From: Herbert J. Bernstein > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Monday, 27 September, 2010 11:48:49 > Subject: Re: [Cif2-encoding] How we wrap this up > > Dear Simon, > > Under the CIF2 specification with UTF8 in place of ASCII there is > _no_ change in the use of elided ASCII sequences to represent non-ASCII > characters until and unless the IUCr publications office decides that, > for that particular application, they are ready to accept something > new. > > It is _only_ if you go forward with options 3, 4 or 5 that you > are giving the green light to users to do precisely what you are > concerned about -- using the unicode characters instead instead > in possibly strange admixtures that nobody is ready to process. > > Remember, under the CIF2 specification as now written, it is > _not_ part of the CIF2 specification to determine the handling > of the characters in quoted strings other than to ensure that > those string do not contain illegal characters from the point > of view of CIF2. 
Dealing with the validity of particular character sequences in strings users provide is, just as in CIF1, the responsibility of the application (i.e. the IUCr journal flows or the PDB archiving flows).
>
> My apologies to James, who I know is trying to do what he believes to be right, but I believe James has things backwards -- the "deep breath" is provided by my proposal -- taking the time to properly engineer the use of the extra characters UTF8 allows us to discuss clearly, while James' push for an immediate prescriptive use of UTF8 with prescriptions that differ drastically from what has been adopted by all other frameworks (HTML, XML, Python, etc.) in ways that are untested and unsupported by most existing software is the untimely rush to judgement.
>
> I beg you to support options 1 and/or 2 to allow CIF2 to go forward in all other respects while we all take a deep breath and deal with the tricky issue you raised slowly and carefully without the pressure of trying to have CIF2 itself ready for next summer.
>
> Regards,
> Herbert
>
> At 9:34 AM +0000 9/27/10, SIMON WESTRIP wrote:
> >I was not so concerned about invalidating existing CIFs, or even the likelihood that users will continue to write e.g. 'f\'oo' - this is a syntax error in CIF2 that is readily recoverable.
> >
> >Rather there is a large group of CIF1 users that are in the habit of using elided ASCII sequences to represent non-ASCII characters. With CIF2 these users will be able to use the unicode character itself. So we might end up with a mixture of escaped sequences and unicode characters (e.g. a user may have a keyboard shortcut for an accented character that forms part of their name, but might still resort to \a for alpha, under the assumption that \a is still valid because CIF2 is basically the same as CIF1, and, rightly or wrongly, they perceive the eliding mechanism as part of CIF syntax).
> >
> >I think this is an issue where we can't afford to take an 'as for CIF1...' approach, especially as the CIF1 specification isn't entirely satisfactory (e.g. there's an example in the line-folding protocol that uses elides in a file path to make a point, but actually these elides may easily be interpreted as escape sequences), and as the encoding issue is very much concerned with user practice, the large group of users that currently use elided character codes need to be aware of what the situation is in CIF2.
> >
> >I'm not convinced this issue should be left for discussion later; it is relevant when considering how the move beyond ASCII is specified.
> >
> >Cheers
> >
> >Simon
> >
> >
> >From: Herbert J. Bernstein
> >To: Group for discussing encoding and content validation schemes for CIF2
> >Sent: Sunday, 26 September, 2010 23:14:55
> >Subject: Re: [Cif2-encoding] How we wrap this up
> >
> >Dear Simon,
> >
> > The current CIF2 spec, with or without the changes I have suggested to temporarily resolve the encoding issue, is at best vague and confusing on the elide character issue. The interacting issue on which the CIF2 spec is clear is that we are changing the handling of quoted strings so that they end on the first occurrence of the quoting character and leave the handling of elides to the calling application.
> >
> > This will be a problem -- the change from CIF1 in the termination of quoted strings along with the absence of a way of eliding the quotes will invalidate a significant number of existing CIFs without any simple mechanism to recover. Rather than reopen another endless discussion, I would suggest we simply add the python string concatenation character "+" to ensure we can map all current CIF1 files and use Brian's common semantic features for the moment. We can then deal with the full elides discussion at a future date.
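[Editorial illustration, not part of the original message.] The change in quoted-string termination described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: it takes the CIF1 rule to be that a closing quote counts only when followed by whitespace or end of line, and the CIF2 draft rule to be as the message states it (the string ends at the first occurrence of the quoting character). The function names `read_quoted_cif1` and `read_quoted_cif2` are hypothetical and do not come from any CIF library.

```python
def read_quoted_cif1(line, start=0):
    """CIF1-style (assumed rule): the closing quote must be followed
    by whitespace or by the end of the line."""
    quote = line[start]
    i = start + 1
    while i < len(line):
        if line[i] == quote and (i + 1 == len(line) or line[i + 1].isspace()):
            return line[start + 1:i]
        i += 1
    raise ValueError("unterminated quoted string")


def read_quoted_cif2(line, start=0):
    """CIF2-draft style as described in the message: the string ends
    at the first occurrence of the quoting character."""
    quote = line[start]
    end = line.find(quote, start + 1)
    if end == -1:
        raise ValueError("unterminated quoted string")
    return line[start + 1:end]


example = "'O'Brien' more"
print(read_quoted_cif1(example))  # O'Brien
print(read_quoted_cif2(example))  # O
```

Under these assumptions the same CIF1 value is truncated by the CIF2-style reader, which is the kind of invalidation of existing files the message is concerned about.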
> > > > Regards, > > Herbert > > > > > > > > > > > >At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote: > >>Dear all > >> > >>While reviewing my hypothetical 'to do' list for implementing CIF2 > >>in current software, I realized that > >>the issue of current support for elided character codes hasnt really > >>been addressed in the context of CIF2. > >>My 'to do' list contains notes that software could treat them as > >>keyboard shortcuts, and their use could be > >>defined in the dictionary. However, that was based on a distinct > >>difference between CIF1 and CIF2, > >>while the current arguments for 'as for CIF1...' suggest that the > >>distinction between CIF1 and CIF2 > >>should almost be imperceptible. > >> > >>How is this issue to be addressed in the specification? > >> > >>Cheers > >> > >>Simon > >> > >> > >> > >>From: Herbert J. Bernstein > >><yaya at bernstein-plus-sons.com> > >>To: Group for discussing encoding and content validation schemes for > >>CIF2 <cif2-encoding at iucr.org> > >>Sent: Saturday, 25 September, 2010 20:37:46 > >>Subject: Re: [Cif2-encoding] How we wrap this up > >> > >>Thank you for your cooperation. -- Herbert > >> > >>===================================================== > >>Herbert J. Bernstein, Professor of Computer Science > >> Dowling College, Kramer Science Center, KSC 121 > >> Idle Hour Blvd, Oakdale, NY, 11769 > >> > >> +1-631-244-3035 > >> > >>yaya at dowling.edu> u>yaya at dowling.edu > >>===================================================== > >> > >>On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > >> > >>> OK - as promised, I wont pursue the matter :-) > >>> > >>> > >>> > >>>________________________________________________________________________ > ____ > >>> From: Herbert J. 
Bernstein > >>><yaya at bernstein-plus-sons.c > om>yaya at bernstein-plus-sons.com> > >>> To: Group for discussing encoding and content validation schemes for > CIF2 > >>> > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>> Sent: Saturday, 25 September, 2010 19:18:54 > >>> Subject: Re: [Cif2-encoding] How we wrap this up > >>> > >>> Dear Simon, > >>> > >>> Unfortunately, that is likely to take us back into our infinite loop > or > >>> into a diverging spiral. Right now, we would have UTF8 as no > >>>more or less a > >>> default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad > >>>first guess as > >>> the likely default encoding for any given CIF, but not a formal > >>>constraint. > >>> I would suggest we leave the wording in that imprecise state, get CIF2 > out > >>> and accepted and then work further on the encoding issue. > >>> > >>> Regards, > >>> Herbert > >>> > >>> ===================================================== > >>> Herbert J. Bernstein, Professor of Computer Science > >>> Dowling College, Kramer Science Center, KSC 121 > >>> Idle Hour Blvd, Oakdale, NY, 11769 > >>> > >>> +1-631-244-3035 > >>> > >>>yaya at dowling.edu> du>yaya at dowling.edu > >>> ===================================================== > >>> > >>> On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > >>> > >>> > Dear all > >>> > > >>> > In the event that CIF2 adopts the 'any encoding' approach, > >>>would there be > >> > > any objections to > > >> > explicitly defining a default encoding in the specification, to be > >>> defaulted > >>> > to when there were no indications > >>> > to the contrary. At worst this would give CIF2 service > >>>providers an excuse > >>> > to interpret CIFs as e.g. 
UTF8 if they couldnt > >>> > determine the encoding by other means - but such intollerant service > >>> > providers would soon find that their service is > >>> > not successful - while at best this might raise awareness of the > issues > >>> > regarding encoding once non-ASCII is used in > >>> > a CIF. Essentially, it does not require users to change there working > >>> > practices, which is one of the main arguments for > >>> > 'any encoding'. > >>> > > >>> > So, CIF2 would remain 'any encoding', and specifications in > >>>terms of e.g. > >>> > "Herbert's as for CIF1..." > >>> > might only require a single sentence to define the default after > stating > >>> > what the 'preferred' encoding was; > >>> > the proposal might be phrased as "Herbert's as for CIF1..." + > "explicit > >>> > default encoding"? > >>> > > >>> > I do not wish to prolong this debate - if there are objections > >>>I will not > >>> > launch into an endless round of exchanges > >>> > that cover the same ground that has led us this far. > >>> > > >>> > Cheers > >>> > > >>> > Simon > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > >>>>_______________________________________________________________________ > ____ > >>> _ > >>> > From: SIMON WESTRIP > >>><simonwestrip at btinternet.com > >simonwestrip at btinternet.com> > >>> > To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>> > > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>> > Sent: Friday, 24 September, 2010 20:10:13 > >>> > Subject: Re: [Cif2-encoding] How we wrap this up > >>> > > >>> > Dear James > >>> > > >>> > As you may have gathered I have been reconsidering my position on > this > >>> > issue. > >>> > Please forgive me, but I would like to change my vote if that is OK, > in > >>> > favour of the 'any encoding' camp. 
> >>> > This apparent U-turn is not a response to recent > >>>contributions; rather it > >>> is > >>> > the outcome of a meeting I had this morning > >>> > where I demonstrated some new software to the Managing Editor of IUCr > >>> > journals. > >>> > > >>> > By way of explanation: > >>> > > >>> > I have been developing a new docx template which the IUCr > >>>editorial office > >>> > is shortly to release for use by > >>> > authors. The template will be packaged with some tools to extract > data > >>> from > >>> > CIFs > >>> > and tabulate them in the Word document, e.g. open an mmCIF, click a > >>> button, > >>> > and standard > >>> > tables populated with data from the CIF will be included in > >>>the document, > >>> > acting as > >>> > table templates for the author to edit as appropriate for their > >>> manuscript. > >>> > > >>> > Inclusion of the mmCIF tools is part of an unofficial policy to > 'coax' > >>> > biologists to start using/accepting mmCIF > >>> > as a useful medium, rather than as a product of their deposition to > the > >>> PDB, > >>> > and to encourage them to become comfortable > >>> > with passing mmCIFs between applications, and even to edit the > >>>things (in > >>> > the same way as the core-CIF community > >>> > treats CIFs). For example, our perception is that there is no reason > why > >>> an > >>> > author should not feel free to take an mmCIF > >>> > that has been created by e.g. pdb_extract and populate it using > >>> third-party > >>> > software before uploading to the PDB for > >>> > deposition. > >>> > > >>> > This cause would not be furthered by effectively invalidating > >>>an mmCIF if > >>> it > >>> > were not to be encoded in one of > >>> > the specified encodings. 
> >>> > > >>> > So although I am uneasy about a specification that propogates > >>>uncertainty, > >>> > I'm also uneasy about alienating users, > >>> > especially when we are struggling to change their mindset as in the > case > >>> of > >>> > the biological community > >>> > (my perception of the biological community's attitude to mmCIF > >>>is based on > >>> > feedback from authors/coeditors to > >>> > IUCr journals). > >>> > > > >> > Granted this may not be the most compelling argument in favour of > 'any > >>> > encoding', but recognizing the hurdles that > >>> > may have to be overcome once we move beyond ASCII whatever the CIF2 > >>> > specification, I support 'any encoding' > >>> > as 'a means to an end'. > >>> > > >>> > I will not provide my preferences in terms of the numbered options > until > >> > you > >>> > say so; afterall, I have already voted and > >>> > all this has to be signed off by COMCIFs in any case. > >>> > > >>> > Cheers > >>> > > >>> > Simon > >>> > > >>> > > >>> > > >>> > > >>> > >>>>_______________________________________________________________________ > ____ > >>> _ > >>> > From: "Bollinger, John C" > >>><John.Bollinger at STJUDE.ORG> ilto:John.Bollinger at STJUDE.ORG>John.Bollinger at STJUDE.ORG> > >>> > To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>> > > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>> > Sent: Friday, 24 September, 2010 14:50:57 > >>> > Subject: Re: [Cif2-encoding] How we wrap this up > >>> > > >>> > Dear Simon, > >>> > > >>> > It is exactly this sort of issue that drove me to support more > >>>permissive > >>> > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local > >>> proposal. > >>> > > >>> > Do please think about the considerations Herb raised. 
As you > reconsider > >>> > your votes, I urge you also to ask yourself what, *precisely*, a > "text > >>> file" > >>> > is, and to consider whether your answer is functionally > >>>different from my > >>> > "local". If you decide not, then please consider what that > >>>answer implies > >>> > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) > under > >>> > each option on the table, especially for CIFs containing non-ASCII > >>> > characters. Whatever you decide about the meaning of "text > >>>file", please > >>> > consider whether reasonable people might reach a different > >>>conclusion, as > >>> I > >>> > assert they might do, and to what extent the standard needs to > address > >>> that. > >>> > > >>> > > >>> > Regards, > >>> > > >>> > John > >>> > -- > >>> > John C. Bollinger, Ph.D. > >>> > Department of Structural Biology > >>> > St. Jude Children's Research Hospital > >>> > > >>> > > >>> > >From: > >>>cif2-encoding-bounces at iuc > r.org>cif2-encoding-bounces at iucr.org > > >>> > > >>>[mailto:cif2-encoding-bou > nces at iucr.org>cif2-encoding-bounces@ > iucr.org] > >>>On Behalf Of SIMON WESTRIP > >>> > >Sent: Friday, September 24, 2010 7:53 AM > >>> > >To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>> > >Subject: Re: [Cif2-encoding] How we wrap this up. . > >>> > > > >>> > >Dear Herbert > >>> > > > >>> > >Not for the first time, I find your arguement persuasive. Brian's > vote > >>> and > >>> > explanation have also raised some > >>> > >questions that I would like to look into. > >>> > > > >>> > >I will confirm or otherwise my vote as soon as possible, > >>>assuming that is > >>> > OK with James and assuming that > >>> > >this round of votes might wrap this up. > >>> > > > >>> > >Cheers > >>> > > > >>> > >Simon > >>> > > > >>> > >________________________________________ > >>> > >From: Herbert J. 
Bernstein > >>><yaya at bernstein-plus-sons.c > om>yaya at bernstein-plus-sons.com> > >>> > >To: Group for discussing encoding and content validation > >>>schemes for CIF2 > >>> > > >>><cif2-encoding at iucr.org> if2-encoding at iucr.org>cif2-encoding at iucr.org> > >>> > >Sent: Friday, 24 September, 2010 13:17:14 > >>> > >Subject: Re: [Cif2-encoding] How we wrap this up > >>> > > > >>> > >If he ignores the standard, in most cases all he has to do to > >>>comply with > >>> > CIF2 is to run whatever applications he currently runs to produce > CIF1 > >>> and, > >>> > perhaps, in some cases, run a minor edit pass at the end, to convert > for > >>> the > >>> > minor syntactive differences and/or changed tags required to comply > with > >>> > CIF2 and the new dictionaries, but he is unlikely to have to do > anything > > >> to > >>> > deal with the messy business of whether his encoding is really a > proper > >>> UTF8 > >>> > encoding or not. > >>> > > >>> > >The punishment if he tries to comply, is that he has to totally > uproot > >>> and > >>> > reconfigure the environment in which he produces CIFs from > >>>whatever he is > >>> > currently doing to create an enviroment in which he can reliably > create > >>> and, > >>> > more importantly, transmit compliant UTF8 files. This can be > >>>very tricky > >>> if > >>> > he does only a partial job, say fudging in one special > >>>application (yet to > >>> > be written), because if he stays with his old system, all kinds of > tools > >>> > will keep trying to transcode whatever he has produced back to > whatever > >>> his > >>> > system considers a standard. Those of us who have files, > >>>applications and > >>> > tools that have lived through several generations of macs are > >>>living proof > >>> > of the problem. 
Macs now have excellent UTF8/16 unicode > >>>support, but every > >> > > once in a while in working with a unicode file I find it has been > >>> strangely > >>> > and unexpectedly converted to something else, and it can be > >>>really tricky > >>> to > >>> > spot when the unaccented roman text part has been left > >>>untouched but just > >>> a > >>> > few accen > >>> > ted letters have gotten different accents. > >>> > > >>> > >Mandating UTF8 is simply trying to shift a serious software > >>>problem from > >>> > the central handlers of CIF (IUCr, PDB, etc.) to the external > >>>users. Most > >>> > users will probably have the good sense to simply ignore the demand > and > >>> > leave the burden just where it is now. A few sophisticated users > will > >>> > probably adapt with no trouble, but the punishment for those users > who > >>> > blindly follow orders before we have a complete multiplatform > supporting > >>> > infrastructure in place by mandating UTF8 is severe, expensive and > >>> > undeserved. Until and unless we have developed solid support, we > will > >>> just > >>> > be alienating people from CIF. I will continue to oppose such a > move. > >>> > > >>> > [...] > >>> > > >>> > > >>> > Email Disclaimer: > >>><http://www.stjude.org/emaildiscl > aimer>www.stjude.org/emaildisclaimer > > >>> > _______________________________________________ > >>> > cif2-encoding mailing list > >>> > > >>>cif2-encoding at iucr.org> f2-encoding at iucr.org>cif2-encoding at iucr.org > >>> > > >>><http://scripts. > iucr.org/mailman/listinfo/cif2-encoding> stinfo/cif2-encoding>http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > >>> > > >>> > > >>> > >>> > >> > >>_______________________________________________ > >>cif2-encoding mailing list > >>cif2-encoding at iucr.org > >>http://scripts.iu > cr.org/mailman/listinfo/cif2-encoding > > > > > >-- > >===================================================== > > Herbert J. 
Bernstein, Professor of Computer Science
> > Dowling College, Kramer Science Center, KSC 121
> > Idle Hour Blvd, Oakdale, NY, 11769
> >
> > +1-631-244-3035
> > yaya at dowling.edu
> >=====================================================
> >_______________________________________________
> >cif2-encoding mailing list
> >cif2-encoding at iucr.org
> >http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>
> --
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya at dowling.edu
> =====================================================

From yaya at bernstein-plus-sons.com Mon Sep 27 17:23:32 2010
From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein)
Date: Mon, 27 Sep 2010 12:23:32 -0400
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <223950.86835.qm@web87002.mail.ird.yahoo.com>
References: <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <93847.2110.qm@web87014.mail.ird.yahoo.com> <223950.86835.qm@web87002.mail.ird.yahoo.com>
Message-ID:

Dear Simon,

We do not seem to be communicating effectively. Do you have a Skype account? We really need a meeting.

Regards,
Herbert

At 3:27 PM +0000 9/27/10, SIMON WESTRIP wrote:
>I see nothing wrong with a strategy to introduce CIF2 if necessary. My initial thoughts are that the current 'as for CIF1...' description is not best suited as a base specification on which to build full unicode support, should such a strategy be pursued.
>
>However, I will reflect on this along with recent contributions from James and John...
>
>Cheers
>
>Simon
>
>
>From: Herbert J. Bernstein
>To: Group for discussing encoding and content validation schemes for CIF2
>Sent: Monday, 27 September, 2010 14:45:16
>Subject: Re: [Cif2-encoding] How we wrap this up
>
>The problem is that options 3, 4 and 5 specifically prescribe the use of Unicode characters (that is the entire point of those options -- and that is the point in dispute -- whether we should be prescribing UTF8 or using it as we now use ASCII, as a way to be clear what we are talking about as in CIF1) and we simply are not ready to deal with such a requirement yet.
>
>I take the blame for starting this discussion many years ago when I simply asked for just what my motion says, that we start using UTF8 in the same way we had been using ASCII. Unfortunately this discussion has turned into a strong push to focus CIF on that particular encoding, stop using Brian's elides, etc. With the current weak state of software support for CIF and the large investment at the IUCr and at the PDB in current workflows, I think it would be a very disruptive and expensive change to make right now. God and the Devil are in the details.
>
>Note that I am _not_ basing this argument on imgCIF. At this point it appears, unfortunately, that CIF2 and imgCIF will have to diverge. If we have enough face-to-face discussions, perhaps we can bring them together again, as we did in 1998, but that is an even more difficult discussion than the one we need to have on encodings. What I hope we will do is to go at this in incremental stages:
>
>1. Make the transition from CIF1 to CIF2 using new dictionaries but allowing most data files to remain unchanged, and providing simple algorithmic transformations for the rest, but keeping most of the current semantic extensions that we have in CIF1, focusing our energy on getting the new dictionaries used and making use of dREL;
>
>2. Work on a CIF2.1 that, by creative and well-supported use of Unicode, allows for a well organized transition from Brian's elides to use of Unicode characters;
>
>3. Then working in that context, whatever it turns out to be, work on having imgCIF make the transition to CIF2 in some reasonably compatible way.
>
>I see how to do item 1 for next summer. I don't see how to do 2 and 3 in that time frame, though I am sure we could make a dent in them if we could meet face to face. Email tends to stiffen too many positions.
>
>Regards,
> Herbert
>
>=====================================================
>Herbert J.
Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu >===================================================== > >On Mon, 27 Sep 2010, SIMON WESTRIP wrote: > >> Dear Herbert >> >> I do not understand why it is *only* options 3, 4 or 5 that allow users to >> start using >> unicode characters? >> >> More generally, are you suggesting that the use of anything but ASCII in a >> data value is only allowed if >> e.g. the dictionary definition of the data item permits, or even only if the >> IUCr says that's OK? >> >> Fundamentally, I'm starting to infer that the purpose of the 'as for > > CIF1...' approach to encoding is >> to open the door to full unicode support, but not actually let anyone cross >> the threshold? >> >> >> Cheers >> >> Simon >> >> ____________________________________________________________________________ >> From: Herbert J. Bernstein >><yaya at bernstein-plus-sons.com> >> To: Group for discussing encoding and content validation schemes for CIF2 >> <cif2-encoding at iucr.org> >> Sent: Monday, 27 September, 2010 11:48:49 >> Subject: Re: [Cif2-encoding] How we wrap this up >> >> Dear Simon, >> >> Under the CIF2 specification with UTF8 in place of ASCII there is >> _no_ change in the use of elided ASCII sequences to represent non-ASCII >> characters until and unless the IUCr publications office decides that, >> for that particular application, they are ready to accept something >> new. >> >> It is _only_ if you go forward with options 3, 4 or 5 that you >> are giving the green light to users to do precisely what you are >> concerned about -- using the unicode characters instead instead >> in possibly strange admixtures that nobody is ready to process. 
>> >> Remember, under the CIF2 specification as now written, it is >> _not_ part of the CIF2 specification to determine the handling >> of the characters in quoted strings other than to ensure that >> those string do not contain illegal characters from the point >> of view of CIF2. Dealing with the validity of particular character >> sequences in strings users provide is, just as in CIF1, the >> responsibility of the application (i.e. the IUCr journal flows >> or the PDB archiving flows). >> >> My apologies to James, who I know is trying to do what he believes >> to be right, but I believe James has things backwards -- the "deep >> breath" is provided by my proposal -- taking the time to properly engineer >> the use of the extra characters UTF8 allows us to discuss clearly, >> while James' push for an immediate prescriptive use of UTF8 with >> prescriptions that differ drastically from what has been adopted >> by all other frameworks (HTML, XML, python, etc.) in ways that >> are untested and unsupported by most existing software is >> the untimely rush to judgement. >> >> I beg you to support options 1 and/or 2 to allow CIF2 to go forward >> in all other respects while we all take a deep breath and deal >> with the tricky issue you raised slowly and carefully without the >> pressure of trying to have CIF2 itself ready for next summer. >> >> Regards, >> Herbert >> >> At 9:34 AM +0000 9/27/10, SIMON WESTRIP wrote: >> >I was not so concerned about invalidating existing CIFs, or even the >> >likelihood >> >that users will continue to write e.g. 'f\'oo' - this is a syntax >> >error in CIF2 that is readily recoverable. >> > >> >Rather there is a large group of CIF1 users that are in the habit of >> >using elided ASCII sequences to >> >represent non-ASCII characters. With CIF2 these users will be able >> >to use the unicode character itself. >> >So we might end up with a mixture of esacaped sequences and unicode >> >characters (e.g. 
a user may have a keyboard shortcut >> >for an accented character that forms part of their name, but might >> >still resort to \a for alpha, under the assumption that \a is still >> >valid because CIF2 is basically the same as CIF1, and, rightly or >> >wrongly, they perceive the eliding machanism as part of >> >CIF syntax. >> > >> >I think this is an issue where we can't afford to take an 'as for >> >CIF1...' approach, especially as the CIF1 specification >> >isn't entirely satisfactory (e.g. there's an example in the >> >line-folding protocal that uses elides in a file path to make a >> >point, >> >but actually these elides may easily be interpretted as escape >> >sequences), and as the encoding issue is very much concerned with >> >user practice, the large group of users that currently use elided >> >character codes need to be aware what the situation is in >> >CIF2? >> > >> >I'm not convinced this issue should be left for discussion later; >> >it is relevant when considering how the move beyond ASCII is specified. >> > >> >Cheers > > > >> >Simon >> > >> > >> > >> > >> >From: Herbert J. Bernstein >><yaya at bernstein-plus-sons.com> >> >To: Group for discussing encoding and content validation schemes for >> >CIF2 <cif2-encoding at iucr.org> >> >Sent: Sunday, 26 September, 2010 23:14:55 >> >Subject: Re: [Cif2-encoding] How we wrap this up >> > >> >Dear Simon, >> > >> > The current CIF2 spec, with or without the changes I have suggested >> >to temporarily resolve the encoding issue is at best vague and >> >confusing on the elide character issue. The interacting issue on >> >which the CIF2 spec >> >is clear is that we are changing the handling of quoted strings so >> >that they end on the first occurrence of the quoting character and leaves >> >the handling of elides to the calling application. 
>> > >> > This will be a problem -- the change from CIF1 in the termination of >> >quoted strings along with the absence of a way of eliding the quotes >> >will invalidate a significant number of existing CIFS without any simple >> >mechanism to recover. Rather than reopen another endless discussion, >> >I would suggest we simply add the python string concatenation character >> >"+" to ensure we can map all current CIF1 files and use Brian's common >> >semantic features for the moment. We can then deal with the full elides >> >discussion at a future date. >> > >> > Regards, >> > Herbert >> > >> > >> > >> > >> > >> >At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote: >> >>Dear all >> >> >> >>While reviewing my hypothetical 'to do' list for implementing CIF2 >> >>in current software, I realized that >> >>the issue of current support for elided character codes hasnt really >> >>been addressed in the context of CIF2. >> >>My 'to do' list contains notes that software could treat them as >> >>keyboard shortcuts, and their use could be >> >>defined in the dictionary. However, that was based on a distinct >> >>difference between CIF1 and CIF2, >> >>while the current arguments for 'as for CIF1...' suggest that the >> >>distinction between CIF1 and CIF2 >> >>should almost be imperceptible. >> >> >> >>How is this issue to be addressed in the specification? >> >> >> >>Cheers >> >> >> >>Simon >> >> >> >> >> >> >> >>From: Herbert J. Bernstein >> >><yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.com> >> >>To: Group for discussing encoding and content validation schemes for >> >>CIF2 >><cif2-encoding at iucr.org>cif2-encoding at iucr.org> >> >>Sent: Saturday, 25 September, 2010 20:37:46 >> >>Subject: Re: [Cif2-encoding] How we wrap this up >> >> >> >>Thank you for your cooperation. -- Herbert >> >> >> >>===================================================== >> >>Herbert J. 
Bernstein, Professor of Computer Science >> >> Dowling College, Kramer Science Center, KSC 121 >> >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> >> >> +1-631-244-3035 >> >> >> >>yaya at dowling.edu>yaya at dowling.edu>yaya at dowling.ed >> u>yaya at dowling.edu >> >>===================================================== >> >> >> >>On Sat, 25 Sep 2010, SIMON WESTRIP wrote: >> >> >> >>> OK - as promised, I wont pursue the matter :-) >> >>> >> >>> >> >>> >> >>>________________________________________________________________________ >> ____ >> >>> From: Herbert J. Bernstein >> >>><yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.c >> >>om>yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.com> >> >>> To: Group for discussing encoding and content validation schemes for >> CIF2 >> >>> >> >>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> > > >>> Sent: Saturday, 25 September, 2010 19:18:54 >> >>> Subject: Re: [Cif2-encoding] How we wrap this up >> >>> >> >>> Dear Simon, >> >>> >> >>> Unfortunately, that is likely to take us back into our infinite loop >> or >> >>> into a diverging spiral. Right now, we would have UTF8 as no >> >>>more or less a >> >>> default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad >> >>>first guess as >> >>> the likely default encoding for any given CIF, but not a formal >> >>>constraint. >> >>> I would suggest we leave the wording in that imprecise state, get CIF2 >> out >> >>> and accepted and then work further on the encoding issue. >> >>> >> >>> Regards, >> >>> Herbert >> >>> >> >>> ===================================================== >> >>> Herbert J. 
Bernstein, Professor of Computer Science >> >>> Dowling College, Kramer Science Center, KSC 121 >> >>> Idle Hour Blvd, Oakdale, NY, 11769 >> >>> >> >>> +1-631-244-3035 >> >>> >> >>>yaya at dowling.edu>yaya at dowling.edu>yaya at dowling.e >> du>yaya at dowling.edu >> >>> ===================================================== >> >>> >> >>> On Sat, 25 Sep 2010, SIMON WESTRIP wrote: >> >>> >> >>> > Dear all >> >>> > >> >>> > In the event that CIF2 adopts the 'any encoding' approach, >> >>>would there be >> >> > > any objections to >> > >> > explicitly defining a default encoding in the specification, to be >> >>> defaulted >> >>> > to when there were no indications >> >>> > to the contrary. At worst this would give CIF2 service >> >>>providers an excuse >> >>> > to interpret CIFs as e.g. UTF8 if they couldnt >> >>> > determine the encoding by other means - but such intollerant service >> >>> > providers would soon find that their service is >> >>> > not successful - while at best this might raise awareness of the >> issues >> >>> > regarding encoding once non-ASCII is used in >> >>> > a CIF. Essentially, it does not require users to change there working >> >>> > practices, which is one of the main arguments for >> >>> > 'any encoding'. >> >>> > >> >>> > So, CIF2 would remain 'any encoding', and specifications in >> >>>terms of e.g. >> >>> > "Herbert's as for CIF1..." >> >>> > might only require a single sentence to define the default after >> stating >> >>> > what the 'preferred' encoding was; >> >>> > the proposal might be phrased as "Herbert's as for CIF1..." + >> "explicit >> >>> > default encoding"? >> >>> > >> >>> > I do not wish to prolong this debate - if there are objections >> >>>I will not >> >>> > launch into an endless round of exchanges >> >>> > that cover the same ground that has led us this far. 
>> >>> > >> >>> > Cheers >> >>> > >> >>> > Simon >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> >> >>>>_______________________________________________________________________ >> ____ >> >>> _ >> >>> > From: SIMON WESTRIP >> >>><simonwestrip at btinternet.com>simonwestrip at btinternet.com >> >simonwestrip at btinternet.com>simonwestrip at btinternet.com> >> >>> > To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >> >>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> >> >>> > Sent: Friday, 24 September, 2010 20:10:13 >> >>> > Subject: Re: [Cif2-encoding] How we wrap this up >> >>> > >> >>> > Dear James >> >>> > >> >>> > As you may have gathered I have been reconsidering my position on >> this >> >>> > issue. >> >>> > Please forgive me, but I would like to change my vote if that is OK, >> in >> >>> > favour of the 'any encoding' camp. >> >>> > This apparent U-turn is not a response to recent >> >>>contributions; rather it >> >>> is >> >>> > the outcome of a meeting I had this morning > > >>> > where I demonstrated some new software to the Managing >Editor of IUCr >> >>> > journals. >> >>> > >> >>> > By way of explanation: >> >>> > >> >>> > I have been developing a new docx template which the IUCr >> >>>editorial office >> >>> > is shortly to release for use by >> >>> > authors. The template will be packaged with some tools to extract >> data >> >>> from >> >>> > CIFs >> >>> > and tabulate them in the Word document, e.g. open an mmCIF, click a >> >>> button, >> >>> > and standard >> >>> > tables populated with data from the CIF will be included in >> >>>the document, >> >>> > acting as >> >>> > table templates for the author to edit as appropriate for their >> >>> manuscript. 
>> >>> > >> >>> > Inclusion of the mmCIF tools is part of an unofficial policy to >> 'coax' >> >>> > biologists to start using/accepting mmCIF >> >>> > as a useful medium, rather than as a product of their deposition to >> the >> >>> PDB, >> >>> > and to encourage them to become comfortable >> >>> > with passing mmCIFs between applications, and even to edit the >> >>>things (in >> >>> > the same way as the core-CIF community >> >>> > treats CIFs). For example, our perception is that there is no reason >> why >> >>> an >> >>> > author should not feel free to take an mmCIF >> >>> > that has been created by e.g. pdb_extract and populate it using >> >>> third-party >> >>> > software before uploading to the PDB for >> >>> > deposition. >> >>> > >> >>> > This cause would not be furthered by effectively invalidating >> >>>an mmCIF if >> >>> it >> >>> > were not to be encoded in one of >> >>> > the specified encodings. >> >>> > >> >>> > So although I am uneasy about a specification that propogates >> >>>uncertainty, >> >>> > I'm also uneasy about alienating users, >> >>> > especially when we are struggling to change their mindset as in the >> case >> >>> of >> >>> > the biological community >> >>> > (my perception of the biological community's attitude to mmCIF >> >>>is based on >> >>> > feedback from authors/coeditors to >> >>> > IUCr journals). >> >>> > >> > >> > Granted this may not be the most compelling argument in favour of >> 'any >> >>> > encoding', but recognizing the hurdles that >> >>> > may have to be overcome once we move beyond ASCII whatever the CIF2 >> >>> > specification, I support 'any encoding' >> >>> > as 'a means to an end'. >> >>> > >> >>> > I will not provide my preferences in terms of the numbered options >> until >> >> > you >> >>> > say so; afterall, I have already voted and >> >>> > all this has to be signed off by COMCIFs in any case. 
>> >>> > >> >>> > Cheers >> >>> > >> >>> > Simon >> >>> > >> >>> > >> >>> > >> >>> > >> >>> >> >>>>_______________________________________________________________________ >> ____ >> >>> _ >> >>> > From: "Bollinger, John C" >> >>><John.Bollinger at STJUDE.ORG>John.Bollinger at STJUDE.ORG>> >>ilto:John.Bollinger at STJUDE.ORG>John.Bollinger at STJUDE.ORG> >> >>> > To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >> >>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> >> >>> > Sent: Friday, 24 September, 2010 14:50:57 >> >>> > Subject: Re: [Cif2-encoding] How we wrap this up >> >>> > >> >>> > Dear Simon, >> >>> > >> >>> > It is exactly this sort of issue that drove me to support more >> >>>permissive >> >>> > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local >> >>> proposal. >> >>> > >> >>> > Do please think about the considerations Herb raised. As you >> reconsider >> >>> > your votes, I urge you also to ask yourself what, *precisely*, a >> "text >> >>> file" >> >>> > is, and to consider whether your answer is functionally >> >>>different from my >> >>> > "local". If you decide not, then please consider what that >> >>>answer implies > > >>> > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) >> under >> >>> > each option on the table, especially for CIFs containing non-ASCII >> >>> > characters. Whatever you decide about the meaning of "text >> >>>file", please >> >>> > consider whether reasonable people might reach a different >> >>>conclusion, as >> >>> I >> >>> > assert they might do, and to what extent the standard needs to >> address >> >>> that. >> >>> > >> >>> > >> >>> > Regards, >> >>> > >> >>> > John >> >>> > -- >> >>> > John C. Bollinger, Ph.D. >> >>> > Department of Structural Biology >> >>> > St. 
Jude Children's Research Hospital >> >>> > >> >>> > >> >>> > >From: >> >>>cif2-encoding-bounces at iucr.org>cif2-encoding-bounces at iuc >> >>r.org>cif2-encoding-bounces at iucr.org>cif2-encoding-bounces at iucr.org >> >> >>> > >> >>>[mailto:cif2-encoding-bounces at iucr.org>cif2-encoding-bou >> >>nces at iucr.org>cif2-encoding-bounces at iucr.org>cif2-encoding-bounces@ >> iucr.org] >> >>>On Behalf Of SIMON WESTRIP >> >>> > >Sent: Friday, September 24, 2010 7:53 AM >> >>> > >To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >Subject: Re: [Cif2-encoding] How we wrap this up. . >> >>> > > >> >>> > >Dear Herbert >> >>> > > >> >>> > >Not for the first time, I find your arguement persuasive. Brian's >> vote >> >>> and >> >>> > explanation have also raised some >> >>> > >questions that I would like to look into. >> >>> > > >> >>> > >I will confirm or otherwise my vote as soon as possible, >> >>>assuming that is >> >>> > OK with James and assuming that >> >>> > >this round of votes might wrap this up. >> >>> > > >> >>> > >Cheers >> >>> > > >> >>> > >Simon >> >>> > > >> >>> > >________________________________________ >> >>> > >From: Herbert J. 
Bernstein >> >>><yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.c >> >>om>yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.com> >> >>> > >To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >> >>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> >> >>> > >Sent: Friday, 24 September, 2010 13:17:14 >> >>> > >Subject: Re: [Cif2-encoding] How we wrap this up >> >>> > > >> >>> > >If he ignores the standard, in most cases all he has to do to >> >>>comply with >> >>> > CIF2 is to run whatever applications he currently runs to produce >> CIF1 >> >>> and, >> >>> > perhaps, in some cases, run a minor edit pass at the end, to convert >> for >> >>> the >> >>> > minor syntactive differences and/or changed tags required to comply >> with >> >>> > CIF2 and the new dictionaries, but he is unlikely to have to do >> anything >> > >> to >> >>> > deal with the messy business of whether his encoding is really a >> proper >> >>> UTF8 >> >>> > encoding or not. >> >>> > >> >>> > >The punishment if he tries to comply, is that he has to totally >> uproot >> >>> and >> >>> > reconfigure the environment in which he produces CIFs from >> >>>whatever he is >> >>> > currently doing to create an enviroment in which he can reliably >> create >> >>> and, >> >>> > more importantly, transmit compliant UTF8 files. This can be >> >>>very tricky >> >>> if >> >>> > he does only a partial job, say fudging in one special >> >>>application (yet to >> >>> > be written), because if he stays with his old system, all kinds of >> tools >> >>> > will keep trying to transcode whatever he has produced back to >> whatever >> >>> his >> >>> > system considers a standard. Those of us who have files, >> >>>applications and > > >>> > tools that have lived through several generations of macs are >> >>>living proof >> >>> > of the problem. 
Macs now have excellent UTF8/16 unicode >> >>>support, but every >> >> > > once in a while in working with a unicode file I find it has been >> >>> strangely >> >>> > and unexpectedly converted to something else, and it can be >> >>>really tricky >> >>> to >> >>> > spot when the unaccented roman text part has been left >> >>>untouched but just >> >>> a >> >>> > few accen >> >>> > ted letters have gotten different accents. >> >>> > >> >>> > >Mandating UTF8 is simply trying to shift a serious software >> >>>problem from >> >>> > the central handlers of CIF (IUCr, PDB, etc.) to the external >> >>>users. Most >> >>> > users will probably have the good sense to simply ignore the demand >> and >> >>> > leave the burden just where it is now. A few sophisticated users >> will >> >>> > probably adapt with no trouble, but the punishment for those users >> who >> >>> > blindly follow orders before we have a complete multiplatform >> supporting >> >>> > infrastructure in place by mandating UTF8 is severe, expensive and >> >>> > undeserved. Until and unless we have developed solid support, we >> will >> >>> just >> >>> > be alienating people from CIF. I will continue to oppose such a >> move. >> >>> > >> >>> > [...] >> >>> > >> >>> > >> >>> > Email Disclaimer: >> >>><<http://www.stjude.org/emaildisclaimer>http://www.stjude.org/emaildiscl >> >>aimer><http://www.stjude.org/emaildisclaimer>www.stjude.org/emaildisclaimer >> >> >>> > _______________________________________________ >> >>> > cif2-encoding mailing list >> >>> > >> >>>cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >>f2-encoding at iucr.org>cif2-encoding at iucr.org >> >>> > >> >>><<http://scripts.iucr.org/mailman/listinfo/cif2-encoding>http://scripts. 
>> >>iucr.org/mailman/listinfo/cif2-encoding><http://scripts.iucr.org/mailman/li >> >>stinfo/cif2-encoding>http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> >>> > >> >>> > >> >>> >> >>> >> >> >> >>_______________________________________________ >> >>cif2-encoding mailing list >> >>cif2-encoding at iucr.org>cif2-encoding at iucr.org >> >><http://scripts.iucr.org/mailman/listinfo/cif2-encoding>http://scripts.iu >> cr.org/mailman/listinfo/cif2-encoding >> > >> > >> >-- >> >===================================================== >> > Herbert J. Bernstein, Professor of Computer Science >> > Dowling College, Kramer Science Center, KSC 121 >> > Idle Hour Blvd, Oakdale, NY, 11769 >> > >> > +1-631-244-3035 >> > >>yaya at dowling.edu>yaya at dowling.edu >> >===================================================== >> >_______________________________________________ >> >cif2-encoding mailing list >> >cif2-encoding at iucr.org>cif2-encoding at iucr.org >> ><http://scripts.iucr.org/mailman/listinfo/cif2-encoding>http://scripts.iuc >> r.org/mailman/listinfo/cif2-encoding >> > >> > >> >_______________________________________________ >> >cif2-encoding mailing list >> >cif2-encoding at iucr.org >> >http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> >> -- >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 > > Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> _______________________________________________ >> cif2-encoding mailing list >> cif2-encoding at iucr.org >> >>http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> > >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. 
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From yaya at bernstein-plus-sons.com Mon Sep 27 18:06:58 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Mon, 27 Sep 2010 13:06:58 -0400
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local>
References: <63870.31508.qm@web87006.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local>
 <80062.82001.qm@web87012.mail.ird.yahoo.com>
 <162941.37460.qm@web87004.mail.ird.yahoo.com>
 <780727.99055.qm@web87010.mail.ird.yahoo.com>
 <526633.3484.qm@web87004.mail.ird.yahoo.com>
 <613218.81205.qm@web87011.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local>
 <281388.90819.qm@web87012.mail.ird.yahoo.com>
 <463665.7127.qm@web87004.mail.ird.yahoo.com>
 <262880.46378.qm@web87002.mail.ird.yahoo.com>
 <206078.51827.qm@web87010.mail.ird.yahoo.com>
 <476110.27334.qm@web87005.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

>The only way I can see that being true is if "text file" or "text"
>is intended to be interpreted, at least in part, as "containing only
>ASCII characters." Is that your intended meaning, Herb? Otherwise,
>CIF2's expansion to the full (more or less) Unicode character set
>opens the door for users to insert literal characters into their
>(conformant) CIFs in place of or in addition to elides, and none of
>the alternatives on the table change that. What the various
>alternatives *do* affect is which byte-sequence representations of
>those characters will conform to CIF2, under which circumstances.

Ah, now I begin to understand the difference in our view.
I view CIF for journal use and PDB deposition as having a controlled vocabulary, via combinations of dictionaries, advice to authors, deposition standards, etc. You seem to view CIF as allowing completely arbitrary, uncontrolled text. That may hold in some contexts, e.g. in private use inside laboratories as part of a lab notebook or experiment data harvest, but in those contexts people have been stretching CIF for years, and the change from ASCII to UTF8 is no change at all.

Please note that proposals 1 and 2 do _not_ affect "which byte-sequence representations of those characters will conform to CIF2, under which circumstances" because they are not rigidly prescriptive about any particular byte sequences. It is only when we get to 3, 4 and 5 that we are trying to rigidly prescribe what users may write. Until and unless we have made certain that those prescriptions indeed do fit with the workflows of the IUCr and the PDB and are properly supported with software, I think it to be premature to promulgate those prescriptions.

Please look at your own words:

"that the only viable opportunity to leave this area for further development is to be explicitly restrictive now"

To me, that is precisely backwards. We have an existing system called CIF. CIF2 was promised to the community as something that users could work with _without_ having to make major changes in what they do, i.e. as a revision and extension of CIF1. We are now on the verge of suddenly telling them -- no, we were just kidding, we want you to change all your editors and software to conform to this new "explicitly restrictive" model of CIF, but we don't have the editors and software ready. We're just telling you what to do, but are going to leave you up the creek without a paddle for a year or two while we figure out what we meant. That does not leave CIF2 open for further development, it leaves it open to be completely replaced by something else.

Modern, conservative software engineering practice for incremental development is to completely specify the user externals of the change you are making and to understand all its interactions with the existing system before you put it in place. The time to take your deep breath is _before_ making some maximally disruptive restrictive change. Proposals 1 and 2 leave what is working in the existing system in place until we have done our job on the encoding issue properly.

This is really getting out of hand. We need a meeting. If everyone will send me their Skype IDs, I will volunteer to set up a Skype conference call at some time that works for everybody (which I suspect will be 4 am EDT). My guess is that 1-2 hours of polite discussion will resolve this. What do we have to lose?

Regards,
Herbert

At 9:45 AM -0500 9/27/10, Bollinger, John C wrote:
>Dear Colleagues,
>
>On Monday, September 27, 2010 5:49 AM, Herbert Bernstein wrote:
>> Under the CIF2 specification with UTF8 in place of ASCII there is
>>_no_ change in the use of elided ASCII sequences to represent non-ASCII
>>characters until and unless the IUCr publications office decides that,
>>for that particular application, they are ready to accept something
>>new.
>
>Absolutely correct. The character elides of CIF1 are among its
>"common semantic features", which are expressly *not* part of the
>CIF1 format standard. CIF2 explicitly omits them as well, leaving
>them in exactly the same place they are now. None of this is at all
>affected by which encoding option we choose.
>
>> It is _only_ if you go forward with options 3, 4 or 5 that you
>>are giving the green light to users to do precisely what you are
>>concerned about -- using the unicode characters instead instead
>>in possibly strange admixtures that nobody is ready to process.
>
>The only way I can see that being true is if "text file" or "text"
>is intended to be interpreted, at least in part, as "containing only
>ASCII characters."
>Is that your intended meaning, Herb? Otherwise,
>CIF2's expansion to the full (more or less) Unicode character set
>opens the door for users to insert literal characters into their
>(conformant) CIFs in place of or in addition to elides, and none of
>the alternatives on the table change that. What the various
>alternatives *do* affect is which byte-sequence representations of
>those characters will conform to CIF2, under which circumstances.
>
>Independent of this particular issue, my greatest problem with
>options (1) and (2) is the imprecision of describing CIF2 simply as
>"text". That this served well enough for CIF1 is irrelevant; CIF2's
>character set lends much more importance and impact to the
>interpretation of this aspect of the spec. I see two, maybe even
>three, viable and functionally distinct possible definitions. Would
>any of the proponents of that wording care to advance a definition
>of that term as it is intended to be interpreted in a CIF2 context?
>This is substantially equivalent to James's open question, so no
>need to answer both.
>
>[...]
>
>> My apologies to James, who I know is trying to do what he believes
>>to be right, but I believe James has things backwards -- the "deep
>>breath" is provided by my proposal -- taking the time to properly engineer
>>the use of the extra characters UTF8 allows us to discuss clearly,
>>while James' push for an immediate prescriptive use of UTF8 with
>>prescriptions that differ drastically from what has been adopted
>>by all other frameworks (HTML, XML, python, etc.) in ways that
>>are untested and unsupported by most existing software is
>>the untimely rush to judgement.
>
>[...]
>
>I doubt any of us could disagree that there is an engineering
>challenge here, but I have to agree with James that the only viable
>opportunity to leave this area for further development is to be
>explicitly restrictive now, ala option (3) or (4).
>Not even my most
>preferred option (5) allows sufficient latitude for future extension
>without potentially invalidating some CIF2 CIFs and programs.
>
>Furthermore, I don't think that "all other frameworks" adopt an
>entirely uniform approach, nor one that is necessarily equivalent to
>option (1) or (2). For example, Sun's various implementations of
>the Java compiler seem to use "local" (in my sense of the term)
>unless the user passes an option to tell it otherwise. XML and
>XHTML use UTF-8 unless a different encoding is explicitly named in
>the file, identified via a byte-order mark, or otherwise
>communicated at a higher level. HTML tends to rely on a
>higher-level protocol to communicate encoding, but provides a
>mechanism for communicating it in-line. ALL of the CIF2 options
>currently on the table share some characteristics with one or more
>of those.
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>Email Disclaimer: www.stjude.org/emaildisclaimer
>
>_______________________________________________
>cif2-encoding mailing list
>cif2-encoding at iucr.org
>http://scripts.iucr.org/mailman/listinfo/cif2-encoding

--
=====================================================
Herbert J.
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From simonwestrip at btinternet.com Mon Sep 27 19:50:43 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Mon, 27 Sep 2010 11:50:43 -0700 (PDT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References: <162941.37460.qm@web87004.mail.ird.yahoo.com>
 <780727.99055.qm@web87010.mail.ird.yahoo.com>
 <526633.3484.qm@web87004.mail.ird.yahoo.com>
 <613218.81205.qm@web87011.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local>
 <281388.90819.qm@web87012.mail.ird.yahoo.com>
 <463665.7127.qm@web87004.mail.ird.yahoo.com>
 <262880.46378.qm@web87002.mail.ird.yahoo.com>
 <206078.51827.qm@web87010.mail.ird.yahoo.com>
 <476110.27334.qm@web87005.mail.ird.yahoo.com>
 <93847.2110.qm@web87014.mail.ird.yahoo.com>
 <223950.86835.qm@web87002.mail.ird.yahoo.com>
Message-ID: <807704.52817.qm@web87015.mail.ird.yahoo.com>

Dear Herbert

Unfortunately I do not have a Skype account. I will look into this, but am happy to respect the outcome of such a meeting in my absence.

Even though email can easily lead to misunderstanding (a few times I suspect we've actually been in agreement, even though the written word has suggested otherwise), at least the 'issues' have been presented.

In an attempt to clarify my current thinking:

I think each of us has at some point indicated a willingness to compromise their own views in order to reach a workable agreement that allows us to move forward - after all, CIF2 is far more than allowing non-ASCII in a CIF.

Taking your 'As for CIF1...' description as a basis for compromise (and recognizing that 'As for CIF1...'
was built on compromise), possible building blocks might include (in order of restrictiveness, but hopefully still respecting the motive of flexibility):

1) specification of a default encoding in the absence of any indication to the contrary, which at least highlights that there may be uncertainty in including non-ASCII text in the CIF;

2) specifying that UTF8 or any self-identifying encoding should be used if non-ASCII text is included in the CIF;

3) specifying that UTF8 or UTF16 should be used if non-ASCII text is included in the CIF.

There may be other 'building blocks' that I haven't summarized here.

Overall, I get the impression that the 'compromised positions' that have been indicated thus far all respect the spirit of the 'As for CIF1...' approach, but cannot support it without stronger wording that addresses the uncertainties that I think all of us agree are presented by moving beyond ASCII text.

To me, the above 'building blocks' in no way alter the fact that users can continue to do what they have always done; it is only when they want to do something new (i.e. use 'unicode'), that they will have to be aware that the new CIF specification has something to say about it.

Cheers

Simon

________________________________
From: Herbert J. Bernstein
To: Group for discussing encoding and content validation schemes for CIF2
Sent: Monday, 27 September, 2010 17:23:32
Subject: Re: [Cif2-encoding] How we wrap this up

Dear Simon,

We do not seem to be communicating effectively. Do you have a Skype account? We really need a meeting.

Regards,
Herbert

At 3:27 PM +0000 9/27/10, SIMON WESTRIP wrote:
>I see nothing wrong with a strategy to introduce CIF2 if necessary.
>My initial thoughts are that the current 'as for CIF1...' description
>is not best suited as base specification on which to build full
>unicode support, should such a strategy be pursued.
>
>However, I will reflect on this along with recent contributions from
>James and John...
> >Cheers > >Simon > > > >From: Herbert J. Bernstein >To: Group for discussing encoding and content validation schemes for >CIF2 >Sent: Monday, 27 September, 2010 14:45:16 >Subject: Re: [Cif2-encoding] How we wrap this up > >The problem is that options 3,4 and 5 specifically prescribe the >use of Unicode characters (that is the entire point of those >options -- and that is the point in dispute -- whether we should >be prescribing UTF8 or using is as we now use ASCII, as a way to >be clear what we are talking about as in CIF1) and we simply are not >ready to deal such a requirement yet. > >I take the blame for starting this discussion many years ago when >I simply asked for just what my motion says, that we start using >UTF8 in the same way we had been using ASCII. Unfortunately >this discussion has turned into a strong push to focus CIF on >that particular encoding, stop using Brian's elides, etc. With >the current weak state of software support for CIF and the large >investment at the IUCr and at the PDB in current workflows, I >think it would be a very disruptive and expensive change to make >right now. God and the Devil are in the details. > >Note that I am _not_ basing this argument on imgCIF. At this point >it appears, unfortunately, that CIF2 and imgCIF will have to diverge. >If we have enough face-to-face discussions, perhaps we can bring >them together again, as we did in 1998, but that is an even more >difficult discussion than the one we need to have on encodings. >What is I we will do is to go at this in incremental stages: > >1. Make the transition from CIF1 to CIF2 using new dictionaries >but allowing most data files to remain unchanges, and providing >simple algorithmic transformations for the rest, but keeping >most of the current semantic extensions that we have in CIF1, >focusing our enegry on getting the new dictionaries used and >making use of dREL; > >2. 
Work on a CIF2.1 that, by creative and well-supported use >of Unicode, allows for a well organized transition from Brian's >elides to use of Unicode characters > >3. Then working in that context, whatever it turns out to be, >work on having imgCIF make the transition to CIF2 in some >reasonably compatible way. > >I see how to do item 1 for next summer. I don't see how to do 2 and >3 in that time frame, though I am sure we could make a dent in >them if we could meet face to face. email tends to stiffen too >many positions. > >Regards, > Herbert > >===================================================== >Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu >===================================================== > >On Mon, 27 Sep 2010, SIMON WESTRIP wrote: > >> Dear Herbert >> >> I do not understand why it is *only* options 3, 4 or 5 that allow users to >> start using >> unicode characters? >> >> More generally, are you suggesting that the use of anything but ASCII in a >> data value is only allowed if >> e.g. the dictionary definition of the data item permits, or even only if the >> IUCr says that's OK? >> >> Fundamentally, I'm starting to infer that the purpose of the 'as for > > CIF1...' approach to encoding is >> to open the door to full unicode support, but not actually let anyone cross >> the threshold? >> >> >> Cheers >> >> Simon >> >> ____________________________________________________________________________ >> From: Herbert J. 
Bernstein >><yaya at bernstein-plus-sons.com> >> To: Group for discussing encoding and content validation schemes for CIF2 >> <cif2-encoding at iucr.org> >> Sent: Monday, 27 September, 2010 11:48:49 >> Subject: Re: [Cif2-encoding] How we wrap this up >> >> Dear Simon, >> >> Under the CIF2 specification with UTF8 in place of ASCII there is >> _no_ change in the use of elided ASCII sequences to represent non-ASCII >> characters until and unless the IUCr publications office decides that, >> for that particular application, they are ready to accept something >> new. >> >> It is _only_ if you go forward with options 3, 4 or 5 that you >> are giving the green light to users to do precisely what you are >> concerned about -- using the unicode characters instead instead >> in possibly strange admixtures that nobody is ready to process. >> >> Remember, under the CIF2 specification as now written, it is >> _not_ part of the CIF2 specification to determine the handling >> of the characters in quoted strings other than to ensure that >> those string do not contain illegal characters from the point >> of view of CIF2. Dealing with the validity of particular character >> sequences in strings users provide is, just as in CIF1, the >> responsibility of the application (i.e. the IUCr journal flows >> or the PDB archiving flows). >> >> My apologies to James, who I know is trying to do what he believes >> to be right, but I believe James has things backwards -- the "deep >> breath" is provided by my proposal -- taking the time to properly engineer >> the use of the extra characters UTF8 allows us to discuss clearly, >> while James' push for an immediate prescriptive use of UTF8 with >> prescriptions that differ drastically from what has been adopted >> by all other frameworks (HTML, XML, python, etc.) in ways that >> are untested and unsupported by most existing software is >> the untimely rush to judgement. 
>> >> I beg you to support options 1 and/or 2 to allow CIF2 to go forward >> in all other respects while we all take a deep breath and deal >> with the tricky issue you raised slowly and carefully without the >> pressure of trying to have CIF2 itself ready for next summer. >> >> Regards, >> Herbert >> >> At 9:34 AM +0000 9/27/10, SIMON WESTRIP wrote: >> >I was not so concerned about invalidating existing CIFs, or even the >> >likelihood >> >that users will continue to write e.g. 'f\'oo' - this is a syntax >> >error in CIF2 that is readily recoverable. >> > >> >Rather there is a large group of CIF1 users that are in the habit of >> >using elided ASCII sequences to >> >represent non-ASCII characters. With CIF2 these users will be able >> >to use the unicode character itself. >> >So we might end up with a mixture of esacaped sequences and unicode >> >characters (e.g. a user may have a keyboard shortcut >> >for an accented character that forms part of their name, but might >> >still resort to \a for alpha, under the assumption that \a is still >> >valid because CIF2 is basically the same as CIF1, and, rightly or >> >wrongly, they perceive the eliding machanism as part of >> >CIF syntax. >> > >> >I think this is an issue where we can't afford to take an 'as for >> >CIF1...' approach, especially as the CIF1 specification >> >isn't entirely satisfactory (e.g. there's an example in the >> >line-folding protocal that uses elides in a file path to make a >> >point, >> >but actually these elides may easily be interpretted as escape >> >sequences), and as the encoding issue is very much concerned with >> >user practice, the large group of users that currently use elided >> >character codes need to be aware what the situation is in >> >CIF2? >> > >> >I'm not convinced this issue should be left for discussion later; >> >it is relevant when considering how the move beyond ASCII is specified. >> > >> >Cheers > > > >> >Simon >> > >> > >> > >> > >> >From: Herbert J. 
Bernstein >><yaya at bernstein-plus-sons.com> >> >To: Group for discussing encoding and content validation schemes for >> >CIF2 <cif2-encoding at iucr.org> >> >Sent: Sunday, 26 September, 2010 23:14:55 >> >Subject: Re: [Cif2-encoding] How we wrap this up >> > >> >Dear Simon, >> > >> > The current CIF2 spec, with or without the changes I have suggested >> >to temporarily resolve the encoding issue is at best vague and >> >confusing on the elide character issue. The interacting issue on >> >which the CIF2 spec >> >is clear is that we are changing the handling of quoted strings so >> >that they end on the first occurrence of the quoting character and leaves >> >the handling of elides to the calling application. >> > >> > This will be a problem -- the change from CIF1 in the termination of >> >quoted strings along with the absence of a way of eliding the quotes >> >will invalidate a significant number of existing CIFS without any simple >> >mechanism to recover. Rather than reopen another endless discussion, >> >I would suggest we simply add the python string concatenation character >> >"+" to ensure we can map all current CIF1 files and use Brian's common >> >semantic features for the moment. We can then deal with the full elides >> >discussion at a future date. >> > >> > Regards, >> > Herbert >> > >> > >> > >> > >> > >> >At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote: >> >>Dear all >> >> >> >>While reviewing my hypothetical 'to do' list for implementing CIF2 >> >>in current software, I realized that >> >>the issue of current support for elided character codes hasnt really >> >>been addressed in the context of CIF2. >> >>My 'to do' list contains notes that software could treat them as >> >>keyboard shortcuts, and their use could be >> >>defined in the dictionary. However, that was based on a distinct >> >>difference between CIF1 and CIF2, >> >>while the current arguments for 'as for CIF1...' 
suggest that the >> >>distinction between CIF1 and CIF2 >> >>should almost be imperceptible. >> >> >> >>How is this issue to be addressed in the specification? >> >> >> >>Cheers >> >> >> >>Simon >> >> >> >> >> >> >> >>From: Herbert J. Bernstein >> >>>><yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.com> >> >> >>To: Group for discussing encoding and content validation schemes for >> >>CIF2 >><cif2-encoding at iucr.org>cif2-encoding at iucr.org> >> >> >>Sent: Saturday, 25 September, 2010 20:37:46 >> >>Subject: Re: [Cif2-encoding] How we wrap this up >> >> >> >>Thank you for your cooperation. -- Herbert >> >> >> >>===================================================== >> >>Herbert J. Bernstein, Professor of Computer Science >> >> Dowling College, Kramer Science Center, KSC 121 >> >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> >> >> +1-631-244-3035 >> >> >> >>>>yaya at dowling.edu>yaya at dowling.edu>yaya at dowling.ed >> >> u>yaya at dowling.edu >> >>===================================================== >> >> >> >>On Sat, 25 Sep 2010, SIMON WESTRIP wrote: >> >> >> >>> OK - as promised, I wont pursue the matter :-) >> >>> >> >>> >> >>> >> >>>________________________________________________________________________ >> ____ >> >>> From: Herbert J. Bernstein >> >>>>><yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.c >> >> >>om>yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.com> >> >> >>> To: Group for discussing encoding and content validation schemes for >> CIF2 >> >>> >> >>>>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> >> > > >>> Sent: Saturday, 25 September, 2010 19:18:54 >> >>> Subject: Re: [Cif2-encoding] How we wrap this up >> >>> >> >>> Dear Simon, >> >>> >> >>> Unfortunately, that is likely to take us back into our infinite loop >> or >> >>> into a diverging spiral. 
Right now, we would have UTF8 as no >> >>>more or less a >> >>> default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad >> >>>first guess as >> >>> the likely default encoding for any given CIF, but not a formal >> >>>constraint. >> >>> I would suggest we leave the wording in that imprecise state, get CIF2 >> out >> >>> and accepted and then work further on the encoding issue. >> >>> >> >>> Regards, >> >>> Herbert >> >>> >> >>> ===================================================== >> >>> Herbert J. Bernstein, Professor of Computer Science >> >>> Dowling College, Kramer Science Center, KSC 121 >> >>> Idle Hour Blvd, Oakdale, NY, 11769 >> >>> >> >>> +1-631-244-3035 >> >>> >> >>>>>yaya at dowling.edu>yaya at dowling.edu>yaya at dowling.e >> >> du>yaya at dowling.edu >> >>> ===================================================== >> >>> >> >>> On Sat, 25 Sep 2010, SIMON WESTRIP wrote: >> >>> >> >>> > Dear all >> >>> > >> >>> > In the event that CIF2 adopts the 'any encoding' approach, >> >>>would there be >> >> > > any objections to >> > >> > explicitly defining a default encoding in the specification, to be >> >>> defaulted >> >>> > to when there were no indications >> >>> > to the contrary. At worst this would give CIF2 service >> >>>providers an excuse >> >>> > to interpret CIFs as e.g. UTF8 if they couldnt >> >>> > determine the encoding by other means - but such intollerant service >> >>> > providers would soon find that their service is >> >>> > not successful - while at best this might raise awareness of the >> issues >> >>> > regarding encoding once non-ASCII is used in >> >>> > a CIF. Essentially, it does not require users to change there working >> >>> > practices, which is one of the main arguments for >> >>> > 'any encoding'. >> >>> > >> >>> > So, CIF2 would remain 'any encoding', and specifications in >> >>>terms of e.g. >> >>> > "Herbert's as for CIF1..." 
>> >>> > might only require a single sentence to define the default after >> stating >> >>> > what the 'preferred' encoding was; >> >>> > the proposal might be phrased as "Herbert's as for CIF1..." + >> "explicit >> >>> > default encoding"? >> >>> > >> >>> > I do not wish to prolong this debate - if there are objections >> >>>I will not >> >>> > launch into an endless round of exchanges >> >>> > that cover the same ground that has led us this far. >> >>> > >> >>> > Cheers >> >>> > >> >>> > Simon >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> >> >>>>_______________________________________________________________________ >> ____ >> >>> _ >> >>> > From: SIMON WESTRIP >> >>>>><simonwestrip at btinternet.com>simonwestrip at btinternet.com >> >> >>>simonwestrip at btinternet.com>simonwestrip at btinternet.com> >> >> >>> > To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >> >>>>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> >> >> >>> > Sent: Friday, 24 September, 2010 20:10:13 >> >>> > Subject: Re: [Cif2-encoding] How we wrap this up >> >>> > >> >>> > Dear James >> >>> > >> >>> > As you may have gathered I have been reconsidering my position on >> this >> >>> > issue. >> >>> > Please forgive me, but I would like to change my vote if that is OK, >> in >> >>> > favour of the 'any encoding' camp. >> >>> > This apparent U-turn is not a response to recent >> >>>contributions; rather it >> >>> is >> >>> > the outcome of a meeting I had this morning > > >>> > where I demonstrated some new software to the Managing >Editor of IUCr >> >>> > journals. >> >>> > >> >>> > By way of explanation: >> >>> > >> >>> > I have been developing a new docx template which the IUCr >> >>>editorial office >> >>> > is shortly to release for use by >> >>> > authors. 
The template will be packaged with some tools to extract >> data >> >>> from >> >>> > CIFs >> >>> > and tabulate them in the Word document, e.g. open an mmCIF, click a >> >>> button, >> >>> > and standard >> >>> > tables populated with data from the CIF will be included in >> >>>the document, >> >>> > acting as >> >>> > table templates for the author to edit as appropriate for their >> >>> manuscript. >> >>> > >> >>> > Inclusion of the mmCIF tools is part of an unofficial policy to >> 'coax' >> >>> > biologists to start using/accepting mmCIF >> >>> > as a useful medium, rather than as a product of their deposition to >> the >> >>> PDB, >> >>> > and to encourage them to become comfortable >> >>> > with passing mmCIFs between applications, and even to edit the >> >>>things (in >> >>> > the same way as the core-CIF community >> >>> > treats CIFs). For example, our perception is that there is no reason >> why >> >>> an >> >>> > author should not feel free to take an mmCIF >> >>> > that has been created by e.g. pdb_extract and populate it using >> >>> third-party >> >>> > software before uploading to the PDB for >> >>> > deposition. >> >>> > >> >>> > This cause would not be furthered by effectively invalidating >> >>>an mmCIF if >> >>> it >> >>> > were not to be encoded in one of >> >>> > the specified encodings. >> >>> > >> >>> > So although I am uneasy about a specification that propogates >> >>>uncertainty, >> >>> > I'm also uneasy about alienating users, >> >>> > especially when we are struggling to change their mindset as in the >> case >> >>> of >> >>> > the biological community >> >>> > (my perception of the biological community's attitude to mmCIF >> >>>is based on >> >>> > feedback from authors/coeditors to >> >>> > IUCr journals). 
>> >>> > >> > >> > Granted this may not be the most compelling argument in favour of >> 'any >> >>> > encoding', but recognizing the hurdles that >> >>> > may have to be overcome once we move beyond ASCII whatever the CIF2 >> >>> > specification, I support 'any encoding' >> >>> > as 'a means to an end'. >> >>> > >> >>> > I will not provide my preferences in terms of the numbered options >> until >> >> > you >> >>> > say so; afterall, I have already voted and >> >>> > all this has to be signed off by COMCIFs in any case. >> >>> > >> >>> > Cheers >> >>> > >> >>> > Simon >> >>> > >> >>> > >> >>> > >> >>> > >> >>> >> >>>>_______________________________________________________________________ >> ____ >> >>> _ >> >>> > From: "Bollinger, John C" >> >>>>><John.Bollinger at STJUDE.ORG>John.Bollinger at STJUDE.ORG>> >> >>ilto:John.Bollinger at STJUDE.ORG>John.Bollinger at STJUDE.ORG> >> >> >>> > To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >> >>>>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> >> >> >>> > Sent: Friday, 24 September, 2010 14:50:57 >> >>> > Subject: Re: [Cif2-encoding] How we wrap this up >> >>> > >> >>> > Dear Simon, >> >>> > >> >>> > It is exactly this sort of issue that drove me to support more >> >>>permissive >> >>> > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local >> >>> proposal. >> >>> > >> >>> > Do please think about the considerations Herb raised. As you >> reconsider >> >>> > your votes, I urge you also to ask yourself what, *precisely*, a >> "text >> >>> file" >> >>> > is, and to consider whether your answer is functionally >> >>>different from my >> >>> > "local". If you decide not, then please consider what that >> >>>answer implies > > >>> > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) >> under >> >>> > each option on the table, especially for CIFs containing non-ASCII >> >>> > characters. 
Whatever you decide about the meaning of "text >> >>>file", please >> >>> > consider whether reasonable people might reach a different >> >>>conclusion, as >> >>> I >> >>> > assert they might do, and to what extent the standard needs to >> address >> >>> that. >> >>> > >> >>> > >> >>> > Regards, >> >>> > >> >>> > John >> >>> > -- >> >>> > John C. Bollinger, Ph.D. >> >>> > Department of Structural Biology >> >>> > St. Jude Children's Research Hospital >> >>> > >> >>> > >> >>> > >From: >> >>>>>cif2-encoding-bounces at iucr.org>cif2-encoding-bounces at iuc >> >> >>r.org>cif2-encoding-bounces at iucr.org>cif2-encoding-bounces at iucr.org >> >> >> >>> > >> >>>>>[mailto:cif2-encoding-bounces at iucr.org>cif2-encoding-bou >> >> >>nces at iucr.org>cif2-encoding-bounces at iucr.org>cif2-encoding-bounces@ >> >> iucr.org] >> >>>On Behalf Of SIMON WESTRIP >> >>> > >Sent: Friday, September 24, 2010 7:53 AM >> >>> > >To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >Subject: Re: [Cif2-encoding] How we wrap this up. . >> >>> > > >> >>> > >Dear Herbert >> >>> > > >> >>> > >Not for the first time, I find your arguement persuasive. Brian's >> vote >> >>> and >> >>> > explanation have also raised some >> >>> > >questions that I would like to look into. >> >>> > > >> >>> > >I will confirm or otherwise my vote as soon as possible, >> >>>assuming that is >> >>> > OK with James and assuming that >> >>> > >this round of votes might wrap this up. >> >>> > > >> >>> > >Cheers >> >>> > > >> >>> > >Simon >> >>> > > >> >>> > >________________________________________ >> >>> > >From: Herbert J. 
Bernstein >> >>>>><yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.c >> >> >>om>yaya at bernstein-plus-sons.com>yaya at bernstein-plus-sons.com> >> >> >>> > >To: Group for discussing encoding and content validation >> >>>schemes for CIF2 >> >>> > >> >>>>><cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >> >>if2-encoding at iucr.org>cif2-encoding at iucr.org> >> >> >>> > >Sent: Friday, 24 September, 2010 13:17:14 >> >>> > >Subject: Re: [Cif2-encoding] How we wrap this up >> >>> > > >> >>> > >If he ignores the standard, in most cases all he has to do to >> >>>comply with >> >>> > CIF2 is to run whatever applications he currently runs to produce >> CIF1 >> >>> and, >> >>> > perhaps, in some cases, run a minor edit pass at the end, to convert >> for >> >>> the >> >>> > minor syntactive differences and/or changed tags required to comply >> with >> >>> > CIF2 and the new dictionaries, but he is unlikely to have to do >> anything >> > >> to >> >>> > deal with the messy business of whether his encoding is really a >> proper >> >>> UTF8 >> >>> > encoding or not. >> >>> > >> >>> > >The punishment if he tries to comply, is that he has to totally >> uproot >> >>> and >> >>> > reconfigure the environment in which he produces CIFs from >> >>>whatever he is >> >>> > currently doing to create an enviroment in which he can reliably >> create >> >>> and, >> >>> > more importantly, transmit compliant UTF8 files. This can be >> >>>very tricky >> >>> if >> >>> > he does only a partial job, say fudging in one special >> >>>application (yet to >> >>> > be written), because if he stays with his old system, all kinds of >> tools >> >>> > will keep trying to transcode whatever he has produced back to >> whatever >> >>> his >> >>> > system considers a standard. Those of us who have files, >> >>>applications and > > >>> > tools that have lived through several generations of macs are >> >>>living proof >> >>> > of the problem. 
Macs now have excellent UTF8/16 unicode >> >>>support, but every >> >> > > once in a while in working with a unicode file I find it has been >> >>> strangely >> >>> > and unexpectedly converted to something else, and it can be >> >>>really tricky >> >>> to >> >>> > spot when the unaccented roman text part has been left >> >>>untouched but just >> >>> a >> >>> > few accen >> >>> > ted letters have gotten different accents. >> >>> > >> >>> > >Mandating UTF8 is simply trying to shift a serious software >> >>>problem from >> >>> > the central handlers of CIF (IUCr, PDB, etc.) to the external >> >>>users. Most >> >>> > users will probably have the good sense to simply ignore the demand >> and >> >>> > leave the burden just where it is now. A few sophisticated users >> will >> >>> > probably adapt with no trouble, but the punishment for those users >> who >> >>> > blindly follow orders before we have a complete multiplatform >> supporting >> >>> > infrastructure in place by mandating UTF8 is severe, expensive and >> >>> > undeserved. Until and unless we have developed solid support, we >> will >> >>> just >> >>> > be alienating people from CIF. I will continue to oppose such a >> move. >> >>> > >> >>> > [...] >> >>> > >> >>> > >> >>> > Email Disclaimer: >> >>>>><<http://www.stjude.org/emaildisclaimer>http://www.stjude.org/emaildiscl >> >> >>aimer><http://www.stjude.org/emaildisclaimer>www.stjude.org/emaildisclaimer >> >> >> >>> > _______________________________________________ >> >>> > cif2-encoding mailing list >> >>> > >> >>>>>cif2-encoding at iucr.org>cif2-encoding at iucr.org>> >> >>f2-encoding at iucr.org>cif2-encoding at iucr.org >> >> >>> > >> >>>>><<http://scripts.iucr.org/mailman/listinfo/cif2-encoding>http://scripts. 
>> >> >>iucr.org/mailman/listinfo/cif2-encoding><http://scripts.iucr.org/mailman/li >> >> >>stinfo/cif2-encoding>http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> >> >>> > >> >>> > >> >>> >> >>> >> >> >> >>_______________________________________________ >> >>cif2-encoding mailing list >> >>>>cif2-encoding at iucr.org>cif2-encoding at iucr.org >> >> >>>><http://scripts.iucr.org/mailman/listinfo/cif2-encoding>http://scripts.iu >> >> cr.org/mailman/listinfo/cif2-encoding >> > >> > >> >-- >> >===================================================== >> > Herbert J. Bernstein, Professor of Computer Science >> > Dowling College, Kramer Science Center, KSC 121 >> > Idle Hour Blvd, Oakdale, NY, 11769 >> > >> > +1-631-244-3035 >> > >>yaya at dowling.edu>yaya at dowling.edu >> >> >===================================================== >> >_______________________________________________ >> >cif2-encoding mailing list >> >>>cif2-encoding at iucr.org>cif2-encoding at iucr.org >> >> >>><http://scripts.iucr.org/mailman/listinfo/cif2-encoding>http://scripts.iuc >> >> r.org/mailman/listinfo/cif2-encoding >> > >> > >> >_______________________________________________ >> >cif2-encoding mailing list >> >cif2-encoding at iucr.org >> >>>http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> >> >> -- >> ===================================================== >> Herbert J. 
Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 > > Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> _______________________________________________ >> cif2-encoding mailing list >> cif2-encoding at iucr.org >> >>http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> >> >> > >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== _______________________________________________ cif2-encoding mailing list cif2-encoding at iucr.org http://scripts.iucr.org/mailman/listinfo/cif2-encoding -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100927/a429b0e2/attachment-0001.html From yaya at bernstein-plus-sons.com Mon Sep 27 20:15:59 2010 From: yaya at bernstein-plus-sons.com (Herbert J. 
Bernstein) Date: Mon, 27 Sep 2010 15:15:59 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <807704.52817.qm@web87015.mail.ird.yahoo.com> References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <93847.2110.qm@web87014.mail.ird.yahoo.com> <223950.86835.qm@web87002.mail.ird.yahoo.com> <807704.52817.qm@web87015.mail.ird.yahoo.com> Message-ID: Dear Simon, I personally have no objection to specification of UTF-8 as a default encoding in the absence of indications of some different encoding. Until we achieve more clarity on what a "self-identifying encoding" means, I am uncomfortable with mandating such an encoding for non-ASCII text, but am very happy to recommend it for non-ASCII text. And again, with UTF-8 or UTF-16, I am very uncomfortable mandating those particular encodings for non-ASCII text, but am very happy to recommend them for non-ASCII text. The problem with all of this is that, if we say more than a very few words, somebody has strong objections to some part of what is said. If we could meet, I expect we could reach some appropriate compromises. Everything you have suggested and my variations have all been put forward at some point during this endless discussion, and somebody absolutely could not live with some aspect of it. At this point the only way out of this seems to be to say as little as possible, but even that now seems to raise strong objection. It is all a great shame. Regards, Herbert ===================================================== Herbert J. 
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Mon, 27 Sep 2010, SIMON WESTRIP wrote: > Dear Herbert > > Unfortunately I do not have a Skype account. I will look into this, but am happy to respect the outcome of > such a meeting in my absence. Even though email can easily lead to misunderstanding > (a few times I suspect we've actually been in agreement, even though the written word has suggested > otherwise), > at least the 'issues' have been presented. > > In an attempt to clarify my current thinking: > > I think each of us has at some point indicated a willingness to compromise their own > views in order to reach a workable agreement that allows us to move forward - afterall > CIF2 is far more than allowing non-ASCII in a CIF. > > Taking your 'As for CIF1...' description as a basis for compromise (and recognizing that > 'As for CIF1...' was built on compromise), possible building blocks might include > (in order of restrictiveness, but hopefully still respecting the motive of flexibility): > > 1) specification of a default encoding in the absence of any indication to the contrary, which at least > highlights that > there may be uncertainty in including non-ASCII text in the CIF; > > 2) specifying that UTF8 or any self-identifying encoding should be used if non-ASCII text is included in > the CIF; > > 3) specifying that UTF8 or UTF16 should be used if non-ASCII text is included in the CIF. > > There may be other 'building blocks' that I haven't summarized here. > > Overall, I get the impression that the 'compromised positions' that have been indicated thus far all > respect the spirit > of the 'As for CIF1...' approach, but cannot support it without stronger wording that addresses the > uncertainties that I think all of us agree are presented by moving beyond ASCII text. 
> > To me, the above 'building blocks' in no way alter the fact that users can continue to do what they have > always done; > it is only when they want to do something new (i.e. use 'unicode'), that they will have to be aware that > the new CIF specification > has something to say about it. > > Cheers > > Simon > > > > > > __________________________________________________________________________________________________________ > From: Herbert J. Bernstein > To: Group for discussing encoding and content validation schemes for CIF2 > Sent: Monday, 27 September, 2010 17:23:32 > Subject: Re: [Cif2-encoding] How we wrap this up > > Dear Simon, > > We do not seem to be communicating effectively. Do you have a > Skype account? We really need a meeting. > > Regards, > Herbert > > > At 3:27 PM +0000 9/27/10, SIMON WESTRIP wrote: > >I see nothing wrong with a strategy to introduce CIF2 if necessary. > >My initial thoughts are that the current 'as for CIF1...' description > >is not best suited as base specification on which to build full > >unicode support, should such a strategy be pursued. > > > >However, I will reflect on this along with recent contributions from > >James and John... > > > >Cheers > > > >Simon > > > > > > > >From: Herbert J. Bernstein > >To: Group for discussing encoding and content validation schemes for > >CIF2 > >Sent: Monday, 27 September, 2010 14:45:16 > >Subject: Re: [Cif2-encoding] How we wrap this up > > > >The problem is that options 3, 4 and 5 specifically prescribe the > >use of Unicode characters (that is the entire point of those > >options -- and that is the point in dispute -- whether we should > >be prescribing UTF8 or using it as we now use ASCII, as a way to > >be clear what we are talking about as in CIF1) and we simply are not > >ready to deal with such a requirement yet. 
> > > >I take the blame for starting this discussion many years ago when > >I simply asked for just what my motion says, that we start using > >UTF8 in the same way we had been using ASCII. Unfortunately > >this discussion has turned into a strong push to focus CIF on > >that particular encoding, stop using Brian's elides, etc. With > >the current weak state of software support for CIF and the large > >investment at the IUCr and at the PDB in current workflows, I > >think it would be a very disruptive and expensive change to make > >right now. God and the Devil are in the details. > > > >Note that I am _not_ basing this argument on imgCIF. At this point > >it appears, unfortunately, that CIF2 and imgCIF will have to diverge. > >If we have enough face-to-face discussions, perhaps we can bring > >them together again, as we did in 1998, but that is an even more > >difficult discussion than the one we need to have on encodings. > >What is I we will do is to go at this in incremental stages: > > > >1. Make the transition from CIF1 to CIF2 using new dictionaries > >but allowing most data files to remain unchanged, and providing > >simple algorithmic transformations for the rest, but keeping > >most of the current semantic extensions that we have in CIF1, > >focusing our energy on getting the new dictionaries used and > >making use of dREL; > > > >2. Work on a CIF2.1 that, by creative and well-supported use > >of Unicode, allows for a well organized transition from Brian's > >elides to use of Unicode characters > > > >3. Then working in that context, whatever it turns out to be, > >work on having imgCIF make the transition to CIF2 in some > >reasonably compatible way. > > > >I see how to do item 1 for next summer. I don't see how to do 2 and > >3 in that time frame, though I am sure we could make a dent in > >them if we could meet face to face. email tends to stiffen too > >many positions. > > > >Regards, > > 
Herbert > > > >===================================================== > >Herbert J. Bernstein, Professor of Computer Science > > Dowling College, Kramer Science Center, KSC 121 > > Idle Hour Blvd, Oakdale, NY, 11769 > > > > +1-631-244-3035 > > yaya at dowling.edu > >===================================================== > > > >On Mon, 27 Sep 2010, SIMON WESTRIP wrote: > > > >> Dear Herbert > >> > >> I do not understand why it is *only* options 3, 4 or 5 that allow users to > >> start using > >> unicode characters? > >> > >> More generally, are you suggesting that the use of anything but ASCII in a > >> data value is only allowed if > >> e.g. the dictionary definition of the data item permits, or even only if the > >> IUCr says that's OK? > >> > >> Fundamentally, I'm starting to infer that the purpose of the 'as for > >> CIF1...' approach to encoding is > >> to open the door to full unicode support, but not actually let anyone cross > >> the threshold? > >> > >> > >> Cheers > >> > >> Simon > >> > >> ____________________________________________________________________________ > >> From: Herbert J. Bernstein > >><yaya at bernstein-plus-sons.com> > >> To: Group for discussing encoding and content validation schemes for CIF2 > >> <cif2-encoding at iucr.org> > >> Sent: Monday, 27 September, 2010 11:48:49 > >> Subject: Re: [Cif2-encoding] How we wrap this up > >> > >> Dear Simon, > >> > >> Under the CIF2 specification with UTF8 in place of ASCII there is > >> _no_ change in the use of elided ASCII sequences to represent non-ASCII > >> characters until and unless the IUCr publications office decides that, > >> for that particular application, they are ready to accept something > >> new. > >> > >> It is _only_ if you go forward with options 3, 4 or 5 that you > >> are giving the green light to users to do precisely what you are > >> concerned about -- using the unicode characters instead > >> 
in possibly strange admixtures that nobody is ready to process. > >> > >> Remember, under the CIF2 specification as now written, it is > >> _not_ part of the CIF2 specification to determine the handling > >> of the characters in quoted strings other than to ensure that > >> those strings do not contain illegal characters from the point > >> of view of CIF2. Dealing with the validity of particular character > >> sequences in strings users provide is, just as in CIF1, the > >> responsibility of the application (i.e. the IUCr journal flows > >> or the PDB archiving flows). > >> > >> My apologies to James, who I know is trying to do what he believes > >> to be right, but I believe James has things backwards -- the "deep > >> breath" is provided by my proposal -- taking the time to properly engineer > >> the use of the extra characters UTF8 allows us to discuss clearly, > >> while James' push for an immediate prescriptive use of UTF8 with > >> prescriptions that differ drastically from what has been adopted > >> by all other frameworks (HTML, XML, python, etc.) in ways that > >> are untested and unsupported by most existing software is > >> the untimely rush to judgement. > >> > >> I beg you to support options 1 and/or 2 to allow CIF2 to go forward > >> in all other respects while we all take a deep breath and deal > >> with the tricky issue you raised slowly and carefully without the > >> pressure of trying to have CIF2 itself ready for next summer. > >> > >> Regards, > >> Herbert > >> > >> At 9:34 AM +0000 9/27/10, SIMON WESTRIP wrote: > >> >I was not so concerned about invalidating existing CIFs, or even the > >> >likelihood > >> >that users will continue to write e.g. 'f\'oo' - this is a syntax > >> >error in CIF2 that is readily recoverable. > >> > > >> >Rather there is a large group of CIF1 users that are in the habit of > >> >using elided ASCII sequences to > >> >represent non-ASCII characters. 
With CIF2 these users will be able > >>? >to use the unicode character itself. > >>? >So we might end up with a mixture of esacaped sequences and unicode > >>? >characters (e.g. a user may have a keyboard shortcut > >>? >for an accented character that forms part of their name, but might > >>? >still resort to \a for alpha, under the assumption that \a is still > >>? >valid because CIF2 is basically the same as CIF1, and, rightly or > >>? >wrongly, they perceive the eliding machanism as part of > >>? >CIF syntax. > >>? > > >>? >I think this is an issue where we can't afford to take an 'as for > >>? >CIF1...' approach, especially as the CIF1 specification > >>? >isn't entirely satisfactory (e.g. there's an example in the > >>? >line-folding protocal that uses elides in a file path to make a > >>? >point, > >>? >but actually these elides may easily be interpretted as escape > >>? >sequences), and as the encoding issue is very much concerned with > >>? >user practice, the large group of users that currently use elided > >>? >character codes need to be aware what the situation is in > >>? >CIF2? > >>? > > >>? >I'm not convinced this issue should be left for discussion later; > >>? >it is relevant when considering how the move beyond ASCII is specified. > >>? > > >>? >Cheers > >? > > > >>? >Simon > >>? > > >>? > > >>? > > >>? > > >>? >From: Herbert J. Bernstein > >><yaya at bernstein-plus-sons.com> > >>? >To: Group for discussing encoding and content validation schemes for > >>? >CIF2 <cif2-encoding at iucr.org> > >>? >Sent: Sunday, 26 September, 2010 23:14:55 > >>? >Subject: Re: [Cif2-encoding] How we wrap this up > >>? > > >>? >Dear Simon, > >>? > > >>? >? The current CIF2 spec, with or without the changes I have suggested > >>? >to temporarily resolve the encoding issue is at best vague and > >>? >confusing on the elide character issue.? The interacting issue on > >>? >which the CIF2 spec > >>? >is clear is that we are changing the handling of quoted strings so > >>? 
>that they end on the first occurrence of the quoting character and leaves > >>? >the handling of elides to the calling application. > >>? > > >>? >? This will be a problem -- the change from CIF1 in the termination of > >>? >quoted strings along with the absence of a way of eliding the quotes > >>? >will invalidate a significant number of existing CIFS without any simple > >>? >mechanism to recover.? Rather than reopen another endless discussion, > >>? >I would suggest we simply add the python string concatenation character > >>? >"+" to ensure we can map all current CIF1 files and use Brian's common > >>? >semantic features for the moment.? We can then deal with the full elides > >>? >discussion at a future date. > >>? > > >>? >? Regards, > >>? >? ? Herbert > >>? > > >>? > > >>? > > >>? > > >>? > > >>? >At 1:40 PM -0700 9/26/10, SIMON WESTRIP wrote: > >>? >>Dear all > >>? >> > >>? >>While reviewing my hypothetical 'to do' list for implementing CIF2 > >>? >>in current software, I realized that > >>? >>the issue of current support for elided character codes hasnt really > >>? >>been addressed in the context of CIF2. > >>? >>My 'to do' list contains notes that software could treat them as > >>? >>keyboard shortcuts, and their use could be > >>? >>defined in the dictionary. However, that was based on a distinct > >>? >>difference between CIF1 and CIF2, > >>? >>while the current arguments for 'as for CIF1...' suggest that the > >>? >>distinction between CIF1 and CIF2 > >>? >>should almost be imperceptible. > >>? >> > >>? >>How is this issue to be addressed in the specification? > >>? >> > >>? >>Cheers > >>? >> > >>? >>Simon > >>? >> > >>? >> > >>? >> > >>? >>From: Herbert J. Bernstein > >>?>><yaya at bernstein-plus-sons.com> sons.com>yaya at bernstein-plus-sons.com> > >>? >>To: Group for discussing encoding and content validation schemes for > >>? >>CIF2 > >><cif2-encoding at iucr.org>cif2-enco > ding at iucr.org> > >>? 
>>Sent: Saturday, 25 September, 2010 20:37:46 > >>? >>Subject: Re: [Cif2-encoding] How we wrap this up > >>? >> > >>? >>Thank you for your cooperation. -- Herbert > >>? >> > >>? >>===================================================== > >>? >>Herbert J. Bernstein, Professor of Computer Science > >>? >>? Dowling College, Kramer Science Center, KSC 121 > >>? >>? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > >>? >> > >>? >>? ? ? ? ? ? ? ? +1-631-244-3035 > >>? >> > >>?>>yaya at dowling.edu>yaya at dowling.edu> ilto:yaya at dowling.ed > >>? u>yaya at dowling.edu > >>? >>===================================================== > >>? >> > >>? >>On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > >>? >> > >>? >>>? OK - as promised, I wont pursue the matter :-) > >>? >>> > >>? >>> > >>? >>> > >>? >>>________________________________________________________________________ > >>? ____ > >>? >>>? From: Herbert J. Bernstein > >>?>>><yaya at bernstein-plus-sons.com> ein-plus-sons.c>yaya at bernstein-plus-sons.c > >> > >>om>yaya at bernstein-plus-sons.com> s-sons.com>yaya at bernstein-plus-sons.com> > >>? >>>? To: Group for discussing encoding and content validation schemes for > >>? CIF2 > >>? >>> > >>?>>><cif2-encoding at iucr.org> > cif2-encoding at iucr.org> >> > >>if2-encoding at iucr.org>cif2-encoding at iucr.o > rg> > >? > >>>? Sent: Saturday, 25 September, 2010 19:18:54 > >>? >>>? Subject: Re: [Cif2-encoding] How we wrap this up > >>? >>> > >>? >>>? Dear Simon, > >>? >>> > >>? >>>? ? Unfortunately, that is likely to take us back into our infinite loop > >>? or > >>? >>>? into a diverging spiral.? Right now, we would have UTF8 as no > >>? >>>more or less a > >>? >>>? default for CIF2 than ASCII is for CIF1 -- i.e. a not too bad > >>? >>>first guess as > >>? >>>? the likely default encoding for any given CIF, but not a formal > >>? >>>constraint. > >>? >>>? I would suggest we leave the wording in that imprecise state, get CIF2 > >>? out > >>? >>>? 
and accepted and then work further on the encoding issue. > >>? >>> > >>? >>>? ? Regards, > >>? >>>? ? ? Herbert > >>? >>> > >>? >>>? ===================================================== > >>? >>>? Herbert J. Bernstein, Professor of Computer Science > >>? >>>? ? Dowling College, Kramer Science Center, KSC 121 > >>? >>>? ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > >>? >>> > >>? >>>? ? ? ? ? ? ? ? ? +1-631-244-3035 > >>? >>> > >>?>>>yaya at dowling.edu>yaya at dowling.edu> ailto:yaya at dowling.e > >>? du>yaya at dowling.edu > >>? >>>? ===================================================== > >>? >>> > >>? >>>? On Sat, 25 Sep 2010, SIMON WESTRIP wrote: > >>? >>> > >>? >>>? > Dear all > >>? >>>? > > >>? >>>? > In the event that CIF2 adopts the 'any encoding' approach, > >>? >>>would there be > >>? >>? > > any objections to > >>? >? >>? > explicitly defining a default encoding in the specification, to be > >>? >>>? defaulted > >>? >>>? > to when there were no indications > >>? >>>? > to the contrary. At worst this would give CIF2 service > >>? >>>providers an excuse > >>? >>>? > to interpret CIFs as e.g. UTF8 if they couldnt > >>? >>>? > determine the encoding by other means - but such intollerant service > >>? >>>? > providers would soon find that their service is > >>? >>>? > not successful - while at best this might raise awareness of the > >>? issues > >>? >>>? > regarding encoding once non-ASCII is used in > >>? >>>? > a CIF. Essentially, it does not require users to change there working > >>? >>>? > practices, which is one of the main arguments for > >>? >>>? > 'any encoding'. > >>? >>>? > > >>? >>>? > So, CIF2 would remain 'any encoding', and specifications in > >>? >>>terms of e.g. > >>? >>>? > "Herbert's as for CIF1..." > >>? >>>? > might only require a single sentence to define the default after > >>? stating > >>? >>>? > what the 'preferred' encoding was; > >>? >>>? > the proposal might be phrased as "Herbert's as for CIF1..." + > >>? "explicit > >>? >>>? 
> default encoding"? > >>? >>>? > > >>? >>>? > I do not wish to prolong this debate - if there are objections > >>? >>>I will not > >>? >>>? > launch into an endless round of exchanges > >>? >>>? > that cover the same ground that has led us this far. > >>? >>>? > > >>? >>>? > Cheers > >>? >>>? > > >>? >>>? > Simon > >>? >>>? > > >>? >>>? > > >>? >>>? > > >>? >>>? > > >>? >>>? > > >>? >>>? > > >>? >>> > >>? >>>>_______________________________________________________________________ > >>? ____ > >>? >>>? _ > >>? >>>? > From: SIMON WESTRIP > >>?>>><simonwestrip at btinternet.com> btinternet.com>simonwestrip at btinternet.com > >>?>simonwestrip at btinternet.com> com>simonwestrip at btinternet.com> > >>? >>>? > To: Group for discussing encoding and content validation > >>? >>>schemes for CIF2 > >>? >>>? > > >>?>>><cif2-encoding at iucr.org> > cif2-encoding at iucr.org> >> > >>if2-encoding at iucr.org>cif2-encoding at iucr.o > rg> > >>? >>>? > Sent: Friday, 24 September, 2010 20:10:13 > >>? >>>? > Subject: Re: [Cif2-encoding] How we wrap this up > >>? >>>? > > >>? >>>? > Dear James > >>? >>>? > > >>? >>>? > As you may have gathered I have been reconsidering my position on > >>? this > >>? >>>? > issue. > >>? >>>? > Please forgive me, but I would like to change my vote if that is OK, > >>? in > >>? >>>? > favour of the 'any encoding' camp. > >>? >>>? > This apparent U-turn is not a response to recent > >>? >>>contributions; rather it > >>? >>>? is > >>? >>>? > the outcome of a meeting I had this morning > >? > >>>? > where I demonstrated some new software to the Managing > >Editor of IUCr > >>? >>>? > journals. > >>? >>>? > > >>? >>>? > By way of explanation: > >>? >>>? > > >>? >>>? > I have been developing a new docx template which the IUCr > >>? >>>editorial office > >>? >>>? > is shortly to release for use by > >>? >>>? > authors. The template will be packaged with some tools to extract > >>? data > >>? >>>? from > >>? >>>? > CIFs > >>? >>>? 
> and tabulate them in the Word document, e.g. open an mmCIF, click a > >>? >>>? button, > >>? >>>? > and standard > >>? >>>? > tables populated with data from the CIF will be included in > >>? >>>the document, > >>? >>>? > acting as > >>? >>>? > table templates for the author to edit as appropriate for their > >>? >>>? manuscript. > >>? >>>? > > >>? >>>? > Inclusion of the mmCIF tools is part of an unofficial policy to > >>? 'coax' > >>? >>>? > biologists to start using/accepting mmCIF > >>? >>>? > as a useful medium, rather than as a product of their deposition to > >>? the > >>? >>>? PDB, > >>? >>>? > and to encourage them to become comfortable > >>? >>>? > with passing mmCIFs between applications, and even to edit the > >>? >>>things (in > >>? >>>? > the same way as the core-CIF community > >>? >>>? > treats CIFs). For example, our perception is that there is no reason > >>? why > >>? >>>? an > >>? >>>? > author should not feel free to take an mmCIF > >>? >>>? > that has been created by e.g. pdb_extract and populate it using > >>? >>>? third-party > >>? >>>? > software before uploading to the PDB for > >>? >>>? > deposition. > >>? >>>? > > >>? >>>? > This cause would not be furthered by effectively invalidating > >>? >>>an mmCIF if > >>? >>>? it > >>? >>>? > were not to be encoded in one of > >>? >>>? > the specified encodings. > >>? >>>? > > >>? >>>? > So although I am uneasy about a specification that propogates > >>? >>>uncertainty, > >>? >>>? > I'm also uneasy about alienating users, > >>? >>>? > especially when we are struggling to change their mindset as in the > >>? case > >>? >>>? of > >>? >>>? > the biological community > >>? >>>? > (my perception of the biological community's attitude to mmCIF > >>? >>>is based on > >>? >>>? > feedback from authors/coeditors to > >>? >>>? > IUCr journals). > >>? >>>? > > >>? >? >>? > Granted this may not be the most compelling argument in favour of > >>? 'any > >>? >>>? 
> encoding', but recognizing the hurdles that > >>? >>>? > may have to be overcome once we move beyond ASCII whatever the CIF2 > >>? >>>? > specification, I support 'any encoding' > >>? >>>? > as 'a means to an end'. > >>? >>>? > > >>? >>>? > I will not provide my preferences in terms of the numbered options > >>? until > >>? >>? > you > >>? >>>? > say so; afterall, I have already voted and > >>? >>>? > all this has to be signed off by COMCIFs in any case. > >>? >>>? > > >>? >>>? > Cheers > >>? >>>? > > >>? >>>? > Simon > >>? >>>? > > >>? >>>? > > >>? >>>? > > >>? >>>? > > >>? >>> > >>? >>>>_______________________________________________________________________ > >>? ____ > >>? >>>? _ > >>? >>>? > From: "Bollinger, John C" > >>?>>><John.Bollinger at STJUDE.ORG> JUDE.ORG>John.Bollinger at STJUDE.ORG> >> > >>ilto:John.Bollinger at STJUDE.ORG>John > .Bollinger at STJUDE.ORG> > >>? >>>? > To: Group for discussing encoding and content validation > >>? >>>schemes for CIF2 > >>? >>>? > > >>?>>><cif2-encoding at iucr.org> > cif2-encoding at iucr.org> >> > >>if2-encoding at iucr.org>cif2-encoding at iucr.o > rg> > >>? >>>? > Sent: Friday, 24 September, 2010 14:50:57 > >>? >>>? > Subject: Re: [Cif2-encoding] How we wrap this up > >>? >>>? > > >>? >>>? > Dear Simon, > >>? >>>? > > >>? >>>? > It is exactly this sort of issue that drove me to support more > >>? >>>permissive > >>? >>>? > encoding rules and ultimately to devise the UTF-8 + UTF-16 + local > >>? >>>? proposal. > >>? >>>? > > >>? >>>? > Do please think about the considerations Herb raised.? As you > >>? reconsider > >>? >>>? > your votes, I urge you also to ask yourself what, *precisely*, a > >>? "text > >>? >>>? file" > >>? >>>? > is, and to consider whether your answer is functionally > >>? >>>different from my > >>? >>>? > "local".? If you decide not, then please consider what that > >>? >>>answer implies > >? > >>>? > about CIF2 support of UTF-8 and UTF-16 (which evidently you favor) > >>? under > >>? >>>? 
> each option on the table, especially for CIFs containing non-ASCII > >>? >>>? > characters.? Whatever you decide about the meaning of "text > >>? >>>file", please > >>? >>>? > consider whether reasonable people might reach a different > >>? >>>conclusion, as > >>? >>>? I > >>? >>>? > assert they might do, and to what extent the standard needs to > >>? address > >>? >>>? that. > >>? >>>? > > >>? >>>? > > >>? >>>? > Regards, > >>? >>>? > > >>? >>>? > John > >>? >>>? > -- > >>? >>>? > John C. Bollinger, Ph.D. > >>? >>>? > Department of Structural Biology > >>? >>>? > St. Jude Children's Research Hospital > >>? >>>? > > >>? >>>? > > >>? >>>? > >From: > >>?>>>cif2-encoding-bounces at iucr.org>cif2-encoding-bo > unces at iuc > >> > >>r.org>cif2-encoding-bounces at iucr.org> ng-bounces at iucr.org>cif2-encoding-bounces at iucr.org > >> > >>? >>>? > > >>?>>>[mailto:cif2-encoding-bounces at iucr.org>cif2-enc > oding-bou > >> > >>nces at iucr.org>cif2-encoding-bounce > s at iucr.org>cif2-encoding-bounces@ > >>? iucr.org] > >>? >>>On Behalf Of SIMON WESTRIP > >>? >>>? > >Sent: Friday, September 24, 2010 7:53 AM > >>? >>>? > >To: Group for discussing encoding and content validation > >>? >>>schemes for CIF2 > >>? >>>? > >Subject: Re: [Cif2-encoding] How we wrap this up. . > >>? >>>? > > > >>? >>>? > >Dear Herbert > >>? >>>? > > > >>? >>>? > >Not for the first time, I find your arguement persuasive. Brian's > >>? vote > >>? >>>? and > >>? >>>? > explanation have also raised some > >>? >>>? > >questions that I would like to look into. > >>? >>>? > > > >>? >>>? > >I will confirm or otherwise my vote as soon as possible, > >>? >>>assuming that is > >>? >>>? > OK with James and assuming that > >>? >>>? > >this round of votes might wrap this up. > >>? >>>? > > > >>? >>>? > >Cheers > >>? >>>? > > > >>? >>>? > >Simon > >>? >>>? > > > >>? >>>? > >________________________________________ > >>? >>>? > >From: Herbert J. 
Bernstein > >>?>>><yaya at bernstein-plus-sons.com> ein-plus-sons.c>yaya at bernstein-plus-sons.c > >> > >>om>yaya at bernstein-plus-sons.com> s-sons.com>yaya at bernstein-plus-sons.com> > >>? >>>? > >To: Group for discussing encoding and content validation > >>? >>>schemes for CIF2 > >>? >>>? > > >>?>>><cif2-encoding at iucr.org> > cif2-encoding at iucr.org> >> > >>if2-encoding at iucr.org>cif2-encoding at iucr.o > rg> > >>? >>>? > >Sent: Friday, 24 September, 2010 13:17:14 > >>? >>>? > >Subject: Re: [Cif2-encoding] How we wrap this up > >>? >>>? > > > >>? >>>? > >If he ignores the standard, in most cases all he has to do to > >>? >>>comply with > >>? >>>? > CIF2 is to run whatever applications he currently runs to produce > >>? CIF1 > >>? >>>? and, > >>? >>>? > perhaps, in some cases, run a minor edit pass at the end, to convert > >>? for > >>? >>>? the > >>? >>>? > minor syntactive differences and/or changed tags required to comply > >>? with > >>? >>>? > CIF2 and the new dictionaries, but he is unlikely to have to do > >>? anything > >>? >? >>? to > >>? >>>? > deal with the messy business of whether his encoding is really a > >>? proper > >>? >>>? UTF8 > >>? >>>? > encoding or not. > >>? >>>? > > >>? >>>? > >The punishment if he tries to comply, is that he has to totally > >>? uproot > >>? >>>? and > >>? >>>? > reconfigure the environment in which he produces CIFs from > >>? >>>whatever he is > >>? >>>? > currently doing to create an enviroment in which he can reliably > >>? create > >>? >>>? and, > >>? >>>? > more importantly, transmit compliant UTF8 files.? This can be > >>? >>>very tricky > >>? >>>? if > >>? >>>? > he does only a partial job, say fudging in one special > >>? >>>application (yet to > >>? >>>? > be written), because if he stays with his old system, all kinds of > >>? tools > >>? >>>? > will keep trying to transcode whatever he has produced back to > >>? whatever > >>? >>>? his > >>? >>>? > system considers a standard. 
Those of us who have files, > >>? >>>applications and > >? > >>>? > tools that have lived through several generations of macs are > >>? >>>living proof > >>? >>>? > of the problem. Macs now have excellent UTF8/16 unicode > >>? >>>support, but every > >>? >>? > > once in a while in working with a unicode file I find it has been > >>? >>>? strangely > >>? >>>? > and unexpectedly converted to something else, and it can be > >>? >>>really tricky > >>? >>>? to > >>? >>>? > spot when the unaccented roman text part has been left > >>? >>>untouched but just > >>? >>>? a > >>? >>>? > few accen > >>? >>>? > ted letters have gotten different accents. > >>? >>>? > > >>? >>>? > >Mandating UTF8 is simply trying to shift a serious software > >>? >>>problem from > >>? >>>? > the central handlers of CIF (IUCr, PDB, etc.) to the external > >>? >>>users. Most > >>? >>>? > users will probably have the good sense to simply ignore the demand > >>? and > >>? >>>? > leave the burden just where it is now.? A few sophisticated users > >>? will > >>? >>>? > probably adapt with no trouble, but the punishment for those users > >>? who > >>? >>>? > blindly follow orders before we have a complete multiplatform > >>? supporting > >>? >>>? > infrastructure in place by mandating UTF8 is severe, expensive and > >>? >>>? > undeserved.? Until and unless we have developed solid support, we > >>? will > >>? >>>? just > >>? >>>? > be alienating people from CIF.? I will continue to oppose such a > >>? move. > >>? >>>? > > >>? >>>? > [...] > >>? >>>? > > >>? >>>? > > >>? >>>? > Email Disclaimer: > >>?>>><<http://www.stjude.org/emaildisclaimer> emaildiscl>http://www.stjude.org/emaildiscl > >> > >>aimer><http://www.stjude.org/emaildisclaimer> org/emaildisclaimer>www.stjude.org/emaildisclaimer > >> > >>? >>>? > _______________________________________________ > >>? >>>? > cif2-encoding mailing list > >>? >>>? 
> > >>?>>>cif2-encoding at iucr.org>c > if2-encoding at iucr.org> >> > >>f2-encoding at iucr.org>cif2-encoding at iucr.org > > >>? >>>? > > >>?>>><<http://scripts.iucr.org/mailman/listinfo/cif > 2-encoding>http://scripts. > >> > >>iucr.org/mailman/listinfo/cif2-encoding><http://scripts.iucr.org/ma > ilman/li > >> > >>stinfo/cif2-encoding>http://scripts.iucr.org/ma > ilman/listinfo/cif2-encoding > >> > >>? >>>? > > >>? >>>? > > >>? >>> > >>? >>> > >>? >> > >>? >>_______________________________________________ > >>? >>cif2-encoding mailing list > >>?>>cif2-encoding at iucr.org>cif2-encod > ing at iucr.org > >>?>><http://scripts.iucr.org/mailman/listinfo/cif2- > encoding>http://scripts.iu > >>? cr.org/mailman/listinfo/cif2-encoding > >>? > > >>? > > >>? >-- > >>? >===================================================== > >>? >? Herbert J. Bernstein, Professor of Computer Science > >>? >? ? Dowling College, Kramer Science Center, KSC 121 > >>? >? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > >>? > > >>? >? ? ? ? ? ? ? ? ? +1-631-244-3035 > >>? > > >>yaya at dowling.edu>yaya at dowling.edu > >>? >===================================================== > >>? >_______________________________________________ > >>? >cif2-encoding mailing list > >>?>cif2-encoding at iucr.org>cif2-encodi > ng at iucr.org > >>?><http://scripts.iucr.org/mailman/listinfo/cif2-e > ncoding>http://scripts.iuc > >>? r.org/mailman/listinfo/cif2-encoding > >>? > > >>? > > >>? >_______________________________________________ > >>? >cif2-encoding mailing list > >>? >cif2-encoding at iucr.org > >>?>http://scripts.iucr.org/mailman/listinfo/cif2-en > coding > >> > >> > >>? -- > >>? ===================================================== > >>? ? Herbert J. Bernstein, Professor of Computer Science > >>? ? ? Dowling College, Kramer Science Center, KSC 121 > >? >? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > >> > >>? ? ? ? ? ? ? ? ? ? +1-631-244-3035 > >>? ? ? ? ? ? ? ? ? ? yaya at dowling.edu > >>? 

From John.Bollinger at STJUDE.ORG Mon Sep 27 22:55:37 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Mon, 27 Sep 2010 16:55:37 -0500
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References: <63870.31508.qm@web87006.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDC@SJMEMXMBS11.stjude.sjcrh.local> <80062.82001.qm@web87012.mail.ird.yahoo.com> <162941.37460.qm@web87004.mail.ird.yahoo.com> <780727.99055.qm@web87010.mail.ird.yahoo.com> <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local>

Dear Herb,

On Monday, September 27, 2010 8:45 AM, Herbert J. Bernstein wrote:

>The problem is that options 3, 4 and 5 specifically prescribe the use of Unicode characters (that is the entire point of those options -- and that is the point in dispute -- whether we should be prescribing UTF8 or using it as we now use ASCII, as a way to be clear what we are talking about as in CIF1) and we simply are not ready to deal with such a requirement yet.

I think I have reached my own epiphany regarding your position. Do correct me if I am wrong, but I now think you're saying that you don't want to distinguish any particular encoding(s) as universally acceptable (much less universally required), correct? If so, would it be fair to describe that as just the "local" part of option 5?

[...]

On Monday, September 27, 2010 12:07 PM, Herbert J. Bernstein wrote:

>Ah, now I begin to understand the difference in our view. I view CIF for
>journal use and PDB deposition as having a controlled vocabulary, via
>combinations of dictionaries, advice to authors, deposition standards,
>etc. You seem to view CIF as allowing completely arbitrary, uncontrolled
>text. [...]

Yes, I intentionally take an unilluminated view of the problem, but that is both purposeful and useful. Text is the foundation on which CIF is built. The bulk of the spec is devoted to defining which text conforms and which does not. The "F" in "CIF" stands for "file," however, and if the spec is to answer the question of which *files* conform, or the related question of what a particular file means, then it needs to address the mapping between "text" and "file". Options (1) and (2) seem crafted specifically to avoid doing so.

I understand using local convention to fill the gap (a la "local"), but I fail to see how any amount of author instructions, deposition standards, etc. can adequately do the same. At best that moves a burden that rightfully should be borne by the format spec onto application-dependent external documents, some outside IUCr's control. I have shown by my advocacy for option (5) that I am willing to make the definition of a conformant CIF system-dependent. I acknowledge that various applications place different demands on the data content of CIFs they consume. I am not, however, willing to make the basic definition of CIF conformance application-dependent.

>Please note that proposals 1 and 2 do _not_ affect "which
>byte-sequence representations of those characters will conform to
>CIF2, under which circumstances" because they are not rigidly
>prescriptive about any
>particular byte sequences.

Options (1) and (2) certainly DO affect that question, if only by leaving it open to later, possibly conflicting, interpretation by COMCIFS, individual developers, and others. Option 5 is about as permissive as it reasonably can be regarding the binary form a CIF may take, while still being definitive enough that general-purpose software can be written to read conformant CIFs. If my new understanding of your viewpoint is correct, however, then your objection may be that option 5 is *too* permissive on account of its explicit allowance for UTF-8 and UTF-16. I would be willing to drop the explicit UTF-16 support (though UTF-16 might nevertheless squeeze in as "local" in some environments). I will under no circumstances, however, support an alternative that allows any file to be found non-conformant on account of its being encoded in UTF-8.

[...]

>This is really getting out of hand. We need a meeting. If
>everyone will send me their Skype IDs, I will volunteer to
>set up a Skype conference call at some time that works for
>everybody (which I suspect will be 4 am EDT). My guess is that
>1-2 hours of polite discussion will resolve this. What
>do we have to lose?

Is there anything to gain? The last few days have been more illuminating than the last several weeks, but it still seems evident to me that there is a fundamental difference of opinion. I will not support an alternative that fails to make UTF-8 a universally supported character encoding for CIF, and it seems clear that James will not, either. You seem adamant that there be no such universal requirement. I think I understand your position better than I used to do, but I don't see where there is any scope for a consensus. My best offer is already on the table in option (5) +- UTF-16.

Respectfully,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From yaya at bernstein-plus-sons.com Mon Sep 27 23:18:39 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Mon, 27 Sep 2010 18:18:39 -0400 (EDT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local>
References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

Dear Colleagues,

John declines to join in a meeting because ...

"Is there anything to gain? The last few days have been more illuminating than the last several weeks, but it still seems evident to me that there is a fundamental difference of opinion.
I will not support an alternative that fails to make UTF-8 a universally supported character encoding for CIF, and it seems clear that James will not, either. You seem adamant that there be no such universal requirement. I think I understand your position better than I used to do, but I don't see where there is any scope for a consensus. My best offer is already on the table in option (5) +- UTF-16." I a very sorry to hear that. I hope the rest of you will have the courtesy to participate in a Skype meeting. Perhaps no new facts or logic will come to light. Perhaps something will come to light that leads to better common understanding and concensus. We'll never know unless we try. I for one think that most of us are open minded and willing to try to reach an accomodation that serves the community well. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Mon, 27 Sep 2010, Bollinger, John C wrote: > Dear Herb, > > On Monday, September 27, 2010 8:45 AM, Herbert J. Bernstein wrote: > >> The problem is that options 3,4 and 5 specifically prescribe the use of Unicode characters (that is the entire point of those options -- and that is the point in dispute -- whether we should be prescribing UTF8 or using is as we now use ASCII, as a way to be clear what we are talking about as in CIF1) and we simply are not ready to deal such a requirement yet. > > I think I have reached my own epiphany regarding your position. Do correct me if I am wrong, but I now think you're saying that you don't want to distinguish any particular encoding(s) as universally acceptable (much less universally required), correct? If so, would it be fair to describe that as just the "local" part of option 5? > > [...] 
> > On Monday, September 27, 2010 12:07 PM, Herbert J. Bernstein wrote: > >> Ah, now I begin to understand the difference in our view. I view CIF for >> journal use and PDB deposition as having a controlled vocabulary, via >> combinations of dictionaries, advice to authors, deposition standards, >> etc. You seem to few CIF as allowing completely arbitrary, uncontrolled >> text. [...] > > Yes, I intentionally take an unilluminated view of the problem, but that is both purposeful and useful. Text is the foundation on which CIF is built. The bulk of the spec is devoted to defining which text conforms and which does not. The "F" in "CIF" stands for "file," however, and if the spec is to answer the question of which *files* conform, or the related question of what a particular file means, then it needs to address the mapping between "text" and "file". Options (1) and (2) seem crafted specifically to avoid doing so. > > I understand using local convention to fill the gap (ala "local"), but I fail to see how any amount of author instructions, deposition standards, etc. can adequately do the same. At best that moves a burden that rightfully should be borne by the format spec onto application-dependent external documents, some outside IUCr's control. I have shown by my advocacy for option (5) that I am willing to make the definition of a conformant CIF system-dependent. I acknowledge that that various applications place different demands on the data content of CIFs they consume. I am not, however, willing to make the basic definition of CIF conformance application-dependent. > >> Please note that proposals 1 and 2 do _not_ affect "which >> byte-sequence representations of those characters will conform to >> CIF2, under which circumstances" because they are not rigidly >> prescriptive about any >> particular byte sequences. 
> > Options (1) and (2) certainly DO affect that question, if only by leaving it open to later, possibly conflicting, interpretation by COMCIFS, individual developers, and others. Option 5 is about as permissive as it reasonably can be regarding the binary form a CIF may take, while still being definitive enough that general-purpose software can be written to read conformant CIFs. If my new understanding of your viewpoint is correct, however, then your objection may be that option 5 is *too* permissive on account of its explicit allowance for UTF-8 and UTF-16. I would be willing to drop the explicit UTF-16 support (though UTF-16 might nevertheless squeeze in as "local" in some environments). I will under no circumstances, however, support an alternative that allows any file to be found non-conformant on account of its being encoded in UTF-8. > > [...] > >> This is really getting out of hand. We need a meeting. If >> everyone will send me their Skype id's, I will volunteer to >> set up a Skype conference call at some time that works for >> everybody (which I suspect will be 4 am EDT). My guess is that >> 1-2 hours of polite discussion will resolve this. What >> do we have to lose? > > Is there anything to gain? The last few days have been more illuminating than the last several weeks, but it still seems evident to me that there is a fundamental difference of opinion. I will not support an alternative that fails to make UTF-8 a universally supported character encoding for CIF, and it seems clear that James will not, either. You seem adamant that there be no such universal requirement. I think I understand your position better than I used to do, but I don't see where there is any scope for a consensus. My best offer is already on the table in option (5) +- UTF-16. > > > Respectfully, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. 
Jude Children's Research Hospital > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > From yaya at bernstein-plus-sons.com Tue Sep 28 00:14:12 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Mon, 27 Sep 2010 19:14:12 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Now to the substance of John's argument, which he also attributes to James: "I will not support an alternative that fails to make UTF-8 a universally supported character encoding for CIF, and it seems clear that James will not, either." To me it seems the key to this position is "universally supported" which would seem to imply that there be full documentation and software to allow CIF users to work within that context. The practicality of that position comes down to the IUCr and the PDB working out workflows and the necessary support infrastructure to get the entire small and macromolecular CIF community to make the transition. Do the IUCr journal operation and the PDB have plans and a realistic timeline to do this? If not, then are CIF2 and dREL to wait on the sidelines until then? If so, may we please see them to get a sense of when the transition could happen?
I suspect we are 90% of the way to where we need to be, but in the same sense as 90% is used in the old saw: The first 90% of the effort takes the first 90% of the time, and the last 10% of the effort takes the last 90% of the time. But I may be wrong. Let's see the plans for the proposed transition for the IUCr journal operation and the PDB, dealing with both core CIF and mmCIF in a UTF8 CIF2 world. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From jamesrhester at gmail.com Tue Sep 28 07:40:40 2010 From: jamesrhester at gmail.com (James Hester) Date: Tue, 28 Sep 2010 16:40:40 +1000 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: I
happily confess to agreeing with John on this one. I am beginning to think I might understand Herbert's viewpoint. Correct me Herbert if I am wrong, but you are suggesting that current CIF1 workflows are implicitly or explicitly choosing some encoding when they operate with CIF1 files. If we mandate some other encoding for CIF2, then those workflows will be adversely affected as they will need to choose a different encoding. I'll critique this further if Herbert confirms that this is what he is getting at. -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100928/112127cc/attachment.html From jamesrhester at gmail.com Tue Sep 28 08:05:49 2010 From: jamesrhester at gmail.com (James Hester) Date: Tue, 28 Sep 2010 17:05:49 +1000 Subject: [Cif2-encoding] Addressing Brian's concerns Message-ID: In this email I address Brian's comments. I have reproduced the email in full at the end for reference. To save reading the whole email, you may download 'juffed', based on Qt, and cut and paste between foreign language webpages in your browser and juffed to see multiple-encoding cut and paste in action. There is thus no impediment to publCIF operating in a multiple-encoding environment. Brian writes: I sympathise greatly with James's desire for a prescriptive, "binary" approach, but its corollary is that a CIF application must take full responsibility for expressing any supported extended character set (I mean accented Latin letters, Greek characters, Cyrillic or Chinese alphabets). This is not correct. A typical application at minimum will only need to parse the CIF file such that each tag has an associated string value.
What higher levels of the application do with those tags and strings is application-dependent. The only applications that need to worry about actually displaying glyphs are those concerned with text display. Most Unicode-aware software (e.g. web browsers) simply displays a default character if it is not able to display a particular glyph. That does not mean that the code point disappears, simply that it is not displayed. So I do not see that being required to display all of Unicode is a valid criticism of the 'binary' approach, as displaying all of Unicode is not a requirement.

> First off, I don't know how difficult that is technically. I would guess that rather than trying to handle arbitrary keyboard mappings, the natural approach would be to pick from a graphical character grid. (What are the implications for this of glyph rendering - does a CIF editor have to be compiled with its own large font library?)

If I understand correctly, this paragraph is not relevant if there is no implicit requirement to display all of Unicode. I will just add that there is no need to choose a less comprehensive encoding just because you don't want to display characters outside a certain range. The various mappings involved in text display are all decoupled. Just to list them for clarity:

(1) Mapping from keyboard code to code point;
(2) Mapping from code point to on-disk binary representation (this is the encoding we have been talking about);
(3) Mapping from code point to font coordinate (display glyph).

You can restrict the set of display glyphs without changing the underlying file encoding. For example, a CIF-aware application for handling CIF2 text could bundle different language packs as Adobe does for PDF.

> But that's a laborious method of authoring if relatively large amounts of "non-standard" text are involved, and the way that authors would prefer to work, surely, is by copying and pasting text from Word or some other tool of choice. Permitting that necessarily pollutes the "binary" approach with byte streams delivered by text-oriented applications.

I agree that authors would probably prefer to cut and paste. Does this pollute the 'binary' approach more than the 'any encoding' approach? How much of the cut-and-pastability of CIF1 text is due to the commonality of ASCII encoding for the CIF text codepoints, and how much to magic CIF1 pixie dust that translates the encoding of the cut text to that of the text document it is pasted into? Anyway, more about cutting and pasting is included below.

> If I could be sure that publCIF, say, can be compiled with libraries that reliably transcode byte streams imported from clipboards and file import (across the mess of SMB/NFS mounts etc. that exist in the real world) - and equally reliably transcode its UTF8 encoded text to the author's locale-based clipboard, then I'd be more willing to promote option 3 to the top as the starting point at least for CIF 2.0 (but its "enforcement" does depend on the availability of such a robust CIF-editing tool).

Short form of my answer: publCIF will be able to work well under both proposals as it is interactive and uses Qt.

Long form: You don't even need any new libraries for this! Qt (upon which publCIF is built) aims to do it all transparently for you. Let me be your software architect. Assume publCIF always handles UTF8 text as per my preferred option.

Cut and paste (clipboard import): When text is imported from the QClipboard (an abstraction of the system clipboard) into publCIF, publCIF should always request QMimeData, which will return an object containing the text and the encoding of the text. Other standard Qt text transcoding functions can then be applied to convert the text in one encoding to text in the target encoding. I estimate about 10 lines of new code.
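The transcoding step described above can be sketched in a few lines. This is only an illustration using Python's standard codecs, not the Qt API that publCIF would actually call; the function name `transcode_to_utf8` is hypothetical, and the sketch assumes the source encoding is already known (for example, reported by the clipboard).

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert bytes in a known source encoding to UTF-8 bytes.

    Hypothetical helper: decoding maps bytes to code points, and
    encoding maps the same code points to their UTF-8 representation.
    """
    text = data.decode(source_encoding)  # raises UnicodeDecodeError on bad input
    return text.encode("utf-8")


# Text arriving in two different legacy encodings ends up in one target encoding.
japanese = "日本語".encode("euc_jp")
cyrillic = "Привет".encode("cp1251")
print(transcode_to_utf8(japanese, "euc_jp").decode("utf-8"))
print(transcode_to_utf8(cyrillic, "cp1251").decode("utf-8"))
```

Because a failed decode raises an error rather than silently substituting bytes, a caller can report the problem to the user at paste time.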
The other direction is even more of a no-brainer, as publCIF need simply set the encoding of its source text in the mimeData object it passes to the clipboard, and the clipboard will transparently handle transcoding as needed. I would note that this description applies equally well to the 'as for CIF1' proposal, with the potential simplification that, if source and target texts are known to be in the same encoding, no transcoding is necessary.

I suggest you download the free Qt-based editor called 'Juffed' and play with cutting and pasting from international web pages. I have just pasted bits of text encoded in euc-jp, utf-8 and win-1251 into Juffed and all displayed correctly. As this is based on the same libraries and technology that publCIF is, I think your worries are unfounded. Note also that cutting and pasting is a user-mediated operation, so the user sees both the input text and the output text. This means that transcoding errors (which, others have reported, may occasionally occur for single characters) are more likely to be caught than in a situation where transcoding is done silently in the background.

Import CIF file from some undefined location: under my UTF8/16 proposal, there is no issue doing this, as the file is supposed to be UTF8/16. Under the 'as for CIF1' proposal (which Brian paradoxically supports?), or even the 'local + UTF8/16' proposal, you are *on your own* as far as figuring out the source file encoding and I know of no automated solution. As a practical matter, because publCIF is interactive, you could prompt the user to specify the encoding when UTF8/16 are not found, in the same way that browsers allow encoding to be set. But that latter behaviour is entirely your decision.

While I'm rewriting publCIF in my head, I will note in passing that in terms of fonts, publCIF is already well set up. The Linux version of publCIF allows me to choose a Unicode font, for example, which displays the Greek symbols perfectly, and Windows will have its own Unicode font available.

So, I do not think that publCIF can be used as a way to distinguish between the competing proposals on the table.

all the best,
James.

==========================
Brian's email in full for reference

My vote:

Preference  Option
    1       2. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII', together with Brian's *recommendations*
    2       1. Herbert's 'as for CIF1 proposal with UTF8 in place of ASCII' recently posted here and to COMCIFS.
    3       4. UTF8 + UTF16
    4       3. UTF8-only as in the original draft
    5       5. UTF8, UTF16 + "local"

Rationale: I still feel this argument is at heart a "binary/text" dichotomy, where "binary" implies that one can prescribe specific byte-level representations of every distinct character; "text" implies that you're at the mercy of external libraries and mappings between encoding conventions - and those mappings are not always explicit or easy to identify.

I sympathise greatly with James's desire for a prescriptive, "binary" approach, but its corollary is that a CIF application must take full responsibility for expressing any supported extended character set (I mean accented Latin letters, Greek characters, Cyrillic or Chinese alphabets).

First off, I don't know how difficult that is technically. I would guess that rather than trying to handle arbitrary keyboard mappings, the natural approach would be to pick from a graphical character grid. (What are the implications for this of glyph rendering - does a CIF editor have to be compiled with its own large font library?)

But that's a laborious method of authoring if relatively large amounts of "non-standard" text are involved, and the way that authors would prefer to work, surely, is by copying and pasting text from Word or some other tool of choice.
Permitting that necessarily pollutes the "binary" approach with byte streams delivered by text-oriented applications. If I could be sure that publCIF, say, can be compiled with libraries that reliably transcode byte streams imported from clipboards and file import (across the mess of SMB/NFS mounts etc. that exist in the real world) - and equally reliably transcode its UTF8 encoded text to the author's locale-based clipboard, then I'd be more willing to promote option 3 to the top as the starting point at least for CIF 2.0 (but its "enforcement" does depend on the availability of such a robust CIF-editing tool). I prefer the UTF8 + UTF16 option over UTF8-only because of the real-world use case that Herbert has described before; and in existing imgCIF applications the UTF16 encoding is being done rather carefully and for a specific purpose. I put option 5 at the bottom because of the non-portability of a "local" encoding. Note, though, that whatever the outcome I would still favour the discussion of character set encodings to be presented as a Part 3 to the complete CIF2 spec. Best wishes Brian _________________________________________________________________________ Brian McMahon tel: +44 1244 342878 Research and Development Officer fax: +44 1244 314888 International Union of Crystallography e-mail: bm at iucr.org 5 Abbey Square, Chester CH1 2HU, England -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100928/3c3b25b1/attachment-0001.html From simonwestrip at btinternet.com Tue Sep 28 10:54:27 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Tue, 28 Sep 2010 09:54:27 +0000 (GMT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <646265.82162.qm@web87004.mail.ird.yahoo.com> "I [John] will not support an alternative that fails to make UTF-8 a universally supported character encoding for CIF, and it seems clear that James will not, either." Herbert has recently stated "I personally have no objection to specification of UTF-8 as a default encoding in the absence of indications of some different encoding." To me, this supports making UTF-8 a universally supported character encoding. I would hope that a developer reading this might conclude that if the encoding is not recognizable (UTF16 is recognizable), then they should default to UTF8, so CIF software must be able to handle UTF8. A user reading this, especially in the context of recommendations, would also hopefully conclude that UTF8 is the default encoding, and be aware that if they wish to make use of the new non-ASCII support offered by CIF2, they may have to pay attention to how that new feature should be used. In the end they can continue to do what they have always done when editing CIFs, which is to work with ASCII. 'As for CIF1...' allows them to do this without having to worry about the encoding; while the specification of a default encoding gives both developers and users the means to make use of non-ASCII, without uncertainty.
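The "recognize UTF16, otherwise default to UTF8" behaviour described above might look like the following minimal sketch. It assumes that UTF-16 is recognized by a leading byte-order mark, which the thread does not spell out; the function name `decode_cif_bytes` is hypothetical and is not taken from any existing CIF library.

```python
import codecs


def decode_cif_bytes(data: bytes) -> str:
    """Decode CIF file bytes: UTF-16 if a BOM says so, otherwise UTF-8.

    Assumption: UTF-16 input always starts with a byte-order mark.
    UTF-32 and other encodings are deliberately not handled here.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16")
    # Default encoding; "utf-8-sig" also tolerates an optional UTF-8 BOM.
    return data.decode("utf-8-sig")


print(decode_cif_bytes("_cell_angle_α 90".encode("utf-16")))
print(decode_cif_bytes("_cell_angle_α 90".encode("utf-8")))
print(decode_cif_bytes(b"_cell_angle_alpha 90"))
```

A pure-ASCII file takes the default branch unchanged, since ASCII is a subset of UTF-8, which is why existing ASCII-only workflows would not need to change under this reading.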
So I think the 'As for CIF1...' proposals with this explicit default encoding are certainly heading towards a workable compromise. Herbert is unhappy to mandate a particular encoding for non-ASCII use, but has agreed to recommend UTF8 and UTF16 in such cases. Such recommendations along with a default encoding that should be adopted in the absence of any pointers to the contrary could boil down to UTF8/16 + local to all intents and purposes, and could boil down to UTF8/16 if you want to use non-ASCII text. Cheers Simon ________________________________ From: James Hester To: Group for discussing encoding and content validation schemes for CIF2 Sent: Tuesday, 28 September, 2010 7:40:40 Subject: Re: [Cif2-encoding] How we wrap this up I happily confess to agreeing with John on this one. I am beginning to think I might understand Herbert's viewpoint. Correct me, Herbert, if I am wrong, but you are suggesting that current CIF1 workflows are implicitly or explicitly choosing some encoding when they operate with CIF1 files. If we mandate some other encoding for CIF2, then those workflows will be adversely affected as they will need to choose a different encoding. I'll critique this further if Herbert confirms that this is what he is getting at. On Tue, Sep 28, 2010 at 9:14 AM, Herbert J. Bernstein wrote: Now to the substance of John's argument, which he also attributes to >James: > > >"I will not support an alternative that fails to make UTF-8 a universally >supported character encoding for CIF, and it seems clear that James will >not, either." > >To me it seems the key to this position is "universally supported" which >would seem to imply that there be full documentation and software to allow >CIF users to work within that context. The practicality of that position >comes down to the IUCr and the PDB working out workflows and the necessary >support infrastructure to get the entire small and macromolecular CIF >community to make the transition.
Do the IUCr journal operation and the >PDB have plans and a realistic timeline to do this? If not, then are CIF2 >and dREL to wait on the sidelines until then? If so, may we please see >them to get a sense of when the transition could happen? > >I suspect we are 90% of the way to where we need to be, but in the same >sense as 90% is used in the old saw: The first 90% of the effort takes >the first 90% of the time, and the last 10% of the effort takes the last >90% of the time. But I may be wrong. Let's see the plans for the >proposed transition for the IUCr journal operation and the PDB, dealing >with both core CIF and mmCIF in a UTF8 CIF2 world. > > >Regards, > Herbert > >===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu >===================================================== > > >On Mon, 27 Sep 2010, Herbert J. Bernstein wrote: > >> Dear Colleagues, >> >> John declines to join in a meeting because ... >> >> "Is there anything to gain? The last few days have been more illuminating >> than the last several weeks, but it still seems evident to me that there >> is a fundamental difference of opinion. I will not support an alternative >> that fails to make UTF-8 a universally supported character encoding for >> CIF, and it seems clear that James will not, either. You seem adamant >> that there be no such universal requirement. I think I understand your >> position better than I used to do, but I don't see where there is any >> scope for a consensus. My best offer is already on the table in option >> (5) +- UTF-16." >> >> I am very sorry to hear that. I hope the rest of you will have the >> courtesy to participate in a Skype meeting. Perhaps no new facts >> or logic will come to light. Perhaps something will come to light >> that leads to better common understanding and consensus.
We'll never know >> unless we try. I for one think that most of us are open minded and >> willing to try to reach an accommodation that serves the community well. >> >> Regards, >> Herbert >> >> ===================================================== >> Herbert J. Bernstein, Professor of Computer Science >> Dowling College, Kramer Science Center, KSC 121 >> Idle Hour Blvd, Oakdale, NY, 11769 >> >> +1-631-244-3035 >> yaya at dowling.edu >> ===================================================== >> >> On Mon, 27 Sep 2010, Bollinger, John C wrote: >> >>> Dear Herb, >>> >>> On Monday, September 27, 2010 8:45 AM, Herbert J. Bernstein wrote: >>> >>>> The problem is that options 3, 4 and 5 specifically prescribe the use of Unicode >>>>characters (that is the entire point of those options -- and that is the point >>>>in dispute -- whether we should be prescribing UTF8 or using it as we now use >>>>ASCII, as a way to be clear what we are talking about as in CIF1) and we simply >>>>are not ready to deal with such a requirement yet. >>> >>> I think I have reached my own epiphany regarding your position. Do correct me >>>if I am wrong, but I now think you're saying that you don't want to distinguish >>>any particular encoding(s) as universally acceptable (much less universally >>>required), correct? If so, would it be fair to describe that as just the >>>"local" part of option 5? >>> >>> [...] >>> >>> On Monday, September 27, 2010 12:07 PM, Herbert J. Bernstein wrote: >>> >>>> Ah, now I begin to understand the difference in our view. I view CIF for >>>> journal use and PDB deposition as having a controlled vocabulary, via >>>> combinations of dictionaries, advice to authors, deposition standards, >>>> etc. You seem to view CIF as allowing completely arbitrary, uncontrolled >>>> text. [...] >>> >>> Yes, I intentionally take an unilluminated view of the problem, but that is >>>both purposeful and useful. Text is the foundation on which CIF is built.
The >>>bulk of the spec is devoted to defining which text conforms and which does not. >>> The "F" in "CIF" stands for "file," however, and if the spec is to answer the >>>question of which *files* conform, or the related question of what a particular >>>file means, then it needs to address the mapping between "text" and "file". >>> Options (1) and (2) seem crafted specifically to avoid doing so. >>> >>> I understand using local convention to fill the gap (ala "local"), but I fail >>>to see how any amount of author instructions, deposition standards, etc. can >>>adequately do the same. At best that moves a burden that rightfully should be >>>borne by the format spec onto application-dependent external documents, some >>>outside IUCr's control. I have shown by my advocacy for option (5) that I am >>>willing to make the definition of a conformant CIF system-dependent. I >>>acknowledge that various applications place different demands on the data >>>content of CIFs they consume. I am not, however, willing to make the basic >>>definition of CIF conformance application-dependent. >>> >>>> Please note that proposals 1 and 2 do _not_ affect "which >>>> byte-sequence representations of those characters will conform to >>>> CIF2, under which circumstances" because they are not rigidly >>>> prescriptive about any >>>> particular byte sequences. >>> >>> Options (1) and (2) certainly DO affect that question, if only by leaving it >>>open to later, possibly conflicting, interpretation by COMCIFS, individual >>>developers, and others. Option 5 is about as permissive as it reasonably can be >>>regarding the binary form a CIF may take, while still being definitive enough >>>that general-purpose software can be written to read conformant CIFs. If my new >>>understanding of your viewpoint is correct, however, then your objection may be >>>that option 5 is *too* permissive on account of its explicit allowance for UTF-8 >>>and UTF-16.
I would be willing to drop the explicit UTF-16 support (though >>>UTF-16 might nevertheless squeeze in as "local" in some environments). I will >>>under no circumstances, however, support an alternative that allows any file to >>>be found non-conformant on account of its being encoded in UTF-8. >>> >>> [...] >>> >>>> This is really getting out of hand. We need a meeting. If >>>> everyone will send me their Skype id's, I will volunteer to >>>> set up a Skype conference call at some time that works for >>>> everybody (which I suspect will be 4 am EDT). My guess is that >>>> 1-2 hours of polite discussion will resolve this. What >>>> do we have to lose? >>> >>> Is there anything to gain? The last few days have been more illuminating than >>>the last several weeks, but it still seems evident to me that there is a >>>fundamental difference of opinion. I will not support an alternative that fails >>>to make UTF-8 a universally supported character encoding for CIF, and it seems >>>clear that James will not, either. You seem adamant that there be no such >>>universal requirement. I think I understand your position better than I used to >>>do, but I don't see where there is any scope for a consensus. My best offer is >>>already on the table in option (5) +- UTF-16. >>> >>> >>> Respectfully, >>> >>> John >>> -- >>> John C. Bollinger, Ph.D. >>> Department of Structural Biology >>> St. 
Jude Children's Research Hospital >>> >>> >>> Email Disclaimer: www.stjude.org/emaildisclaimer >>> >>> _______________________________________________ >>> cif2-encoding mailing list >>> cif2-encoding at iucr.org >>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding >>> -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Tue Sep 28 11:08:26 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Tue, 28 Sep 2010 06:08:26 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear James, No, my position is more complex and pragmatic than that. I am not asserting that the IUCr and the PDB have workflows that are implicitly or explicitly choosing some (single) other encoding when they operate with CIF1 files.
I am asserting that the systems of journal submission and PDB operations have gotten tangled with the current non-encoding-specific text-based definition of CIF1 which involves editors, applications and practices in systems that are partially in (and under the control of) the central operations and partially in (and not completely under the control of) the user shops, and that the specification of _any_ single encoding without consideration of and provision of supporting editors, applications and practices to support the restrictive specification of that encoding at all points of this distributed system is pointlessly disruptive and wasteful. I am sorry that is not short and pithy, but it is my point of view. The short and pithy version is Murphy's Law -- "anything that can go wrong will" so before you make a change like this you need to work your way through all the work flows involved and be certain you have covered all the cases. We'd be in a very different position, if what was being proposed paralleled some well-supported existing approach but that idea has been repeatedly rejected in the belief that to be a standard, a specification has to be rigid with no alternatives -- a most unusual approach to software standards. By adopting that unusual position, we find ourselves with very little pre-existing support infrastructure. We can put ourselves in a very different position by developing the needed support documentation, editors and applications to give to the community to make the transition, but there is very little chance of doing that in time for next summer. In the present circumstances, until and unless demonstrably workable workflows supported by real software infrastructure are in hand, I believe it is premature to commit to effectively making CIF2 into a binary format and to commit to just a single encoding. Regards, Herbert ===================================================== Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu =====================================================
From yaya at bernstein-plus-sons.com Tue Sep 28 12:58:10 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Tue, 28 Sep 2010 07:58:10 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <646265.82162.qm@web87004.mail.ird.yahoo.com> References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> Message-ID: Dear Colleagues, I am puzzled. People do not seem to want to have a meeting, but they do seem to want to keep "talking" in the form of emails that repeat points we all have discussed many, many times. Please recognize that this is not working. A meeting or e-meeting also may not work, but it is something we have not tried, and many other times in the history of CIF, meetings have resolved similarly seemingly intractable issues. Please, let's try having a Skype conference call. Have to go now -- time for my next Skype conference call, second one this morning. They really do seem to help.
Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Tue, 28 Sep 2010, SIMON WESTRIP wrote: > "I [John] will not support an alternative that fails to make UTF-8 a universally > supported character encoding for? CIF, and it seems clear that James will > not, either." > > Herbert has recently stated > > "I personally have no objection to specification of UTF-8 > as a default encoding in the absence of indications of some > different encoding." > > To me, this supports making UTF-8 a universally supported character encoding. > I would hope that a developer reading this might conclude that if the encoding is > not recognizable (UTF16 is recognizable), then they should default to UTF8, > so CIF software must be able to handle UTF8. > > A user reading this, especially in the context of recommendations, would also > hopefully > conclude that UTF8 is the default encoding, and be aware that if they wish to make > use > of the new non-ASCII support offered by CIF2, they may have to pay attention to how > that > new feature should be used. In the end they can continue to do what thay have always > done > when editing CIFs, which is to work with ASCII.? 'As for CIF1...' allows them to do > this without > ?having to worry about the encoding; while the specification of a default encoding > gives > both developers and users the means to make use of non-ASCII, without uncertainty. > > So I think the 'As for CIF1...' proposals with this explicit default encoding is > certainly > heading towards a workable compromise. Herbert is unhappy to mandate a particular > encoding > for non-ASCII use, but has agreed to recommend UTF8 and UTF16 in such cases. 
> Such recommendations along with a default encoding that should be adopted in the > absence of > any pointers to the contrary could boil down to UTF8/16 + local in all intents and > purposes, > and could boil down to UTF8/16 if you want to use non-ASCII text. > > Cheers > > Simon > > > _____________________________________________________________________________________ > From: James Hester > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Tuesday, 28 September, 2010 7:40:40 > Subject: Re: [Cif2-encoding] How we wrap this up > > I happily confess to agreeing with John on this one. > > I am beginning to think I might understand Herbert's viewpoint.? Correct me Herbert > if I am wrong, but you are suggesting that current CIF1 workflows are implicitly or > explicitly choosing some encoding when they operate with CIF1 files.? If we mandate > some other encoding for CIF2, then those workflows will be adversely affected as they > will need to choose a different encoding. > > I'll critique this further if Herbert confirms that this is what he is getting at. > > On Tue, Sep 28, 2010 at 9:14 AM, Herbert J. Bernstein > wrote: > Now to the substance of John's argument, which he also attributes to > James: > > "I will not support an alternative that fails to make UTF-8 a universally > supported character encoding for ?CIF, and it seems clear that James will > not, either." > > To me it seems the key to this position is "universally supported" which > would seem to imply that there be full documentation and software to allow > CIF users to work within that context. ?The practicality of that position > comes down to the IUCr and the PDB working out workflows and the necessary > support infrastructure to get the entire small and macromlecular CIF > community to make the transition. ?Do the IUCr journal operation and the > PDB have plans and a realistic timeline to do this? ?If not, then are CIF2 > and dREL to wait on the sidelines until then? 
?If so, may we please see > them to get a sense of when the transition could happen? > > I suspect we are 90% of the way to where we need to be, but in the same > sense as 90% is used in the old saw: ?The first 90% of the effort takes > the first 90% of the time, and the last 10% of the effort takes the last > 90% of the time. ?But I may be wrong. ?Let's see the plans for the > proposed transition for the IUCr journal operation and the PDB, dealing > with both core CIF and mmCIF in a UTF8 CIF2 world. > > Regards, > ? Herbert > > ===================================================== > ?Herbert J. Bernstein, Professor of Computer Science > ? ?Dowling College, Kramer Science Center, KSC 121 > ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > > ? ? ? ? ? ? ? ? ?+1-631-244-3035 > ? ? ? ? ? ? ? ? ?yaya at dowling.edu > ===================================================== > > On Mon, 27 Sep 2010, Herbert J. Bernstein wrote: > > > Dear Colleagues, > > > > ? John declines to join in a meeting because ... > > > > "Is there anything to gain? ?The last few days have been more illuminating > > than the last several weeks, but it still seems evident to me that there > > is a fundamental difference of opinion. ?I will not support an alternative > > that fails to make UTF-8 a universally supported character encoding for > > CIF, and it seems clear that James will not, either. ?You seem adamant > > that there be no such universal requirement. ?I think I understand your > > position better than I used to do, but I don't see where there is any > > scope for a consensus. ?My best offer is already on the table in option > > (5) +- UTF-16." > > > > I a very sorry to hear that. ?I hope the rest of you will have the > > courtesy to participate in a Skype meeting. ?Perhaps no new facts > > or logic will come to light. ?Perhaps something will come to light > > that leads to better common understanding and concensus. ?We'll never know > > unless we try. 
?I for one think that most of us are open minded and > > willing to try to reach an accomodation that serves the community well. > > > > ? Regards, > > ? ? Herbert > > > > ===================================================== > > ?Herbert J. Bernstein, Professor of Computer Science > > ? ?Dowling College, Kramer Science Center, KSC 121 > > ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > > > > ? ? ? ? ? ? ? ? ?+1-631-244-3035 > > ? ? ? ? ? ? ? ? ?yaya at dowling.edu > > ===================================================== > > > > On Mon, 27 Sep 2010, Bollinger, John C wrote: > > > >> Dear Herb, > >> > >> On Monday, September 27, 2010 8:45 AM, Herbert J. Bernstein wrote: > >> > >>> The problem is that options 3,4 and 5 specifically prescribe the use of > Unicode characters (that is the entire point of those options -- and that is > the point in dispute -- whether we should be prescribing UTF8 or using is as we > now use ASCII, as a way to be clear what we are talking about as in CIF1) and > we simply are not ready to deal such a requirement yet. > >> > >> I think I have reached my own epiphany regarding your position. ?Do correct > me if I am wrong, but I now think you're saying that you don't want to > distinguish any particular encoding(s) as universally acceptable (much less > universally required), correct? ?If so, would it be fair to describe that as > just the "local" part of option 5? > >> > >> [...] > >> > >> On Monday, September 27, 2010 12:07 PM, Herbert J. Bernstein wrote: > >> > >>> Ah, now I begin to understand the difference in our view. ?I view CIF for > >>> journal use and PDB deposition as having a controlled vocabulary, via > >>> combinations of dictionaries, advice to authors, deposition standards, > >>> etc. ?You seem to few CIF as allowing completely arbitrary, uncontrolled > >>> text. ?[...] > >> > >> Yes, I intentionally take an unilluminated view of the problem, but that is > both purposeful and useful. 
?Text is the foundation on which CIF is built. ?The > bulk of the spec is devoted to defining which text conforms and which does not. > ?The "F" in "CIF" stands for "file," however, and if the spec is to answer the > question of which *files* conform, or the related question of what a particular > file means, then it needs to address the mapping between "text" and "file". > ?Options (1) and (2) seem crafted specifically to avoid doing so. > >> > >> I understand using local convention to fill the gap (ala "local"), but I > fail to see how any amount of author instructions, deposition standards, etc. > can adequately do the same. ?At best that moves a burden that rightfully should > be borne by the format spec onto application-dependent external documents, some > outside IUCr's control. ?I have shown by my advocacy for option (5) that I am > willing to make the definition of a conformant CIF system-dependent. ?I > acknowledge that that various applications place different demands on the data > content of CIFs they consume. ?I am not, however, willing to make the basic > definition of CIF conformance application-dependent. > >> > >>> Please note that proposals 1 and 2 do _not_ affect "which > >>> byte-sequence representations of those characters will conform to > >>> CIF2, under which circumstances" because they are not rigidly > >>> prescriptive about any > >>> particular byte sequences. > >> > >> Options (1) and (2) certainly DO affect that question, if only by leaving it > open to later, possibly conflicting, interpretation by COMCIFS, individual > developers, and others. ?Option 5 is about as permissive as it reasonably can > be regarding the binary form a CIF may take, while still being definitive > enough that general-purpose software can be written to read conformant CIFs. > ?If my new understanding of your viewpoint is correct, however, then your > objection may be that option 5 is *too* permissive on account of its explicit > allowance for UTF-8 and UTF-16. 
I would be willing to drop the explicit UTF-16 > support (though UTF-16 might nevertheless squeeze in as "local" in some > environments). I will under no circumstances, however, support an alternative > that allows any file to be found non-conformant on account of its being encoded > in UTF-8. > >> > >> [...] > >> > >>> This is really getting out of hand. We need a meeting. If > >>> everyone will send me their Skype id's, I will volunteer to > >>> set up a Skype conference call at some time that works for > >>> everybody (which I suspect will be 4 am EDT). My guess is that > >>> 1-2 hours of polite discussion will resolve this. What > >>> do we have to lose? > >> > >> Is there anything to gain? The last few days have been more illuminating > than the last several weeks, but it still seems evident to me that there is a > fundamental difference of opinion. I will not support an alternative that > fails to make UTF-8 a universally supported character encoding for CIF, and it > seems clear that James will not, either. You seem adamant that there be no > such universal requirement. I think I understand your position better than I > used to do, but I don't see where there is any scope for a consensus. My best > offer is already on the table in option (5) +- UTF-16. > >> > >> > >> Respectfully, > >> > >> John > >> -- > >> John C. Bollinger, Ph.D. > >> Department of Structural Biology > >> St.
Jude Children's Research Hospital > >> > >> > >> Email Disclaimer: www.stjude.org/emaildisclaimer > >> > >> _______________________________________________ > >> cif2-encoding mailing list > >> cif2-encoding at iucr.org > >> http://scripts.iucr.org/mailman/listinfo/cif2-encoding > >> > > > > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 > > From John.Bollinger at STJUDE.ORG Tue Sep 28 14:57:27 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 28 Sep 2010 08:57:27 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE5@SJMEMXMBS11.stjude.sjcrh.local> Dear Herb, On Monday, September 27, 2010 5:19 PM, Herbert J. Bernstein wrote: > I hope the rest of you will have the >courtesy to participate in a Skype meeting. Perhaps no new facts >or logic will come to light. Perhaps something will come to light >that leads to better common understanding and consensus. We'll never know >unless we try.
I for one think that most of us are open minded and >willing to try to reach an accommodation that serves the community well. I apologize for any discourtesy you perceive, and I assure you that none is intended. At the same time, I am confident that the opportunities for careful consideration, research, and revision inherent in the written form and extended time frame of our discussion to date have already afforded all of us ample opportunity to communicate our positions, to explore each other's, and to attempt to reach a consensus compromise. I have no general objection to conference calls or face-to-face meetings as forums for discussion, and I have participated in many of both. I do not, however, see anything to be gained by moving this discussion to such a venue at this point. How will there be anything other than more of the "endless repetition" of positions you so deplore? What topic do you wish to discuss that we have not already covered in detail? If there is such a topic then we can save the discussion for a call, but it will go better if we all have the opportunity to prepare. As for being "open minded and willing to try to reach an accommodation that serves the community well," I point to the record of our discussion. If my efforts to find such an accommodation are not plainly evident, if my attempts to understand the various positions and viewpoints are not clearly visible, and if my willingness to consider alternative approaches is not a matter of record, then I shall have to accept your insinuation. I deeply regret that I have left that impression on you. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From yaya at bernstein-plus-sons.com Tue Sep 28 16:07:27 2010 From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Tue, 28 Sep 2010 11:07:27 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE5@SJMEMXMBS11.stjude.sjcrh.local> References: <613218.81205.qm@web87011.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDDE@SJMEMXMBS11.stjude.sjcrh.local> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <206078.51827.qm@web87010.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE0@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE5@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear John, The topic I wish to discuss is how to move CIF2 forward. The norm I am used to in standards work is one of continuity and gradual change. I am used to the approach of taking existing features and deprecating them for some period of time (often a few years) rather than dropping them abruptly. I find many aspects of the current approach to CIF2 troubling because of the discontinuity from CIF1 practice without such a transition. My motion is an effort to try to bring the transition to CIF2 more in line with the traditional standards approach of gradual change through deprecation of non-Unicode encodings rather than trying to abruptly wipe them out. XML has not managed to complete that transition after many years of favoring UTF8+UTF16, and we have much weaker support infrastructure than does XML. So, the point of the meeting is not so much to refight well-discussed technical issues, but to do the critical work of finding a process to allow the community to move forward with CIF2 in a way that actually works. I think my motion does that. I would suggest you reread it.
It stops just short of formally deprecating non-Unicode encodings -- very similar to the current approach in XML. My guess is that that is as far as we can go right now. Maybe by next summer it would be possible to actually formally deprecate non-Unicode encodings. I doubt it, but I could turn out to be wrong. Having a meeting should help to clarify this. Regards, Herbert =============================================================== Proposed position on CIF2 character encodings submitted to COMCIFS for a vote as an interim agreement on what can be agreed thus far, subject to extension and refinement in the future. =============================================================== Reference to character(s) means abstract characters assigned code points by Unicode. Specific characters are referenced according to Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit hexadecimal representation of the assigned code point. The designated character encoding for CIF2 is UTF-8 as the preferred concrete representation of the information in a CIF2 document. Reference to ASCII characters means characters U+0000 through U+007F, or, equivalently, the first 128 characters of the ISO-8859-1 (LATIN-1) character set. Reference to newline or \n means the sequence that conventionally terminates a line record (which is environment dependent). Reference to whitespace means the characters ASCII space (U+0020), ASCII horizontal tab (U+0009) and the newline characters. Without regard to local convention, the various other characters that Unicode classifies as whitespace (character categories Zs and Zp) do not constitute whitespace for the purposes of CIF2. CIF2 files are standard variable length text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.
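Purely as an illustration, and not part of the proposed text: the terminology above (U+xxxx notation, the ASCII range, the narrow definition of whitespace, and the 2048-character line limit) could be checked with a sketch like the following. The function names are hypothetical, and treating LF, CR and CR LF as the newline sequences is an assumption made here so that the example is self-contained.

```python
from typing import List

# Limit taken from the proposal text; everything else here is illustrative.
MAX_LINE_LENGTH = 2048

# ASCII space and horizontal tab, the only non-newline whitespace allowed.
INLINE_WHITESPACE = {"\u0020", "\u0009"}

# ASSUMPTION: newline characters are taken to be LF and CR for this sketch.
NEWLINE_CHARS = {"\n", "\r"}


def code_point_label(ch: str) -> str:
    """Format a character in the U+xxxx[x[x]] convention (4 to 6 hex digits)."""
    return f"U+{ord(ch):04X}"


def is_ascii(ch: str) -> bool:
    """True for characters U+0000 through U+007F."""
    return ord(ch) <= 0x7F


def is_cif2_whitespace(ch: str) -> bool:
    """True only for space, tab and newline characters.

    Other Unicode space characters (categories Zs and Zp, for example
    U+00A0 or U+2029) are deliberately not treated as whitespace.
    """
    return ch in INLINE_WHITESPACE or ch in NEWLINE_CHARS


def overlong_lines(text: str) -> List[int]:
    """Return 1-based numbers of lines longer than MAX_LINE_LENGTH characters."""
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return [
        number
        for number, line in enumerate(normalised.split("\n"), start=1)
        if len(line) > MAX_LINE_LENGTH
    ]
```

Note that the limit is counted in characters rather than bytes, so a line of 2048 non-ASCII characters would pass this check even though its UTF-8 form is longer than 2048 bytes.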
References to Unicode and UTF-8 are specifically to identify characters and a concrete representation of those characters in an established and widely available standard. It is understood that CIF2 documents may be constructed and maintained on computers that implement other character encodings. However, for maximum portability only the clearly identified equivalents to the Unicode characters identified above and below should be used and use of UTF-8 for a concrete representation is highly recommended. A CIF2 file is uniquely identified by a required magic code at the beginning of its first line. The code is #\#CIF_2.0, followed immediately by whitespace. The addition of further information to assist in disambiguation among multiple character sets is under discussion. Encodings, such as UTF-16, which prefix a file by a BOM (byte order mark) or other encoding disambiguation prefix are not precluded. In such a case, the magic code should follow the encoding disambiguation prefix. In keeping with XML restrictions we allow the characters U+0009 U+000A U+000D U+0020 -- U+007E U+00A0 -- U+D7FF U+E000 -- U+FDCF U+FDF0 -- U+FFFD U+10000 -- U+10FFFD In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is any hexadecimal digit are disallowed. Unicode reserves the code points E000 - F8FF for private use. The IUCr and only the IUCr may specify what characters are assigned to these code points in the context of CIF2. CIF2 processors are required to treat <U+000A>, <U+000D> and <U+000D U+000A> as newline characters, by normalising them to <U+000A> on read. No other characters or character sequences may represent newline. In particular, CIF2 processors should not interpret the Unicode characters U+2028 (line separator) or U+2029 (paragraph separator) as newline. ===================================================== Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu =====================================================
From John.Bollinger at STJUDE.ORG Tue Sep 28 17:55:34 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 28 Sep 2010 11:55:34 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <646265.82162.qm@web87004.mail.ird.yahoo.com> References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> On Tuesday, September 28, 2010 4:54 AM, SIMON WESTRIP wrote: [...] >So I think the 'As for CIF1...' proposals with this explicit default encoding is certainly >heading towards a workable compromise. Herbert is unhappy to mandate a particular encoding >for non-ASCII use, but has agreed to recommend UTF8 and UTF16 in such cases.
>Such recommendations along with a default encoding that should be adopted in the absence of >any pointers to the contrary could boil down to UTF8/16 + local in all intents and purposes, >and could boil down to UTF8/16 if you want to use non-ASCII text. Recommending UTF-8 and / or UTF-16 without mandating support for one or both does not get us where I insist we need to be. In particular, the point of requiring support for at least one specific encoding applicable to the entire CIF2 character repertoire is to provide a means *wholly within the standard* by which conforming parties can be certain of communicating arbitrary CIF content accurately. The various Unicode Transformation Formats have additional desirable properties in that regard that we have covered extensively. If establishing UTF-8 as the default encoding confers a mandate to support it then where indeed is the great distinction between 'As for CIF1...' and UTF-8 (+- UTF-16) + local? If there is one then it can only be in the definition of "text" on the one hand and "local" on the other, which is to say in the details of support for non-UTF-x encodings. That is an area where perhaps we could find a consensus, or at least a strong majority opinion. For that to happen, I require definitions of "text" and "text file" sufficient to program to. James has asked for the same. "local" already provides such definitions, intended to cover the cases that CIF1 allows and UTF-8 +- UTF-16 does not. Are there cases it misses that should be covered? Are there cases it covers that should be missed? Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. 
Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From simonwestrip at btinternet.com Tue Sep 28 20:19:22 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Tue, 28 Sep 2010 12:19:22 -0700 (PDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <310698.3619.qm@web87010.mail.ird.yahoo.com> Dear John My main difficulty with the term 'local' is that its difference from 'any encoding' is possibly too subtle to find a place in a specification of a standard that is not novel (i.e. CIF2 follows CIF1). I've used Herbert's actual 'As for CIF1...' description as a basis to build upon partly because it is modelled on this established standard. Perhaps it might further your cause to present your proposal more completely, and define "text" and "text file" sufficient to program to. Certainly, although I think I understand your use of "local", I would like to see it presented as part of a full specification of CIF encoding so that I might determine more clearly how the concept will be received by current CIF users/developers. Cheers Simon
From yaya at bernstein-plus-sons.com Tue Sep 28 20:40:46 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Tue, 28 Sep 2010 15:40:46 -0400 (EDT) Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> References: <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear John, The norm in standards work is to deprecate features for a while (at least months and preferably years) before you remove them. > Recommending UTF-8 and / or UTF-16 without mandating support for one or > both does not get us where I insist we need to be. The problem is coming to agreement on "support" and that pesky word "mandating". Up until now in order for a CIF application developer or user to produce compliant CIFs, all they had to do was to produce a text file in whatever encoding was provided on their system. Now you wish to mandate that they be able to produce UTF8 or UTF16, even if they are running on some code-page based system. It is fine to recommend to them that they do this.
It is fine to tell them that CIF compliance via some non-Unicode encoding is about to be deprecated, so they should take the issue seriously, but it is most definitely _not_ fine to mandate that they make the change _now_ because we are impatient, and don't want to go through the normal standards process of deprecating a feature before removing it. We have already made that mistake with other CIF2 features, e.g. the drastic change in string quoting. At least with the string quoting change we can easily provide portable conversion software and compensate for the discourtesy of failing to provide a transition period during which the old quoting convention is deprecated (provided we add the concatenation operator to the CIF2 spec). In addition, we have the excuse of there being a reasonable need to harmonize the string quotation process between the CIF itself and DDLm to have dREL function smoothly. What is our excuse for impatience on the encoding issue? We don't have any reasonable prospect of clean software support for the encoding transition. We don't even have a coherent document yet. John, you seem to be doing a lot of "insisting" and "requiring". Precisely which existing CIF workflows are going to be negatively impacted if your demands are not met? What problem is being solved? The motion I have proposed does not make anything worse for anybody currently using CIF and allows them to start moving into CIF2 now. Your approach imposes conditions it will take months or years to meet with no prospect that satisfying your demands will solve any problem for anybody. Please rethink your position. If we recommend UTF8/UTF16 support we have a decent chance that somebody will simply provide it. If we mandate UTF8/UTF16 support we force pointless delays in the adoption of the rest of CIF2 and gain what in exchange? Regards, Herbert P.S.
A partial answer to your question about text encodings is at http://en.wikipedia.org/wiki/Character_encoding However, the real answer (not a joke) is that a text encoding is whatever the formatted I/O system in a Fortran compiler on the system under discussion reads and writes or the format of a COBOL EBCDIC-sequential file or a COBOL ASCII line-sequential file, or what a text editor on the system handles. That is the point -- text is something very, very system and language dependent. The strange thing is that text files have a much longer practical survival time than binary files, as backwards as that may seem, because there is a much larger investment in ensuring the continued readability of text files than of binary files. ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu =====================================================
From John.Bollinger at STJUDE.ORG Tue Sep 28 21:56:24 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 28 Sep 2010 15:56:24 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <310698.3619.qm@web87010.mail.ird.yahoo.com> References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <310698.3619.qm@web87010.mail.ird.yahoo.com> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE8@SJMEMXMBS11.stjude.sjcrh.local> On Tuesday, September 28, 2010 2:19 PM, SIMON WESTRIP wrote: > Perhaps it might further your cause to >present your proposal more completely, and define "text" and "text file" sufficient to program to. > >Certainly, although I think I understand your use of "local", I would like to see it presented as part of >a full specification of CIF encoding so that I might determine more clearly how the concept will be received >by current CIF users/developers. Fair enough. Here are the relevant excerpts of a revised full draft of the CIF2 Changes document. I had intended to hold off on this, pending possible adoption of the underlying concept, but as long as you asked: ------ TERMINOLOGY Reference to character(s) means abstract characters assigned code points by Unicode. [...]
Reference to CIF text means a sequence of characters collectively complying with the CIF syntax described in this specification. CIF text is the unparsed logical content of a CIF, independent of any mechanism for representing that content concretely. [...] CHANGE X - REFINEMENT to CIF1 Persistent Form. [...] (3) CIF/local CIF text is serialized according to this form by encoding it via the default text conventions for its environment, including not only character encoding scheme, but also the newline representation and possibly other details. The specific meaning of "default text conventions" is system-dependent. They are the conventions employed by the system's standard text editors when creating new files, the conventions the system's I/O libraries assume for 'formatted' or 'text' files when not explicitly instructed differently, etc.. ------ I am not emotionally tied to that exact wording nor even to all of the details, but I would be reasonably satisfied with them if adopted as-is. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From John.Bollinger at STJUDE.ORG Tue Sep 28 23:21:19 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 28 Sep 2010 17:21:19 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: References: <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> Dear Herb, On Tuesday, September 28, 2010 2:41 PM, Herbert J. 
Bernstein > The norm in standards work is to deprecate features for a while >(at least months and preferably years) before you remove them. I acknowledge that principle, and I see no incompatibility between it and option 5. More below. Do not overlook my final comments. >> Recommending UTF-8 and / or UTF-16 without mandating support for one or >> both does not get us where I insist we need to be. > >The problem is coming to agreement on "support" and that pesky word >"mandating". By "mandating support" I mean that a file containing a sequence of characters conforming to the CIF syntax and encoded via UTF-8 is defined to be a conformant CIF everywhere. By itself, that would not obligate anyone to encode their CIFs in UTF-8. It would, however, mean that fully-conformant CIF2 readers must be prepared to accept CIFs encoded in that manner. Even that is no barrier to adoption, though, for CIF users must be prepared to deal with the encoding question under any alternative on the table, and if they can read only their local encoding then they would need to be able to transcode in any event. > Up until now in order for a CIF application developer or >user to produce compliant CIFS, all they had to do was to produce a text >file in whatever encoding was provided on their system. Now you wish to >mandate that they be able to produce UTF8 or UTF16, even if they are >running on some code-page based system. Not at all. It is the single objective of the "+ local" provision of my preferred alternative to enable application developers, authors, and anyone else to continue to do exactly what you describe them already doing. [...] >We have already made that mistake with other CIF2 features, e.g. the >drastic change in string quoting. I agree with you that such changes are mistaken. That was my motivation in questioning UTF-8 only to begin with. [...] 
>The motion I have proposed does not make anything worse for anybody >currently using CIF and allows them to start moving into CIF2 now. Neither your motion nor my preferred one make anything worse for anybody using CIF1, and both allow them to start moving into CIF2 now. > Your >approach imposes conditions it will take months or years to meet with no >prospect that satisfying your demands will solve any problem for anybody. My approach imposes no special conditions, but it offers the advantages of UTF-8 as an available standard feature. As long as we are relying on "The norm in standards work," wouldn't you agree that it is normal to introduce new features to a standard no later than the time the features they supersede are deprecated? >Please rethink your position. I have considered my position carefully, and rethought it several times over the course of our discussion. I firmly believe that I am advocating a solid and eminently workable compromise between support of the existing CIF1 base and the future needs of CIF2 users. >If we recommend UTF8/UTF16 support we have a decent chance that somebody >will simply provide it. If we mandate UTF8/UTF16 support we force >pointless delays in the adoption of the rest of CIF2 and gain what in >exchange? Even if CIF2 ended up UTF-8 only, people could write software exactly as they would do under your proposal, then wrap it in a transcoder. Or in that event I think it likely that some would implement my preferred alternative (5) as an extension. Perhaps you would agree, as that's the same end result that you think likely somebody will simply provide, coming from the opposite direction. I see no reason to fear any significant delays in CIF2 adoption arising from any particular result this discussion may ultimately reach. [...] 
>However, the real answer (not a joke) is that a text encoding is whatever
>the formatted I/O system in a Fortran compiler on the system under
>discussion reads and writes or the format of a COBOL EBCDIC-sequential
>file or a COBOL ASCII line-sequential file, or what a text editor on the
>system handles. That is the point -- text is something very, very system
>and language dependent. The strange thing is that text files have a much
>longer practical survival time than binary files, as backwards as that may
>seem, because there is a much larger investment in ensuring the continued
>readability of text files than of binary files.

I am laughing, but not because I think you're joking. As far as I can tell, that answer is functionally identical to what I have been advocating as "local". It's even worded similarly. My desire to include it (but not to be limited to it) is the primary difference between James's most preferred position and mine.

Best Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From John.Bollinger at STJUDE.ORG Tue Sep 28 23:25:51 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Tue, 28 Sep 2010 17:25:51 -0500
Subject: [Cif2-encoding] Let's all take a deep breath.... .
In-Reply-To:
References:
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDEA@SJMEMXMBS11.stjude.sjcrh.local>

On Sunday, September 26, 2010 6:47 PM, James Hester wrote:

>I am however
>unhappy that both Brian and Simon introduced new concerns and nobody
>has had a chance to comment on how the various proposals under
>consideration might affect those concerns. I would therefore like to
>suggest that the voting period continues until the end of this week,
>and that we all endeavour to express any concerns or comments that we
>need to make in a timely fashion.

I have responded to Simon's new concerns, I think, but not to Brian's.
Supplemental to James's well-reasoned comments, then: On Friday, September 24, 2010 4:24 AM, Brian McMahon wrote: >I still feel this argument is at heart a "binary/text" >dichotomy, where "binary" implies that one can prescribe specific byte-level representations of every distinct character; "text" >implies that you're at the mercy of external libraries and mappings between encoding conventions - and those >mappings are not always explicit or easy to identify. That characterization of "text" sounds suspiciously similar to the "local" part of option 5 -- as it should, because the two attempt to describe the same (I think) concept. I am open to alternative definitions, but I do not comprehend the apparent aversion to defining these terms. If they are so obvious as to not require definition, then providing definitions anyway will be simple and harmless. If not, then how else do we expect consumers of the spec to come to the same conclusion about what it means? >I sympathise greatly with James's desire for a prescriptive, "binary" >approach, but its corollary is that a CIF application must take full responsibility for expressing any supported extended character set (I mean accented Latin letters, Greek characters, Cyrillic or Chinese alphabets). I do not follow this logic, inasmuch as it seems to be about the CIF2 character repertoire, rather than about the encodings with which characters from that repertoire may be encoded. The character repertoire is not the subject of this debate. Relying on "text" to define allowed characters would mean that some reasonable content expressed in conformant CIF form on one system cannot be expressed in any conformant CIF form on another. For example, a CIF-format, Chinese-language journal article encoded in EUC-CN might be perfectly valid CIF in the journal office, but there would be no CIF-conformant way to represent it at all on a system whose definition of "text" does not accommodate Chinese characters. [...] 
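The EUC-CN example above can be checked directly. This is only a sketch: the helper name `can_represent` and the sample title are made up for illustration, and Python's `gb2312` codec is used as the stand-in for EUC-CN.

```python
def can_represent(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical article title ("crystal structure" in simplified Chinese).
title = "晶体结构"

for enc in ("gb2312", "utf-8", "latin-1", "cp1252"):
    print(enc, can_represent(title, enc))
```

The title fits in the Chinese encoding and in UTF-8, but has no form at all in the Western European code pages. That is the situation described above, where content that is conformant on one system cannot be expressed on another whose "text" has a smaller repertoire.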
>I put option 5 at the bottom because of the non-portability of a "local" encoding.

This is the part I understand least. "Text" is at least roughly equivalent to "local", and entirely as non-portable. Merely tagging CIFs with encoding information doesn't fix that very well, as we covered in the course of our discussion, particularly when doing so is optional. Moreover, even optional tagging is a feature only of choice 2, not choice 1.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From yaya at bernstein-plus-sons.com Wed Sep 29 02:28:25 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Tue, 28 Sep 2010 21:28:25 -0400
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local>
References: <613218.81205.qm@web87011.mail.ird.yahoo.com>
 <281388.90819.qm@web87012.mail.ird.yahoo.com>
 <463665.7127.qm@web87004.mail.ird.yahoo.com>
 <262880.46378.qm@web87002.mail.ird.yahoo.com>
 <476110.27334.qm@web87005.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local>
 <646265.82162.qm@web87004.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

John,

Now I am totally confused about what you are proposing and agree with Simon that what is needed is for you to state your proposal as the precise wording that you propose to insert and/or change in the current CIF2 change document "5 July 2010: draft of changes to the existing CIF 1.1 specification for public discussion".

If I understand your proposal correctly, the _only_ thing you are proposing that differs in any way from my proposed motion is a mandate that a CIF2 conformant reader must be able to read a UTF8 CIF2 file, but that _no_ CIF application would
actually be required to provide such code, provided there was some mechanism available to transcode from UTF8 to the local encoding, which does not seem to be a mandate on the conformant CIF2 reader at all, but a requirement for the provision of a portable utility to do that external transcoding.

If that is the case, wouldn't it make more sense to just provide that utility than to argue about whether my motion requires somebody to write their own? Having the utility in hand would avoid having multiple, conflicting interpretations of this input transcoding requirement.

If I have read your message correctly, please just write the utility you are proposing. If I have read your message incorrectly, please write the specification changes you propose for the draft changes in place of the changes in my motion.

_This_ is why it was, is, and will remain a good idea to simply have a meeting and talk these things out.

At 5:21 PM -0500 9/28/10, Bollinger, John C wrote:
>Dear Herb,
>
>On Tuesday, September 28, 2010 2:41 PM, Herbert J. Bernstein
>
>> The norm in standards work is to deprecate features for a while
>>(at least months and preferably years) before you remove them.
>
>I acknowledge that principle, and I see no incompatibility between
>it and option 5. More below. Do not overlook my final comments.
>
>>> Recommending UTF-8 and / or UTF-16 without mandating support for one or
>>> both does not get us where I insist we need to be.
>>
>>The problem is coming to agreement on "support" and that pesky word
>>"mandating".
>
>By "mandating support" I mean that a file containing a sequence of
>characters conforming to the CIF syntax and encoded via UTF-8 is
>defined to be a conformant CIF everywhere. By itself, that would
>not obligate anyone to encode their CIFs in UTF-8. It would,
>however, mean that fully-conformant CIF2 readers must be prepared to
>accept CIFs encoded in that manner.
>[...]
>Department of Structural Biology >St. Jude Children's Research Hospital > > >Email Disclaimer: www.stjude.org/emaildisclaimer > >_______________________________________________ >cif2-encoding mailing list >cif2-encoding at iucr.org >http://scripts.iucr.org/mailman/listinfo/cif2-encoding -- ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From jamesrhester at gmail.com Wed Sep 29 03:16:07 2010 From: jamesrhester at gmail.com (James Hester) Date: Wed, 29 Sep 2010 12:16:07 +1000 Subject: [Cif2-encoding] Addressing Simon's concerns Message-ID: Some comments on Simon's concerns as raised last week: Simon wrote: I have been developing a new docx template which the IUCr editorial office is shortly to release for use by authors. The template will be packaged with some tools to extract data from CIFs and tabulate them in the Word document, e.g. open an mmCIF, click a button, and standard tables populated with data from the CIF will be included in the document, acting as table templates for the author to edit as appropriate for their manuscript. Inclusion of the mmCIF tools is part of an unofficial policy to 'coax' biologists to start using/accepting mmCIF as a useful medium, rather than as a product of their deposition to the PDB, and to encourage them to become comfortable with passing mmCIFs between applications, and even to edit the things (in the same way as the core-CIF community treats CIFs). For example, our perception is that there is no reason why an author should not feel free to take an mmCIF that has been created by e.g. pdb_extract and populate it using third-party software before uploading to the PDB for deposition. 
This cause would not be furthered by effectively invalidating an mmCIF if it were not to be encoded in one of the specified encodings.

This cause would also not be furthered if the PDB or other colleagues are unable to figure out what the symbols in the text were without hassling the provider of the CIF (think of a phone conversation "OK, try this, now send it again....no, what about trying this format and send it again...that works, except that the Greek characters don't come out....")

I think a rough equivalent of what you are saying is "We would alienate biologists if they are unable to submit manuscripts in their native language". However, scientists are used to making some extra effort in order to achieve international communication. Furthermore, are there any macromolecular data exchange formats that allow characters from the Unicode range to be interchanged reliably? Isn't there a carrot as well as a (small) stick here?

So although I am uneasy about a specification that propagates uncertainty, I'm also uneasy about alienating users, especially when we are struggling to change their mindset as in the case of the biological community (my perception of the biological community's attitude to mmCIF is based on feedback from authors/coeditors to IUCr journals). Granted this may not be the most compelling argument in favour of 'any encoding', but recognizing the hurdles that may have to be overcome once we move beyond ASCII whatever the CIF2 specification, I support 'any encoding' as 'a means to an end'.

Just to make sure that I understand, you are concerned that third party software may take a UTF8 mmCIF template provided by the IUCr and populate it with further information, and at some stage transcode to a different encoding. By mandating UTF8, we are therefore forcing biologists to jump through more hoops than they would otherwise have to.
I don't see how that last sentence follows: surely those writing third party software are the ones that will be dealing with encoding issues, and as long as we say right now, at the beginning, that the encoding is UTF8/16, the software will be written as intended and biologists will be oblivious to the encoding.

all the best,
James.
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100929/54b18a8d/attachment.html

From simonwestrip at btinternet.com Wed Sep 29 08:17:01 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Wed, 29 Sep 2010 07:17:01 +0000 (GMT)
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE8@SJMEMXMBS11.stjude.sjcrh.local>
References: <526633.3484.qm@web87004.mail.ird.yahoo.com>
 <613218.81205.qm@web87011.mail.ird.yahoo.com>
 <281388.90819.qm@web87012.mail.ird.yahoo.com>
 <463665.7127.qm@web87004.mail.ird.yahoo.com>
 <262880.46378.qm@web87002.mail.ird.yahoo.com>
 <476110.27334.qm@web87005.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local>
 <646265.82162.qm@web87004.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local>
 <310698.3619.qm@web87010.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE8@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <967916.56129.qm@web87002.mail.ird.yahoo.com>

John,

I do not think a specification that suggests that a CIF can be invalidated simply by being moved to another environment is helpful to anyone.
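To make the concern about moving a file between environments concrete, here is a small sketch (not from the thread; the helper `read_as`, the sample data item, and the choice of code pages are all illustrative assumptions). The same bytes are read under three different ideas of what "text" means.

```python
def read_as(data: bytes, encoding: str) -> str:
    """Decode file bytes under the given environment's text encoding."""
    return data.decode(encoding)


# A CIF-like line written on a system whose local encoding is Windows-1252.
written = "_publ_author_name 'Å. Ångström'\n".encode("cp1252")

same_place = read_as(written, "cp1252")   # round-trips as intended
elsewhere = read_as(written, "cp437")     # decodes without error, wrong characters

print(ascii(same_place))
print(ascii(elsewhere))

try:
    read_as(written, "utf-8")
except UnicodeDecodeError as exc:
    # A UTF-8 reader at least notices that the bytes are not UTF-8.
    print("not valid UTF-8:", exc.reason)
```

Under a purely "local" rule the second read succeeds silently with different characters, which is the portability worry raised in this thread. A reader expecting UTF-8 rejects the file outright instead.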
Cheers Simon ________________________________ From: "Bollinger, John C" To: Group for discussing encoding and content validation schemes for CIF2 Sent: Tuesday, 28 September, 2010 21:56:24 Subject: Re: [Cif2-encoding] How we wrap this up On Tuesday, September 28, 2010 2:19 PM, SIMON WESTRIP wrote: > Perhaps it might further your cause to >present your proposal more completely, and define "text" and "text file" >sufficient to program to. > >Certainly, although I think I understand your use of "local", I would like to >see it presented as part of >a full specification of CIF encoding so that I might determine more clearly how >the concept will be received >by current CIF users/developers. Fair enough. Here are the relevant excerpts of a revised full draft of the CIF2 Changes document. I had intended to hold off on this, pending possible adoption of the underlying concept, but as long as you asked: ------ TERMINOLOGY Reference to character(s) means abstract characters assigned code points by Unicode. [...] Reference to CIF text means a sequence of characters collectively complying with the CIF syntax described in this specification. CIF text is the unparsed logical content of a CIF, independent of any mechanism for representing that content concretely. [...] CHANGE X - REFINEMENT to CIF1 Persistent Form. [...] (3) CIF/local CIF text is serialized according to this form by encoding it via the default text conventions for its environment, including not only character encoding scheme, but also the newline representation and possibly other details. The specific meaning of "default text conventions" is system-dependent. They are the conventions employed by the system's standard text editors when creating new files, the conventions the system's I/O libraries assume for 'formatted' or 'text' files when not explicitly instructed differently, etc.. 
------

I am not emotionally tied to that exact wording nor even to all of the details, but I would be reasonably satisfied with them if adopted as-is.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

_______________________________________________
cif2-encoding mailing list
cif2-encoding at iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100929/29deb50d/attachment-0001.html

From simonwestrip at btinternet.com Wed Sep 29 09:31:48 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Wed, 29 Sep 2010 08:31:48 +0000 (GMT)
Subject: [Cif2-encoding] Addressing Simon's concerns
In-Reply-To:
References:
Message-ID: <87249.7229.qm@web87004.mail.ird.yahoo.com>

Dear James

I was *not* suggesting that "We would alienate biologists if they are unable to submit manuscripts in their native language". Rather, I would like them to start working with mmCIF and feel happy to do so, whether it be one of the 60000+ mmCIFs in the PDB archives, or one of the future mmCIF2s (if such a thing comes into being). So I concluded that the less difference between CIF1 and CIF2 as perceived by the user, the better.

This particular message was an attempt to explain why I had switched to a more flexible approach. To quote myself:

"Granted this may not be the most compelling argument in favour of 'any encoding', but recognizing the hurdles that may have to be overcome once we move beyond ASCII whatever the CIF2 specification, I support 'any encoding' as 'a means to an end'."

As it stands, the mmCIF user group is very different from the core CIF user group and the adoption of 'specified' encodings by the PDB is likely to have far less impact on users than the adoption of 'specified' encodings by the IUCr.
My use of mmCIFs as an example was not intended to sway anyone - it just happened to be the consideration that prompted me to request that I could switch my vote.

Cheers

Simon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100929/81324f99/attachment.html

From simonwestrip at btinternet.com Wed Sep 29 10:17:42 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Wed, 29 Sep 2010 09:17:42 +0000 (GMT)
Subject: [Cif2-encoding] Addressing Simon's concerns
In-Reply-To: <87249.7229.qm@web87004.mail.ird.yahoo.com>
References: <87249.7229.qm@web87004.mail.ird.yahoo.com>
Message-ID: <398095.75403.qm@web87008.mail.ird.yahoo.com>

Just to clarify my position because I may appear to be too much of a 'floating voter':

I can only support 'specified' encoding (i.e. restricted) if CIF2 is presented as and understood to be something *very* distinct from CIF1. I think the reality is that this distinction would not be apparent or respected, and so a more flexible approach should be taken to allow CIF1 to evolve into CIF2.

I cannot accept any specification of CIF that suggests that a CIF could cease to be a CIF just because it is moved to an environment where the local encoding is different from the originating encoding. Though I think I understand the motive for this 'local' description, I think its specification could be open to ridicule.

Cheers

Simon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100929/db90a89c/attachment-0001.html From bm at iucr.org Wed Sep 29 11:23:24 2010 From: bm at iucr.org (Brian McMahon) Date: Wed, 29 Sep 2010 11:23:24 +0100 Subject: [Cif2-encoding] Addressing Brian's concerns In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDEA@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDEA@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <20100929102324.GA24670@emerald.iucr.org> Thanks to James for taking the time to describe the text-encoding functionality already available through Qt. It helps me to understand how a CIF-authoring application will work in diverse locales. I'll also test the robustness of juffed if I can locate one of our problematic PDFs on my SMB drive :-) As you say, the discussion doesn't greatly help to tip the scales, though what does help is the comment about file import being more reliable if a CIF has a known unique encoding (or small set of automatically distinguishable encodings). Thanks also to John for his reply. I want to respond briefly to just one point here. >> I put option 5 at the bottom because of the non-portability of >> a "local" encoding. > > This is the part I understand least. "Text" is at least roughly > equivalent to "local", and entirely as non-portable. Merely tagging > CIFs with encoding information doesn't fix that very well, as we > covered in the course of our discussion, particularly when doing > so is optional. You're right - in my mind "text" and "local" are essentially the same thing, but I have understood "local" to imply that no information about the actual encoding algorithms are available if such a file is transported elsewhere. Maybe that's not so. I do see some value in embedding an encoding declaration as a hint, albeit the hint cannot be completely relied upon. 
By making it optional, I hope that an *application* that writes the hint has a reasonable possibility of getting it right (whereas mandating it might encourage a user to invent a random encoding declaration if one is absent). Best wishes Brian On Tue, Sep 28, 2010 at 05:25:51PM -0500, Bollinger, John C wrote: > > On Sunday, September 26, 2010 6:47 PM, James Hester wrote: > > >I am however > >unhappy that both Brian and Simon introduced new concerns and nobody > >has had a chance to comment on how the various proposals under > >consideration might affect those concerns. I would therefore like to > >suggest that the voting period continues until the end of this week, > >and that we all endeavour to express any concerns or comments that we > >need to make in a timely fashion. > > I have responded to Simon's new concerns, I think, but not to Brian's. Supplemental to James's well-reasoned comments, then: > > On Friday, September 24, 2010 4:24 AM, Brian McMahon wrote: > > >I still feel this argument is at heart a "binary/text" > >dichotomy, where "binary" implies that one can prescribe specific byte-level representations of every distinct character; "text" > >implies that you're at the mercy of external libraries and mappings between encoding conventions - and those >mappings are not always explicit or easy to identify. > > That characterization of "text" sounds suspiciously similar to the "local" part of option 5 -- as it should, because the two attempt to describe the same (I think) concept. I am open to alternative definitions, but I do not comprehend the apparent aversion to defining these terms. If they are so obvious as to not require definition, then providing definitions anyway will be simple and harmless. If not, then how else do we expect consumers of the spec to come to the same conclusion about what it means? 
> > >I sympathise greatly with James's desire for a prescriptive, "binary" > >approach, but its corollary is that a CIF application must take full responsibility for expressing any supported extended character set (I mean accented Latin letters, Greek characters, Cyrillic or Chinese alphabets). > > I do not follow this logic, inasmuch as it seems to be about the CIF2 character repertoire, rather than about the encodings with which characters from that repertoire may be encoded. The character repertoire is not the subject of this debate. > > Relying on "text" to define allowed characters would mean that some reasonable content expressed in conformant CIF form on one system cannot be expressed in any conformant CIF form on another. For example, a CIF-format, Chinese-language journal article encoded in EUC-CN might be perfectly valid CIF in the journal office, but there would be no CIF-conformant way to represent it at all on a system whose definition of "text" does not accommodate Chinese characters. > > [...] > > >I put option 5 at the bottom because of the non-portability of a "local" encoding. > > This is the part I understand least. "Text" is at least roughly equivalent to "local", and entirely as non-portable. Merely tagging CIFs with encoding information doesn't fix that very well, as we covered in the course of our discussion, particularly when doing so is optional. Moreover, even optional tagging is a feature only of choice 2, not choice 1. > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. 
Jude Children's Research Hospital
>
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding

From bm at iucr.org Wed Sep 29 11:25:36 2010
From: bm at iucr.org (Brian McMahon)
Date: Wed, 29 Sep 2010 11:25:36 +0100
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <20100929102536.GB24670@emerald.iucr.org>

I think the crux of the issue is as follows:

[But part of our difficulty is that we are all having separate epiphanies, and focusing on five different "cruxes". Clarifying the real divergence between our views would be a genuine benefit of a Skype conference, to which I have no personal objection.]

In the real world, a need may arise to exchange CIFs constructed in non-canonical encodings. ("Canonical" probably means UTF-8 and/or UTF-16). Such a need would involve some transcoding strategy.

What is the actual likelihood of that need arising?

I would characterise James's position as "not very, and even less if the software written to generate CIFs is constrained to use canonical encodings within the standard".

I would characterise the position of the rest of us as "reasonable to high, so that we wish to formulate the standard in a way that recognises non-canonical encodings and helps to establish or at least inform appropriate transcoding strategies". There appear to be strong disagreements among us, but in fact there's a lot of common ground, and a drafting exercise would probably move us towards a consensus.

Do you agree that that is a fair assessment?
If so, we can analyse further: what are the implications of mandating a canonical encoding or not if judgement (a) is wrong and if judgement (b) is wrong? My feeling is that the world will not end - or even change very much - in any case; but it could determine whether we need to formulate an optimal transcoding strategy now, or can defer it to a later date. However, if anyone thinks this is just another diversion, I'll drop this line of approach so as not to slow things down even more. Regards Brian On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote: > John, > > Now I am totally confused about what you are proposing and agree with Simon > that what is needed for you to state your proposal as the precise wording > that you propose to insert and/or change in the current CIF2 change document > "5 July 2010: draft of changes to the existing CIF 1.1 specification > for public discussion" > > If I understand your proposal correctly, the _only_ thing you are proposing > that differs in any way from my proposed motion is a mandate that a > CIF2 conformant reader must be able to read a UTF8 CIF2 file, but > that _no_ CIF application would actually be required to provide such > code, provided there was some mechanism available to transcode from > UTF8 to the local encoding, > which does not seem to be a mandate on the conformant CIF2 reader at > all, but a requirement for the provision of a portable utility to > do that external transcoding. > > If that is the case, wouldn't it make more sense to just provide that > utility that to argue about whether my motion requires somebody to write > their own? Having the utility in hand would avoid having multiple, > conflicting interpretations of this input transcoding requirement. > > If I have read your message correctly, please just write the utility you > are proposing. 
If I have read your message incorrectly, please > write the specification changes you propose for the draft changes > in place of the changes in my motion. > > _This_ is why it was, is, and will remain a good idea to simply have > a meeting and talk these things out. > > > > At 5:21 PM -0500 9/28/10, Bollinger, John C wrote: > >Dear Herb, > > > >On Tuesday, September 28, 2010 2:41 PM, Herbert J. Bernstein > > > >> The norm in standards work is to deprecate features for a while > >>(at least months and preferably years) before you remove them. > > > >I acknowledge that principle, and I see no incompatibility between > >it and option 5. More below. Do not overlook my final comments. > > > >>> Recommending UTF-8 and / or UTF-16 without mandating support for one or > >>> both does not get us where I insist we need to be. > >> > >>The problem is coming to agreement on "support" and that pesky word > >>"mandating". > > > >By "mandating support" I mean that a file containing a sequence of > >characters conforming to the CIF syntax and encoded via UTF-8 is > >defined to be a conformant CIF everywhere. By itself, that would > >not obligate anyone to encode their CIFs in UTF-8. It would, > >however, mean that fully-conformant CIF2 readers must be prepared to > >accept CIFs encoded in that manner. Even that is no barrier to > >adoption, though, for CIF users must be prepared to deal with the > >encoding question under any alternative on the table, and if they > >can read only their local encoding then they would need to be able > >to transcode in any event. > > > >> Up until now in order for a CIF application developer or > >>user to produce compliant CIFS, all they had to do was to produce a text > >>file in whatever encoding was provided on their system. Now you wish to > >>mandate that they be able to produce UTF8 or UTF16, even if they are > >>running on some code-page based system. > > > >Not at all. 
It is the single objective of the "+ local" provision > >of my preferred alternative to enable application developers, > >authors, and anyone else to continue to do exactly what you describe > >them already doing. > > > >[...] > > > >>We have already made that mistake with other CIF2 features, e.g. the > >>drastic change in string quoting. > > > >I agree with you that such changes are mistaken. That was my > >motivation in questioning UTF-8 only to begin with. > > > >[...] > > > >>The motion I have proposed does not make anything worse for anybody > >>currently using CIF and allows them to start moving into CIF2 now. > > > >Neither your motion nor my preferred one make anything worse for > >anybody using CIF1, and both allow them to start moving into CIF2 > >now. > > > >> Your > >>approach imposes conditions it will take months or years to meet with no > >>prospect that satisfying your demands will solve any problem for anybody. > > > >My approach imposes no special conditions, but it offers the > >advantages of UTF-8 as an available standard feature. As long as we > >are relying on "The norm in standards work," wouldn't you agree that > >it is normal to introduce new features to a standard no later than > >the time the features they supersede are deprecated? > > > >>Please rethink your position. > > > >I have considered my position carefully, and rethought it several > >times over the course of our discussion. I firmly believe that I am > >advocating a solid and eminently workable compromise between support > >of the existing CIF1 base and the future needs of CIF2 users. > > > >>If we recommend UTF8/UTF16 support we have a decent chance that somebody > >>will simply provide it. If we mandate UTF8/UTF16 support we force > >>pointless delays in the adoption of the rest of CIF2 and gain what in > >>exchange? > > > >Even if CIF2 ended up UTF-8 only, people could write software > >exactly as they would do under your proposal, then wrap it in a > >transcoder. 
Or in that event I think it likely that some would > >implement my preferred alternative (5) as an extension. Perhaps you > >would agree, as that's the same end result that you think likely > >somebody will simply provide, coming from the opposite direction. > > > >I see no reason to fear any significant delays in CIF2 adoption > >arising from any particular result this discussion may ultimately > >reach. > > > >[...] > > > >>However, the real answer (not a joke) is that a text encoding is whatever > >>the formatted I/O system in a fortran compiler on the system under > >>discussion reads and writes or the format of a COBOL EBCDIC-sequential > >>file or a COBOL ASCII line-sequential file, or what a text editor on the > >>system handles. That is the point -- text is something very, very system > >>and language dependent. The strange thing is that text files have a much > >>longer practical survival time than binary files, as backwards as that may > >>seem, because there is a much larger investment in ensuring the continued > >>readbility of text files than of binary files. > > > >I am laughing, but not because I think you're joking. As far as I > >can tell, that answer is functionally identical to what I have been > >advocating as "local". It's even worded similarly. My desire to > >include it (but not to be limited to it) is the primary difference > >between James's most preferred position and mine. > > > > > >Best Regards, > > > >John > >-- > >John C. Bollinger, Ph.D. > >Department of Structural Biology > >St. Jude Children's Research Hospital > > > > > >Email Disclaimer: www.stjude.org/emaildisclaimer > > > >_______________________________________________ > >cif2-encoding mailing list > >cif2-encoding at iucr.org > >http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > -- > ===================================================== > Herbert J. 
Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya at dowling.edu
> =====================================================

From yaya at bernstein-plus-sons.com Wed Sep 29 15:16:24 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Wed, 29 Sep 2010 10:16:24 -0400 (EDT)
Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010
In-Reply-To: <20100929102536.GB24670@emerald.iucr.org>
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org>
Message-ID:

Dear Colleagues,

James and I are going to try a Skype conference call at 10:45 pm (22:45) AEST, 8:45 am EDT on Thursday, 30 September 2010. All are welcome to join in. If you wish to join, please email me your skype ID today. I will originate the call, which will be voice only (a skype constraint on conference calls). My skype id is yayahjb.

I'll keep my email open during the call, so if you just manage to get into Skype last minute either send me a Skype chat and/or an email with your skype id and I'll try to add you to the conference call.

If everything else fails, the land line I will be at is 1-631-286-1339, but that is just to try to coordinate things. I cannot cross-connect the landline to the skype conference.

The call will have to end by 9:45 am EDT so I can go teach a class.

Regards,
Herbert

For those in England, 8:45 am EDT is 13:45 BST (12:45 GMT)

=====================================================
Herbert J.
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From John.Bollinger at STJUDE.ORG Wed Sep 29 15:24:43 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Wed, 29 Sep 2010 09:24:43 -0500
Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010
In-Reply-To:
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDEB@SJMEMXMBS11.stjude.sjcrh.local>

I will be unable to attend, unless there is a way to link in a conventional telephone connection (cell or land line). It sounds like that is not possible. I must be in my office at the time of the call, where all access to Skype is blocked by our institutional firewall.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From jamesrhester at gmail.com Wed Sep 29 15:24:45 2010
From: jamesrhester at gmail.com (James Hester)
Date: Thu, 30 Sep 2010 00:24:45 +1000
Subject: [Cif2-encoding] A new(?) compromise position
Message-ID:

Here is a newish compromise:

Encoding: The encoding of CIF2 text streams containing only code points in the ASCII range is not specified. CIF2 text streams containing any code points outside the ASCII range must be encoded such that the encoding can be reliably identified from the file contents. At present only UTF8 and UTF16 are considered to satisfy this constraint.

Commentary: this is intended to mean that encoding works 'as for CIF1' (Proposals 1,2) for files containing only ASCII text, and works as for Proposal 4 for any other files.
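
The proposed rule can be read as a three-way classification of a byte stream. The following is only a minimal sketch of one possible reading, not wording from the proposal: the function name `classify_cif2_encoding` and the returned labels are hypothetical, UTF16 is assumed to be recognisable only through a leading byte-order mark, and "identifiable as UTF8" is approximated by "decodes as well-formed UTF8".

```python
import codecs


def classify_cif2_encoding(data: bytes) -> str:
    """Classify a byte stream under the proposed rule (hypothetical helper)."""
    # Assumption: UTF16 streams announce themselves with a byte-order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "utf-16"
        except UnicodeDecodeError:
            return "unidentified"
    # Only 7-bit bytes: the proposal leaves the encoding unspecified.
    if all(byte < 0x80 for byte in data):
        return "ascii-range"
    # Anything else must at least be well-formed UTF8 to be accepted.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unidentified"


if __name__ == "__main__":
    samples = {
        "plain CIF1-style text": b"data_test\n_cell.length_a 5.432\n",
        "Greek alpha as UTF8": "\u03b1".encode("utf-8"),
        "Greek alpha as UTF16": "\u03b1".encode("utf-16"),
        "Greek alpha as ISO 8859-7": "\u03b1".encode("iso8859-7"),
    }
    for label, raw in samples.items():
        print(label, "->", classify_cif2_encoding(raw))
```

A well-formedness check is a heuristic: bytes written in a legacy encoding can occasionally happen to form valid UTF8, so a real reader may want stricter evidence than this sketch uses.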
I believe that this allows legacy workflows to operate smoothly on CIF2 files (legacy workflows do not process non ASCII text) but also avoids the tower of Babel effect that will ensue if non-ASCII codepoints are encoded using local conventions. To explain the thinking further, perhaps I could take another stab at Herbert's point of view in my own words. Herbert (I think correctly) surmises that all currently used CIF applications do not explicitly specify the encoding of their input and output files, and so therefore are conceptually working with CIFs in a variety of local encodings. Mandating any encoding for CIF2 would therefore force at least some and perhaps most of these applications to change the way they read and write text, which is disruptive and obtuse when the system works fine as it is. Proposals 1 and 2 are aimed at avoiding this disruption. On the other hand, I look at the same situation and see that all this software is in fact reading and writing ASCII, because all of these local encodings are actually equivalent to ASCII for characters used in CIFs, and I further assert that this happy coincidence between encodings is the single reason CIF files are easily transferable between different systems. These two points of view create two different results if the CIF character repertoire is extended beyond the ASCII range. If we allow the current approach to encoding to continue, the happy coincidence of encodings ceases to operate outside the ASCII range and CIF files are no longer easily interchangeable. If we make explicit the commonality of CIF1 encodings by mandating a common set of identifiable encodings, the use of default encodings has to be abandoned with accompanying effort from programmers. I believe that this latest proposal respects Herbert's concerns as well as mine, and is eminently workable as a starting point for going forward. 
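
The "happy coincidence" described above can be checked directly. This is a small illustration rather than part of the proposal: the list of encodings is an arbitrary selection of ASCII-compatible ones (EBCDIC-style encodings are deliberately left out), and the helper name `byte_forms` and the sample strings are made up for the example.

```python
# Arbitrary selection of ASCII-compatible encodings, chosen for illustration.
ENCODINGS = ["ascii", "latin-1", "cp1252", "iso8859-7", "utf-8"]


def byte_forms(text, encodings):
    """Map each encoding name to the bytes it produces, or None if it cannot encode the text."""
    forms = {}
    for name in encodings:
        try:
            forms[name] = text.encode(name)
        except UnicodeEncodeError:
            forms[name] = None  # character not representable in this encoding
    return forms


# A CIF1-style line uses only ASCII characters.
ascii_forms = byte_forms("_cell.length_a 5.432", ENCODINGS)
# A single character outside the ASCII range (Greek small alpha).
greek_forms = byte_forms("\u03b1", ENCODINGS)

if __name__ == "__main__":
    print("distinct byte forms, ASCII text:", len(set(ascii_forms.values())))
    print("distinct byte forms, Greek alpha:", len(set(greek_forms.values())))
    for name, raw in greek_forms.items():
        print(f"  {name}: {raw!r}")
```

For the ASCII line every listed encoding yields the same bytes, which is why such files travel between systems without trouble; for the Greek letter the encodings either disagree or cannot represent it at all.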
I'm now off to do a sample change and expect unanimous support from all parties when I return in an hour's time :)

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From John.Bollinger at STJUDE.ORG Wed Sep 29 15:26:10 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Wed, 29 Sep 2010 09:26:10 -0500
Subject: [Cif2-encoding] How we wrap this up
In-Reply-To: <967916.56129.qm@web87002.mail.ird.yahoo.com>
References: <526633.3484.qm@web87004.mail.ird.yahoo.com> <613218.81205.qm@web87011.mail.ird.yahoo.com> <281388.90819.qm@web87012.mail.ird.yahoo.com> <463665.7127.qm@web87004.mail.ird.yahoo.com> <262880.46378.qm@web87002.mail.ird.yahoo.com> <476110.27334.qm@web87005.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <310698.3619.qm@web87010.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE8@SJMEMXMBS11.stjude.sjcrh.local> <967916.56129.qm@web87002.mail.ird.yahoo.com>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDEC@SJMEMXMBS11.stjude.sjcrh.local>

Simon,

On Wednesday, September 29, 2010 2:17 AM, SIMON WESTRIP wrote:
>John, I do not think a specification that suggests that a CIF can be invalidated simply by being moved to
>another environment is helpful to anyone.

In that case, you must be operating under a different definition of "text" than Herb provided yesterday:

On Tuesday, September 28, 2010 2:41 PM, Herbert J. Bernstein wrote:
>However, the real answer (not a joke) is that a text encoding is whatever
>the formatted I/O system in a fortran compiler on the system under
>discussion reads and writes or the format of a COBOL EBCDIC-sequential
>file or a COBOL ASCII line-sequential file, or what a text editor on the
>system handles. That is the point -- text is something very, very system
>and language dependent. [...]

The potential for confusion over the meaning of "text" was by far my greatest cause for concern about the "As for CIF1..."
alternatives, so I am very grateful to Herb for providing a definition. I am furthermore very pleased that his definition matches so well the one that I have advanced under the label "local", which I think is also the best interpretation of the requirements of CIF1. Even disregarding the definition of "text", however, CIF1 clearly holds that a CIF can indeed be invalidated simply by being moved to another environment. In particular, CIF1 expressly specifies that CIF processors are not required to understand non-native line termination sequences. I have used CIF1 processors on several platforms that do not do so. As has been observed several times, CIF1 has nevertheless served well for years. We would not be having this discussion now if it were not helpful to many people. I submit that among the options on the table, only (3) and (4) do not leave CIF2 CIFs susceptible to invalidation upon being moved to a different environment. These are not my overall preference, but I favor them over "text"-only because they permit use of UTF-8. Under the above definition of "text" and the "As for CIF1..." proposals, any recommendation that the spec might make to use UTF-8 and / or UTF-16 would be futile. Depending on the environment, either UTF-8(-16) would be required for conformance with the local definition of "text", or it would be forbidden as non-conforming (I disregard the case of ASCII-only CIFs for which the encoding could be construed as any ASCII-compatible encoding, including UTF-8). In most current environments, UTF-8 would be forbidden. As much as I join Herb in favoring support for "text" CIFs as he defines them, I remain convinced that UTF-8 must be a conformant option for CIF2 to move ahead. I think UTF-8 would be sufficient to cover most (but not all) of the cases for which "text" ensures support, thus my preference for options (3) and (4) over options (1) and (2). 
This is, again, the genesis of option (5), which I think now could be relabeled "text + UTF-8 (+- UTF-16)". Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From simonwestrip at btinternet.com Wed Sep 29 15:27:14 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Wed, 29 Sep 2010 14:27:14 +0000 (GMT) Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010 In-Reply-To: References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> Message-ID: <263248.7794.qm@web87008.mail.ird.yahoo.com> Dear Herbert I now have a Skype ID, but dont have the hardware yet (webcam... nor microphone/speakers come to think of it!) I'll see if I can get setup in time, but if not I hope you can make some progress... Cheers Simon ________________________________ From: Herbert J. Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Wednesday, 29 September, 2010 15:16:24 Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010 Dear Colleagues, James and I are going to try a Skype conference call at 10:45 pm (22:45) AEST, 8:45 am EDT on Thursday, 30 September 2010. All are welcome to join in. If you wish to join, please email me your skype ID today. I will originate the call, which will be voice only (a skype constraint on conference calls). My skype id is yayahjb. I'll keep my email open during the call, so if you just manage to get into Skype last minute either send me a Skype chat and/or an email with your skype id and I'll try to add you to the conference call. 
If everything else fails, the land line I will be at is 1-631-286-1339, but that is just to try to coordinate things. I cannot cross-connect the landline to the skype conference. The call will have to end by 9:45 am EDT so I can go teach a class. Regards, Herbert For those in England, 8:45 am EDT is 13:35 BST (12:45 GMT) ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From John.Bollinger at STJUDE.ORG Wed Sep 29 15:38:50 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Wed, 29 Sep 2010 09:38:50 -0500 Subject: [Cif2-encoding] A new(?) compromise position. . In-Reply-To: References: Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDED@SJMEMXMBS11.stjude.sjcrh.local> On Wednesday, September 29, 2010 9:25 AM, James Hester wrote: >Here is a newish compromise: >Encoding: The encoding of CIF2 text streams containing only code points in the ASCII range is not specified. >CIF2 text streams containing any code points outside the ASCII range must be encoded such that the encoding can >be reliably identified from the file contents. At present only UTF8 and UTF16 are considered to satisfy this >constraint. I would accept this compromise, or, preferably, a similar one that substitutes Herb's definition of "text" for "not specified". Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St.
Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From simonwestrip at btinternet.com Wed Sep 29 15:42:20 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Wed, 29 Sep 2010 14:42:20 +0000 (GMT) Subject: [Cif2-encoding] A new(?) compromise position In-Reply-To: References: Message-ID: <749155.47691.qm@web87014.mail.ird.yahoo.com> You won't be surprised to hear my support for this - especially if you've read recent exchanges between Herbert and me regarding compromise. Go for it :-) ________________________________ From: James Hester To: Group for discussing encoding and content validation schemes for CIF2 Sent: Wednesday, 29 September, 2010 15:24:45 Subject: [Cif2-encoding] A new(?) compromise position Here is a newish compromise: Encoding: The encoding of CIF2 text streams containing only code points in the ASCII range is not specified. CIF2 text streams containing any code points outside the ASCII range must be encoded such that the encoding can be reliably identified from the file contents. At present only UTF8 and UTF16 are considered to satisfy this constraint. Commentary: this is intended to mean that encoding works 'as for CIF1' (Proposals 1,2) for files containing only ASCII text, and works as for Proposal 4 for any other files. I believe that this allows legacy workflows to operate smoothly on CIF2 files (legacy workflows do not process non-ASCII text) but also avoids the Tower of Babel effect that will ensue if non-ASCII codepoints are encoded using local conventions. To explain the thinking further, perhaps I could take another stab at Herbert's point of view in my own words. Herbert (I think correctly) surmises that currently used CIF applications do not explicitly specify the encoding of their input and output files, and are therefore conceptually working with CIFs in a variety of local encodings.
Mandating any encoding for CIF2 would therefore force at least some and perhaps most of these applications to change the way they read and write text, which is disruptive and obtuse when the system works fine as it is. Proposals 1 and 2 are aimed at avoiding this disruption. On the other hand, I look at the same situation and see that all this software is in fact reading and writing ASCII, because all of these local encodings are actually equivalent to ASCII for characters used in CIFs, and I further assert that this happy coincidence between encodings is the single reason CIF files are easily transferable between different systems. These two points of view create two different results if the CIF character repertoire is extended beyond the ASCII range. If we allow the current approach to encoding to continue, the happy coincidence of encodings ceases to operate outside the ASCII range and CIF files are no longer easily interchangeable. If we make explicit the commonality of CIF1 encodings by mandating a common set of identifiable encodings, the use of default encodings has to be abandoned with accompanying effort from programmers. I believe that this latest proposal respects Herbert's concerns as well as mine, and is eminently workable as a starting point for going forward. I'm now off to do a sample change and expect unanimous support from all parties when I return in an hour's time :) On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon wrote: I think the crux of issue is as follows: > >[But part of our difficulty is that we are all having separate >epiphanies, and focusing on five different "cruxes". Clarifying >the real divergence between our views would be a genuine benefit of >a Skype conference, to which I have no personal objection.] > >In the real world, a need may arise to exchange CIFs constructed in >non-canonical encodings. ("Canonical" probably means UTF-8 and/or >UTF-16). Such a need would involve some transcoding strategy. 
> >What is the actual likelihood of that need arising? > >I would characterise James's position as "not very, and even less >if the software written to generate CIFs is constrained to use >canonical encodings within the standard". > >I would characterise the position of the rest of us as "reasonable to >high, so that we wish to formulate the standard in a way that >recognises non-canonical encodings and helps to establish or at >least inform appropriate transcoding strategies". There appear to be >strong disagreements among us, but in fact there's a lot of common >ground, and a drafting exercise would probably move us towards a >consensus. > >Do you agree that that is a fair assessment? > >If so, we can analyse further: what are the implications of mandating >a canonical encoding or not if judgement (a) is wrong and if judgement >(b) is wrong? My feeling is that the world will not end - or even >change very much - in any case; but it could determine whether we >need to formulate an optimal transcoding strategy now, or can defer >it to a later date. > >However, if anyone thinks this is just another diversion, I'll drop >this line of approach so as not to slow things down even more. > >Regards >Brian > > >On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. 
Bernstein wrote: >> John, >> >> Now I am totally confused about what you are proposing and agree with Simon >> that what is needed is for you to state your proposal as the precise wording >> that you propose to insert and/or change in the current CIF2 change document >> "5 July 2010: draft of changes to the existing CIF 1.1 specification >> for public discussion" >> >> If I understand your proposal correctly, the _only_ thing you are proposing >> that differs in any way from my proposed motion is a mandate that a >> CIF2 conformant reader must be able to read a UTF8 CIF2 file, but >> that _no_ CIF application would actually be required to provide such >> code, provided there was some mechanism available to transcode from >> UTF8 to the local encoding, >> which does not seem to be a mandate on the conformant CIF2 reader at >> all, but a requirement for the provision of a portable utility to >> do that external transcoding. >> >> If that is the case, wouldn't it make more sense to just provide that >> utility than to argue about whether my motion requires somebody to write >> their own? Having the utility in hand would avoid having multiple, >> conflicting interpretations of this input transcoding requirement. >> >> If I have read your message correctly, please just write the utility you >> are proposing. If I have read your message incorrectly, please >> write the specification changes you propose for the draft changes >> in place of the changes in my motion. >> >> _This_ is why it was, is, and will remain a good idea to simply have >> a meeting and talk these things out. -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Wed Sep 29 15:45:47 2010 From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Wed, 29 Sep 2010 10:45:47 -0400 (EDT) Subject: [Cif2-encoding] A new(?) compromise position In-Reply-To: References: Message-ID: Dear James, I respect the attempt to compromise, but the sentence "At present only UTF8 and UTF16 are considered to satisfy this constraint" is not quite right without some additional work on the spec. UTF16 with a BOM is self-identifying. UTF8 with a BOM is also self-identifying. However, UTF8 without a BOM and without some other disambiguator (e.g. the accented o's) is _not_ self-identifying. I know, because my students and I hit this problem all the time in working with multi-language, multi-code-page message catalogs for RasMol. Sometimes the only way we can figure out whether a UTF8 file is really a UTF8 file is to start translating the actual strings and see if they make sense. Another problem is what the "ASCII range" means to various people. I suggest being much more restrictive and saying "the printable ASCII characters, code points 32-126 plus CR, LF and HT". Combined, the statement I would suggest is: "If a CIF2 text stream contains only characters equivalent to the printable ASCII characters plus HT, LF and CR, i.e. decimal code points 32-126, 9, 10 and 13, then to ensure compatibility with CIF1, the CIF2 specification does not require any explicit specification of the particular encoding used, but recommends the use of UTF8. If a CIF2 text stream contains any characters equivalent to Unicode code points not in that range, then for any encoding other than UTF8 it is the responsibility of any application writing such a CIF to unambiguously specify the particular encoding used, preferably within the file itself. UTF16 with a BOM conforms to this requirement." Regards, Herbert ===================================================== Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu =====================================================
From John.Bollinger at STJUDE.ORG Wed Sep 29 16:28:04 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Wed, 29 Sep 2010 10:28:04 -0500 Subject: [Cif2-encoding] How we wrap this up In-Reply-To: <20100929102536.GB24670@emerald.iucr.org> References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDE3@SJMEMXMBS11.stjude.sjcrh.local> <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDEE@SJMEMXMBS11.stjude.sjcrh.local> This is perhaps irrelevant if James's compromise gains as much traction as it seems poised to do, but On Wednesday, September 29, 2010 5:26 AM, Brian McMahon wrote: [...] >In the real world, a need may arise to exchange CIFs constructed in >non-canonical encodings. ("Canonical" probably means UTF-8 and/or >UTF-16). Such a need would involve some transcoding strategy. > >What is the actual likelihood of that need arising? > >I would characterise James's position as "not very, and even less >if the software written to generate CIFs is constrained to use >canonical encodings within the standard". > >I would characterise the position of the rest of us as "reasonable to >high, so that we wish to formulate the standard in a way that >recognises non-canonical encodings and helps to establish or at >least inform appropriate transcoding strategies". I was about to deny that as a valid characterization of my position, but after some consideration I realized that it does cover me. Good wordsmithing. My divergence from the pack is probably over the mechanism by which I suppose CIFs must be exchanged. I view CIFs constructed via most encodings as inherently unsuitable for exchange, at least if they contain non-ASCII characters.
Hence, the needed transcoding strategy (absent some established agreement otherwise) must be for the originator of the exchange to first transcode into a canonical encoding. From there springs my continued advocacy for options that in fact provide a canonical encoding. >There appear to be >strong disagreements among us, but in fact there's a lot of common >ground, and a drafting exercise would probably move us towards a >consensus. > >Do you agree that that is a fair assessment? Yes. >If so, we can analyse further: what are the implications of mandating >a canonical encoding or not if judgement (a) is wrong and if judgement >(b) is wrong? My feeling is that the world will not end - or even >change very much - in any case; but it could determine whether we >need to formulate an optimal transcoding strategy now, or can defer >it to a later date. > >However, if anyone thinks this is just another diversion, I'll drop >this line of approach so as not to slow things down even more. This may be a useful avenue to pursue, but I suggest we table it pending the response to James's compromise proposal. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From jamesrhester at gmail.com Wed Sep 29 16:31:14 2010 From: jamesrhester at gmail.com (James Hester) Date: Thu, 30 Sep 2010 01:31:14 +1000 Subject: [Cif2-encoding] A new(?) compromise position In-Reply-To: References: Message-ID: Hi Herbert (I should be in bed, but whatever): I do not think it is appropriate to require the *application* to unambiguously identify the encoding, as no widely-recognised standard procedure exists to do this. The means of identification should rather be based on the international standard describing the encoding. Only UTF16 and UTF8 currently meet this requirement, I believe. I will try to express this better after a sleep... 
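As a rough illustration of identification that relies only on what the Unicode standard itself defines, the following Python sketch looks for a byte order mark and otherwise accepts only well-formed UTF-8. The function name `identify_encoding` and the choice to return `None` for everything else are assumptions made for this example; nothing here is taken from a CIF2 draft.

```python
import codecs


def identify_encoding(data: bytes):
    """Return 'utf-8', 'utf-16', or None if neither can be identified.

    Hypothetical helper for illustration only.
    """
    # The UTF-8 signature (EF BB BF) is optional, but unambiguous when present.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # UTF-16 byte order marks: FF FE (little endian) or FE FF (big endian).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No signature: accept only byte sequences that are well-formed UTF-8.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

A file in a single-byte encoding can occasionally also be well-formed UTF-8, so a positive answer without a BOM is strong evidence rather than proof. Pure ASCII input is reported as UTF-8, since every ASCII byte sequence is valid UTF-8.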
Regarding UTF8: I'm glad to see such vigilance in the cause of correctly identifying file encoding. A UTF8 file, naturally, can also look like a file in a variety of single-byte encodings regardless of a BOM at the front. However, a file in a non-UTF8 encoding is highly unlikely to be mistaken for a UTF8 file. Therefore, providing an input file is first checked for UTF8 encoding, I do not see any significant danger of a mistaken encoding. I'd be happy to include recommendations to use a UTF8 BOM and to check for UTF8 encoding before any others that we may eventually add to the list. I'm curious to see what these files are that you have trouble identifying as UTF8, as they may represent obscure corner cases. Any chance you could dig one or two up? James. -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148
From bm at iucr.org Wed Sep 29 16:56:03 2010 From: bm at iucr.org (Brian McMahon) Date: Wed, 29 Sep 2010 16:56:03 +0100 Subject: [Cif2-encoding] A new(?) compromise position In-Reply-To: References: Message-ID: <20100929155603.GA9755@emerald.iucr.org> On Thu, Sep 30, 2010 at 12:24:45AM +1000, James Hester wrote: > Here is a newish compromise: > > Encoding: The encoding of CIF2 text streams containing only code points in > the ASCII range is not specified. CIF2 text streams containing any code > points outside the ASCII range must be encoded such that the encoding can be > reliably identified from the file contents. At present only UTF8 and UTF16 > are considered to satisfy this constraint. I concur with the principle of this statement (Herbert and John have already demonstrated that some additional drafting effort may be needed to eliminate or reduce ambiguity). It leaves the door open for additional encodings that are deemed to satisfy the constraint of providing self-identification, perhaps through key signatures or hashcodes; but there is then an onus on a community wishing to adopt such an encoding to develop and publish the criteria that make it self-identifying. (I'm not *encouraging* this to happen; the irony of our lengthy debate is that we all seem to be trying to achieve the promulgation of a canonical encoding.)
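The rule under discussion can be sketched in a few lines of Python. This is only an illustration: the function name `encoding_requirement`, the two returned labels, and the use of Herbert's narrower range (code points 32-126 plus HT, LF and CR) in place of the full ASCII range are assumptions for the example, not agreed wording.

```python
# Herbert's suggested range: printable ASCII (32-126) plus HT, LF and CR.
RESTRICTED_ASCII = frozenset(range(32, 127)) | {9, 10, 13}


def encoding_requirement(text: str) -> str:
    """Classify already-decoded CIF2 text under the proposed compromise.

    Returns "unspecified" when every character is in the restricted range,
    so the encoding is left open as for CIF1. Returns "identifiable" when
    any other code point occurs, meaning the file would have to be in an
    encoding that can be recognised from its contents (UTF8 or UTF16 at
    present). Both labels are hypothetical.
    """
    if all(ord(ch) in RESTRICTED_ASCII for ch in text):
        return "unspecified"
    return "identifiable"


print(encoding_requirement("_cell_length_a 5.43\n"))
print(encoding_requirement("_publ_author_name 'Müller'\n"))
```

Under these assumptions the first call reports that no encoding need be specified, while the second, which contains a code point outside the restricted range, falls under the identifiable-encoding requirement.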

Best wishes
Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail: bm at iucr.org
5 Abbey Square, Chester CH1 2HU, England

From bm at iucr.org Wed Sep 29 17:01:10 2010
From: bm at iucr.org (Brian McMahon)
Date: Wed, 29 Sep 2010 17:01:10 +0100
Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010
References: <646265.82162.qm@web87004.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local>
 <20100929102536.GB24670@emerald.iucr.org>
Message-ID: <20100929160110.GB9755@emerald.iucr.org>

Dear Herbert

My Skype ID is brian.mc.mahon

I'll try to join in the conference, though I'm not certain that I will be
free at that time. If I fail to materialise, please start without me.

> For those in England, 8:45 am EDT is 13:35 BST
                        ^^^^^^^^^^^^^^^^^^^^^^^^

Presumably a general relativistic effect! I'll aim to sign in at quarter to
two local time.

Best wishes
Brian

On Wed, Sep 29, 2010 at 10:16:24AM -0400, Herbert J. Bernstein wrote:
> Dear Colleagues,
>
> James and I are going to try a Skype conference call at 10:45 pm
> (22:45) AEST, 8:45 am EDT on Thursday, 30 September 2010. All are welcome
> to join in.
>
> If you wish to join, please email me your skype ID today. I will
> originate the call, which will be voice only (a skype constraint on
> conference calls). My skype id is yayahjb. I'll keep my email open
> during the call, so if you just manage to get into Skype last minute
> either send me a Skype chat and/or an email with your skype id and
> I'll try to add you to the conference call. If everything else fails,
> the land line I will be at is 1-631-286-1339, but that is just to
> try to coordinate things. I cannot cross-connect the landline to the
> skype conference.
>
> The call will have to end by 9:45 am EDT so I can go teach a class.
>
> Regards,
> Herbert
>
> For those in England, 8:45 am EDT is 13:35 BST (12:45 GMT)

From yaya at bernstein-plus-sons.com Wed Sep 29 17:28:44 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Wed, 29 Sep 2010 12:28:44 -0400 (EDT)
Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010
In-Reply-To: <20100929160110.GB9755@emerald.iucr.org>
References: <646265.82162.qm@web87004.mail.ird.yahoo.com>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local>
 <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local>
 <20100929102536.GB24670@emerald.iucr.org>
 <20100929160110.GB9755@emerald.iucr.org>

Sorry ... 13:45 BST.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

On Wed, 29 Sep 2010, Brian McMahon wrote:

> Dear Herbert
>
> My Skype ID is brian.mc.mahon
>
> I'll try to join in the conference, though I'm not certain that I will be
> free at that time. If I fail to materialise, please start without me.
>
>> For those in England, 8:45 am EDT is 13:35 BST
>                        ^^^^^^^^^^^^^^^^^^^^^^^^
>
> Presumably a general relativistic effect! I'll aim to sign in at quarter to
> two local time.
>
> Best wishes
> Brian

From yaya at bernstein-plus-sons.com Wed Sep 29 18:00:18 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Wed, 29 Sep 2010 13:00:18 -0400 (EDT)
Subject: [Cif2-encoding] A new(?) compromise position

Dear James,

I know from long and painful experience that files with just a few
accented characters are very, very difficult to clearly identify, and can
look like valid UTF8 files. UTF8 is _not_ self-identifying without the BOM.

The case that really convinced me that there was a problem was a French
document containing a lower-case e with an acute accent. I nearly missed a
misencoding of a Mac-native file: because it was being misread as a UTF8
file, the character appeared as a capital E with a grave accent.

There are simply too many cases like that, in which a file written in a
non-UTF8 encoding looks like something reasonable, but wrong, to say that
UTF8 without the BOM is self-identifying.

As for the question of standards and applications, many programming
language standards specify the action of processors of the language.
In our case, to have a meaningful standard, we need to specify what
is a syntactically valid CIF2 file, to specify the semantics for
a compliant CIF2 reader and to specify the required actions for
a compliant CIF2 writer. We need to do so in a way that breaks
as few existing applications as possible.

I believe that applications are highly relevant to what we are trying to
do. In particular, I favor strict rules on writers and liberal rules
on readers, so that files get processed when possible, but tend to get
cleaned up when being processed.

That same frame of mind is why a lot of text editors invisibly add
a BOM at the start of all UTF8 files, but try to accept UTF8 files
with or without the BOM.

Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya at dowling.edu
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

> Hi Herbert (I should be in bed, but whatever): I do not think it is
> appropriate to require the *application* to unambiguously identify the
> encoding, as no widely-recognised standard procedure exists to do this. The
> means of identification should rather be based on the international standard
> describing the encoding. Only UTF16 and UTF8 currently meet this
> requirement, I believe. I will try to express this better after a sleep...
>
> Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
> identifying file encoding. A UTF8 file, naturally, can also look like a file
> in a variety of single-byte encodings regardless of a BOM at the front.
> However, a file in a non-UTF8 encoding is highly unlikely to be mistaken for
> a UTF8 file. Therefore, provided an input file is first checked for UTF8
> encoding, I do not see any significant danger of a mistaken encoding. I'd
> be happy to include recommendations to use a UTF8 BOM and to check for UTF8
> encoding before any others that we may eventually add to the list.
>
> I'm curious to see what these files are that you have trouble identifying as
> UTF8, as they may represent obscure corner cases. Any chance you could dig
> one or two up?
>
> James.
>
> On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein wrote:
> > Dear James,
> >
> > I respect the attempt to compromise, but the sentence "At present only
> > UTF8 and UTF16 are considered to satisfy this constraint" is not quite
> > right without some additional work on the spec. UTF16 with a BOM is
> > self-identifying. UTF8 with a BOM is also self-identifying. However,
> > UTF8 without a BOM and without some other disambiguator (e.g. the
> > accented o's), is _not_ self identifying. I know, because my students
> > and I hit this problem all the time in working with multi-language,
> > multi-code-page message catalogs for RasMol. Sometimes the only way
> > we can figure out whether a UTF8 file is really a UTF8 file is to
> > start translating the actual strings and see if they make sense.
> >
> > Another problem is what the "ASCII range" means to various people.
> > I suggest being much more restrictive and saying "the printable
> > ASCII characters, code points 32-126 plus CR, LF and HT".
> >
> > Combined, the statement I would suggest is:
> >
> > If a CIF2 text stream contains only characters equivalent to the
> > printable ASCII characters plus HT, LF and CR, i.e. decimal code
> > points 32-126, 9, 10 and 13, then to ensure compatibility with
> > CIF1, the CIF2 specification does not require any explicit
> > specification of the particular encoding used, but recommends
> > the use of UTF8. If a CIF2 text stream contains any characters
> > equivalent to Unicode code points not in that range, then for
> > any encoding other than UTF8 it is the responsibility of any
> > application writing such a CIF to unambiguously specify the
> > particular encoding used, preferably within the file itself.
> > UTF16 with a BOM conforms to this requirement.
> >
> > Regards,
> > Herbert
> >
> > On Thu, 30 Sep 2010, James Hester wrote:
> > > Here is a newish compromise:
> > >
> > > Encoding: The encoding of CIF2 text streams containing only code
> > > points in the ASCII range is not specified. CIF2 text streams
> > > containing any code points outside the ASCII range must be encoded
> > > such that the encoding can be reliably identified from the file
> > > contents. At present only UTF8 and UTF16 are considered to satisfy
> > > this constraint.
> > >
> > > Commentary: this is intended to mean that encoding works 'as for
> > > CIF1' (Proposals 1,2) for files containing only ASCII text, and works
> > > as for Proposal 4 for any other files. I believe that this allows
> > > legacy workflows to operate smoothly on CIF2 files (legacy workflows
> > > do not process non-ASCII text) but also avoids the tower of Babel
> > > effect that will ensue if non-ASCII codepoints are encoded using
> > > local conventions.
> > >
> > > To explain the thinking further, perhaps I could take another stab at
> > > Herbert's point of view in my own words. Herbert (I think correctly)
> > > surmises that all currently used CIF applications do not explicitly
> > > specify the encoding of their input and output files, and therefore
> > > are conceptually working with CIFs in a variety of local encodings.
> > > Mandating any encoding for CIF2 would therefore force at least some
> > > and perhaps most of these applications to change the way they read
> > > and write text, which is disruptive and obtuse when the system works
> > > fine as it is. Proposals 1 and 2 are aimed at avoiding this
> > > disruption.
> > >
> > > On the other hand, I look at the same situation and see that all this
> > > software is in fact reading and writing ASCII, because all of these
> > > local encodings are actually equivalent to ASCII for characters used
> > > in CIFs, and I further assert that this happy coincidence between
> > > encodings is the single reason CIF files are easily transferable
> > > between different systems.
> > >
> > > These two points of view create two different results if the CIF
> > > character repertoire is extended beyond the ASCII range. If we allow
> > > the current approach to encoding to continue, the happy coincidence
> > > of encodings ceases to operate outside the ASCII range and CIF files
> > > are no longer easily interchangeable. If we make explicit the
> > > commonality of CIF1 encodings by mandating a common set of
> > > identifiable encodings, the use of default encodings has to be
> > > abandoned with accompanying effort from programmers.
> > >
> > > I believe that this latest proposal respects Herbert's concerns as
> > > well as mine, and is eminently workable as a starting point for going
> > > forward. I'm now off to do a sample change and expect unanimous
> > > support from all parties when I return in an hour's time :)
> > >
> > > On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon wrote:
> > > > I think the crux of the issue is as follows:
> > > >
> > > > [But part of our difficulty is that we are all having separate
> > > > epiphanies, and focusing on five different "cruxes". Clarifying
> > > > the real divergence between our views would be a genuine benefit of
> > > > a Skype conference, to which I have no personal objection.]
> > > >
> > > > In the real world, a need may arise to exchange CIFs constructed in
> > > > non-canonical encodings. ("Canonical" probably means UTF-8 and/or
> > > > UTF-16). Such a need would involve some transcoding strategy.
> > > >
> > > > What is the actual likelihood of that need arising?
> > > >
> > > > I would characterise James's position as "not very, and even less
> > > > if the software written to generate CIFs is constrained to use
> > > > canonical encodings within the standard".
> > > >
> > > > I would characterise the position of the rest of us as "reasonable to
> > > > high, so that we wish to formulate the standard in a way that
> > > > recognises non-canonical encodings and helps to establish or at
> > > > least inform appropriate transcoding strategies". There appear to be
> > > > strong disagreements among us, but in fact there's a lot of common
> > > > ground, and a drafting exercise would probably move us towards a
> > > > consensus.
> > > >
> > > > Do you agree that that is a fair assessment?
> > > >
> > > > If so, we can analyse further: what are the implications of mandating
> > > > a canonical encoding or not if judgement (a) is wrong and if judgement
> > > > (b) is wrong? My feeling is that the world will not end - or even
> > > > change very much - in any case; but it could determine whether we
> > > > need to formulate an optimal transcoding strategy now, or can defer
> > > > it to a later date.
> > > >
> > > > However, if anyone thinks this is just another diversion, I'll drop
> > > > this line of approach so as not to slow things down even more.
> > > >
> > > > Regards
> > > > Brian

From John.Bollinger at STJUDE.ORG Wed Sep 29 22:44:21 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Wed, 29 Sep 2010 16:44:21 -0500
Subject: [Cif2-encoding] A new(?) compromise position. .
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDEF@SJMEMXMBS11.stjude.sjcrh.local>

Herbert,

On Wednesday, September 29, 2010 12:00 PM, Herbert J. Bernstein wrote:

> I know from long and painful experience that files with just a few
> accented characters are very, very difficult to clearly identify, and can
> look like valid UTF8 files. UTF8 is _not_ self-identifying without the BOM.

UTF-8 is not deterministically self-identifying either with or without a
BOM. It is always conceivable that the supposed BOM bytes are intended as
ordinary encoded characters in some other encoding. On the other hand, I
don't know of any other character encoding in which a well-formed CIF could
begin with the bytes of a UTF-8 BOM. Perhaps that makes a BOM sufficient
for our purposes.

> The case that really convinced me that there was a problem was a French
> document containing a lower-case e with an acute accent. I nearly missed a
> misencoding of a Mac-native file: because it was being misread as a UTF8
> file, the character appeared as a capital E with a grave accent.

I wonder whether you and James are thinking along different lines. The
error you describe indeed sounds tricky to catch by eye, but isolated
non-ASCII characters should never pose a problem for computer detection of
UTF-8 vs. single-byte encodings.
For ASCII-compatible, non-UTF-8 encoded text to be simultaneously valid
UTF-8 requires that non-ASCII character codes occur only in particular 2-,
3-, and 4-byte patterns. That can never be the case for isolated non-ASCII
characters, thus even one such isolated character is enough to let a
decoder determine that its input is not valid UTF-8. That does not,
however, prevent a UTF-8 decoder from attempting to recover in some way
from a decoding error. Most that I have dealt with do so by default, often
with no message to the caller. An application developer relying on his
decoder to catch invalid UTF-8 must therefore ensure that it is able to
signal decoding errors to its caller and is configured to do so.

Alternatively, are you confident that in this example you have laid the
blame in the right place? I can only speculate, but I note that a double
decoding (UTF-8 decoding followed by interpreting the result as Mac OS
Roman) would have exactly the result you describe. Likewise, a confusion
between Mac OS Roman (assumed) and ISO-8859-1 (actual) would do the same.
I'm having trouble, however, figuring out how an error recovery mechanism
might have this result, and if the character in question was between ASCII
characters (highly likely for French) then the result cannot have arisen
from an error-free, yet wrong UTF-8 decoding.

> There are simply too many cases like that, in which a file written in a
> non-UTF8 encoding looks like something reasonable, but wrong, to say that
> UTF8 without the BOM is self-identifying.

I have no strong objection at the moment to requiring a BOM to identify
UTF-8 CIFs. Nevertheless, I'm not yet persuaded that the risk of
mis-identifying a byte stream as UTF-8 is so significant. As far as I can
tell, the worst possible case is when the true encoding is ASCII-compatible
(but not UTF-8) and among the input are exactly two encoded non-ASCII
characters, adjacent to each other.
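
A minimal sketch of the points above, assuming Python 3 and only its standard codecs; the helper name `is_valid_utf8` and the sample string are illustrative and do not come from the thread. It shows that strict decoding rejects an isolated non-ASCII byte, that a lenient decoder hides the problem from its caller, and that ISO-8859-1 bytes read as Mac OS Roman turn a lower-case e-acute into a capital E-grave. It also counts, by brute force, how many adjacent pairs of non-ASCII bytes happen to form valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if `data` decodes as UTF-8 with no error recovery."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


sample = "caf\u00e9 noir"                 # one isolated non-ASCII character
mac_bytes = sample.encode("mac_roman")    # e-acute is the single byte 0x8E
latin_bytes = sample.encode("latin-1")    # e-acute is the single byte 0xE9

# An isolated high byte can never form a complete UTF-8 sequence.
strict_results = (is_valid_utf8(mac_bytes), is_valid_utf8(latin_bytes))

# A lenient decoder recovers silently, so the caller never learns of the error.
lenient = mac_bytes.decode("utf-8", errors="replace")

# ISO-8859-1 bytes read as Mac OS Roman: e-acute shows up as capital E-grave.
confused = latin_bytes.decode("mac_roman")

# Worst case: two adjacent non-ASCII bytes. Count the pairs that are valid UTF-8.
high = range(0x80, 0x100)
valid_pairs = sum(is_valid_utf8(bytes((a, b))) for a in high for b in high)
pair_fraction = valid_pairs / (len(high) ** 2)

print(strict_results, lenient, confused, f"{pair_fraction:.1%}")
```

The strict check is the part an application would rely on; the enumeration is only there to make the size of the worst case concrete.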
Assuming equal probability of all non-ASCII characters, the likelihood of
the byte stream being valid UTF-8 is around 12%. If there are two such
pairs or one triple, then the likelihood drops to under 3%. It goes down
rapidly from there with additional non-ASCII characters (to zero if any
occur isolated). The likelihood of such an input being presented in the
first place must also be factored in, including whatever influence may be
exerted by the fact that the file would not be valid CIF (on account of
using non-ASCII characters but not being encoded in UTF-8 or UTF-16). It's
hard to gauge the actual risk, but I'm with James in estimating it to be
very low.

UTF-16 is different. Because the first character of a well-formed CIF
(ignoring any BOM) must be from the ASCII subset, and because CIF does not
allow character U+0000, it is always possible to distinguish a well-formed
CIF encoded in UTF-16 from a well-formed CIF encoded in any
ASCII-compatible encoding, EBCDIC, or most, if not all, other encodings.
Even BE vs. LE can be readily distinguished for CIF, but note that UTF-16
without a byte-order mark is BE by definition. (Refer to Unicode 5.2,
section 3.10, definition 98.) With that said, I also have no strong
objection at the moment to requiring a BOM to identify UTF-16 CIFs, though
I don't see much advantage to it.

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From jamesrhester at gmail.com Thu Sep 30 02:49:19 2010
From: jamesrhester at gmail.com (James Hester)
Date: Thu, 30 Sep 2010 11:49:19 +1000
Subject: [Cif2-encoding] A new(?) compromise position

My simple objective (for files containing non-ASCII characters) is that an
application is able to determine the encoding of an incoming file with a
high degree of certainty with no information beyond the CIF standard, the
encoding standard, and the file contents.
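
One way such content-only identification might look when the candidates are limited to UTF8 and UTF16; this is a sketch assuming Python 3, and the helper name `sniff_cif2_encoding` is hypothetical. It relies only on observations already made in this thread: a BOM settles the matter when present, a well-formed CIF starts with an ASCII character and never contains U+0000, and anything else has to pass strict UTF-8 decoding. UTF-32 and every other encoding are deliberately not considered.

```python
import codecs


def sniff_cif2_encoding(data: bytes) -> str:
    """Guess a Python codec name for `data`, or raise ValueError.

    Only UTF-8 and UTF-16 are candidates; the guess is confirmed by a
    strict decode so that recovery never masks a wrong choice.
    """
    if data.startswith(codecs.BOM_UTF8):
        guess = "utf-8-sig"        # codec strips the BOM
    elif data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        guess = "utf-16"           # codec reads the BOM to pick the byte order
    elif len(data) >= 2 and data[0] == 0 and data[1] != 0:
        guess = "utf-16-be"        # ASCII first character, high byte first
    elif len(data) >= 2 and data[0] != 0 and data[1] == 0:
        guess = "utf-16-le"        # ASCII first character, low byte first
    else:
        guess = "utf-8"
    try:
        data.decode(guess, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid {guess}") from exc
    return guess
```

A caller would pass the returned name straight to `bytes.decode`; a `ValueError` means the file falls outside the two permitted encodings rather than that a guess should be attempted.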
If the only choices are UTF8 or UTF16 there is no danger of a
misassignment of encoding. Furthermore, as UTF8 files have a distinctive
bit-pattern, non-UTF8 files can be reliably detected(*). This appears to
me to be an excellent state of affairs.

Although I have apparently restricted encodings to UTF8 and UTF16 in the
preceding paragraph, this is simply in an effort to get something workable
on the table that we can move forward with. I have no particular agenda to
limit in future the possible encodings for CIF files, provided that those
encodings can be reliably identified subject to the above restrictions.
Indeed, this particular group was formed in part to work out a system for
including those other encodings.

I realise my wordsmithing on the new proposal is somewhat lax, but if we
are in agreement on the principle I hope we are able to polish it up to
everybody's satisfaction.

James.

(*) John's email has addressed the UTF8 question adequately, and the
Wikipedia entry also contains useful discussion.

On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <yaya at bernstein-plus-sons.com> wrote:

> Dear James,
>
> I know from long and painful experience that files with just a few
> accented characters are very, very difficult to clearly identify, and can
> look like valid UTF8 files. UTF8 is _not_ self-identifying without the BOM.
>
> The case that really convinced me that there was a problem was a French
> document containing a lower-case e with an acute accent. I nearly missed a
> misencoding of a Mac-native file: because it was being misread as a UTF8
> file, the character appeared as a capital E with a grave accent.
>
> There are simply too many cases like that, in which a file written in a
> non-UTF8 encoding looks like something reasonable, but wrong, to say that
> UTF8 without the BOM is self-identifying.
>
> As for the question of standards and applications, many programming
> language standards specify the action of processors of the language.
> In our case, to have a meaningful standard, we need to specify what
> is a syntactically valid CIF2 file, to specify the semantics for
> a compliant CIF2 reader and specify the required actions for
> a compliant CIF2 writer. We need to do so in a way that breaks
> as few existing applications as possible.
>
> I believe that applications are highly relevant to what we are trying to
> do. In particular, I favor strict rules on writers and liberal rules
> on readers, so that files get processed when possible, but tend to get
> cleaned up when being processed.
>
> That same frame of mind is why a lot of text editors invisibly add
> a BOM at the start of all UTF8 files, but try to accept UTF8 files
> with or without the BOM.
>
> Regards,
> Herbert
>> > >> > >> > >> >> -- >> T +61 (02) 9717 9907 >> F +61 (02) 9717 3145 >> M +61 (04) 0249 4148 >> >> > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Thu Sep 30 03:20:06 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Wed, 29 Sep 2010 22:20:06 -0400 (EDT) Subject: [Cif2-encoding] A new(?) compromise position In-Reply-To: References: Message-ID: Dear James, You are mistaken: John said the opposite about determining UTF8 from context. The place where he and I differ is that he thinks you can't do it even with a BOM, and I am willing to accept UTF8 with a BOM as sufficiently disambiguated. The Wikipedia says no such thing about being able to reliably detect non-UTF8 files. Let us use the Wikipedia's very own UTF8 example as a test case: the code point ¢, U+00A2 = 00000000 10100010, is encoded as 11000010 10100010, which is 0xC2 0xA2 as a UTF8 byte string. Now let us look at the Latin-1 code page at http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm, which tells us that 0xC2 is Â in Latin-1 and 0xA2 is ¢. There is no evidence that I have seen anywhere to support your position that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably detected." All the evidence I have seen points the other way.
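The byte-level claim in this example is easy to check. The following is a minimal Python sketch, added as an illustration and not taken from the thread; the helper name is_valid_utf8 and the sample string "café au lait" are assumptions chosen for the demonstration. It shows that the two bytes 0xC2 0xA2 decode without error both as UTF8 and as Latin-1, while Latin-1 text containing an isolated accented letter is rejected by a strict UTF8 decoder.

```python
# Illustrative sketch only; not code from the original discussion.

# The two-byte string from the example above.
data = bytes([0xC2, 0xA2])

# Read as UTF-8, the pair is a single character, U+00A2 (cent sign).
as_utf8 = data.decode("utf-8")

# Read as Latin-1, every byte is its own character: U+00C2 then U+00A2.
as_latin1 = data.decode("latin-1")

print([hex(ord(c)) for c in as_utf8])
print([hex(ord(c)) for c in as_latin1])


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumed sample text: Latin-1 with one isolated accented letter.
# The byte 0xE9 announces a multi-byte UTF-8 sequence, but the next
# byte (a space) is not a continuation byte, so decoding fails.
latin1_text = "caf\u00e9 au lait".encode("latin-1")

print(is_valid_utf8(latin1_text))  # rejected as UTF-8
print(is_valid_utf8(data))         # accepted, although it may be Latin-1
```

A strict UTF8 check therefore catches many, but not all, files written in a single-byte encoding, which is the gap between the two positions in this exchange.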
James, please look at the facts -- UTF8 without a BOM does not fit your characterization of being self-identifying. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Thu, 30 Sep 2010, James Hester wrote: > My simple objective (for files containing non-ASCII characters) is that an application is > able to determine the encoding of an incoming file with a high degree of certainty with no > information beyond the CIF standard, the encoding standard, and the file contents. If the > only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding. > Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably > detected(*). This appears to me to be an excellent state of affairs. > > Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding > paragraph, this is simply in an effort to get something workable on the table that we can > move forward with. I have no particular agenda to limit in future the possible encodings > for CIF files, provided that those encodings can be reliably identified subject to the above > restrictions. Indeed, this particular group was formed in part to work out a system for > including those other encodings. > > I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on > the principle I hope we are able to polish it up to everybody's satisfaction. > > James. > > (*) John's email has addressed the UTF8 question in his post adequately, and the Wikipedia > entry also contains useful discussion. > > On Thu, Sep 30, 2010 at 3:00 AM, Herbert J.
Bernstein wrote: > Dear James, > > I know from long and painful experience that files with just a few accented > characters are very, very difficult to clearly identify, and can look like valid > UTF8 files. UTF8 is _not_ self-identifying without the BOM. > > The case that really convinced me that there was a problem was a > French document with a lower-case e with an acute accent. I nearly > missed a misencoding of a Mac-native file that, because it was being misread as > a capital E in a UTF8 file, showed the accent as grave. > > There are simply too many cases like that in which a file written in a non-UTF8 > encoding looks like something reasonable, but wrong, to say that UTF8 without the > BOM is self-identifying. > > As for the question of standards and applications, many programming > language standards specify the action of processors of the language. > In our case, to have a meaningful standard, we need to specify what > is a syntactically valid CIF2 file, to specify the semantics for > a compliant CIF2 reader and specify the required actions for > a compliant CIF2 writer. We need to do so in a way that breaks > as few existing applications as possible. > > I believe that applications are highly relevant to what we are trying to do. > In particular, I favor strict rules on writers and liberal rules > on readers, so that files get processed when possible, but tend to get > cleaned up when being processed. > > That same frame of mind is why a lot of text editors invisibly add > a BOM at the start of all UTF8 files, but try to accept UTF8 files > with or without the BOM. > > > Regards, > Herbert > > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 >
yaya at dowling.edu > ===================================================== > > On Thu, 30 Sep 2010, James Hester wrote: > > Hi Herbert (I should be in bed, but whatever): I do not think it is > appropriate to require the *application* to unambiguously identify the > encoding, as no widely-recognised standard procedure exists to do this. > The > means of identification should rather be based on the international > standard > describing the encoding. Only UTF16 and UTF8 currently meet this > requirement, I believe. I will try to express this better after a > sleep... > > Regarding UTF8: I'm glad to see such vigilance in the cause of correctly > identifying file encoding. A UTF8 file, naturally, can also look like a > file > in a variety of single-byte encodings regardless of a BOM at the front. > However, a file in a non-UTF8 encoding is highly unlikely to be mistaken > for > a UTF8 file. Therefore, provided an input file is first checked for UTF8 > encoding, I do not see any significant danger of a mistaken encoding. I'd > be happy to include recommendations to use a UTF8 BOM and to check for > UTF8 > encoding before any others that we may eventually add to the list. > > I'm curious to see what these files are that you have trouble identifying > as > UTF8, as they may represent obscure corner cases. Any chance you could > dig > one or two up? > > James. > On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein > wrote: > Dear James, > > I respect the attempt to compromise, but the sentence "At > present only UTF8 and UTF16 are considered to satisfy this > constraint" is not quite > right without some additional work on the spec. UTF16 with a > BOM is > self-identifying. UTF8 with a BOM is also self-identifying. > However, > UTF8 without a BOM and without some other disambiguator (e.g. > the > accented o's), is _not_ self-identifying. I know, because my > students >
and I hit this problem all the time in working with > multi-language, > multi-code-page message catalogs for RasMol. Sometimes the only > way > we can figure out whether a UTF8 file is really a UTF8 file is > to > start translating the actual strings and see if they make sense. > > Another problem is what the "ASCII range" means to various > people. > I suggest being much more restrictive and saying "the printable > ASCII characters, code points 32-126 plus CR, LF and HT" > > Combined, the statement I would suggest is: > > If a CIF2 text stream contains only characters equivalent to the > printable ASCII characters plus HT, LF and CR, i.e. decimal code > points 32-126, 9, 10 and 13, then to ensure compatibility with > CIF1, the CIF2 specification does not require any explicit > specification of the particular encoding used, but recommends > the use of UTF8. If a CIF2 text stream contains any characters > equivalent to Unicode code points not in that range, then for > any encoding other than UTF8 it is the responsibility of any > application writing such a CIF to unambiguously specify the > particular encoding used, preferably within the file itself. > UTF16 with a BOM conforms to this requirement. > > Regards, > Herbert > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > > On Thu, 30 Sep 2010, James Hester wrote: > > Here is a newish compromise: > > Encoding: The encoding of CIF2 text streams containing > only code points in the ASCII > range is not specified. CIF2 text streams containing any >
code points outside the ASCII > range must be encoded such that the encoding can be > reliably identified from the file > contents. At present only UTF8 and UTF16 are considered > to satisfy this constraint. > > Commentary: this is intended to mean that encoding works > 'as for CIF1' (Proposals 1, 2) > for files containing only ASCII text, and works as for > Proposal 4 for any other files. > I believe that this allows legacy workflows to operate > smoothly on CIF2 files (legacy > workflows do not process non-ASCII text) but also avoids > the tower of Babel effect that > will ensue if non-ASCII codepoints are encoded using local > conventions. > > To explain the thinking further, perhaps I could take > another stab at Herbert's point of > view in my own words. Herbert (I think correctly) > surmises that all currently used CIF > applications do not explicitly specify the encoding of > their input and output files, and > so therefore are conceptually working with CIFs in a > variety of local encodings. > Mandating any encoding for CIF2 would therefore force at > least some and perhaps most of > these applications to change the way they read and write > text, which is disruptive and > obtuse when the system works fine as it is. Proposals 1 > and 2 are aimed at avoiding > this disruption. > > On the other hand, I look at the same situation and see > that all this software is in > fact reading and writing ASCII, because all of these local > encodings are actually > equivalent to ASCII for characters used in CIFs, and I > further assert that this happy > coincidence between encodings is the single reason CIF > files are easily transferable > between different systems. > > These two points of view create two different results if > the CIF character repertoire is >
extended beyond the ASCII range. If we allow the current > approach to encoding to > continue, the happy coincidence of encodings ceases to > operate outside the ASCII range > and CIF files are no longer easily interchangeable. If we > make explicit the commonality > of CIF1 encodings by mandating a common set of > identifiable encodings, the use of > default encodings has to be abandoned with accompanying > effort from programmers. > > I believe that this latest proposal respects Herbert's > concerns as well as mine, and is > eminently workable as a starting point for going forward. > I'm now off to do a sample > change and expect unanimous support from all parties when > I return in an hour's time :) > > On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon > wrote: > I think the crux of the issue is as follows: > > [But part of our difficulty is that we are all having > separate > epiphanies, and focusing on five different "cruxes". > Clarifying > the real divergence between our views would be a > genuine benefit of > a Skype conference, to which I have no personal > objection.] > > In the real world, a need may arise to exchange CIFs > constructed in > non-canonical encodings. ("Canonical" probably means > UTF-8 and/or > UTF-16). Such a need would involve some transcoding > strategy. > > What is the actual likelihood of that need arising? > > I would characterise James's position as "not very, > and even less > if the software written to generate CIFs is > constrained to use > canonical encodings within the standard". > > I would characterise the position of the rest of us > as "reasonable to > high, so that we wish to formulate the standard in a > way that >
recognises non-canonical encodings and helps to > establish or at > least inform appropriate transcoding strategies". > There appear to be > strong disagreements among us, but in fact there's a > lot of common > ground, and a drafting exercise would probably move > us towards a > consensus. > > Do you agree that that is a fair assessment? > > If so, we can analyse further: what are the > implications of mandating > a canonical encoding or not if judgement (a) is wrong > and if judgement > (b) is wrong? My feeling is that the world will not > end - or even > change very much - in any case; but it could > determine whether we > need to formulate an optimal transcoding strategy > now, or can defer > it to a later date. > > However, if anyone thinks this is just another > diversion, I'll drop > this line of approach so as not to slow things down > even more. > > Regards > Brian > > On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. > Bernstein wrote: > > John, > > > > Now I am totally confused about what you are proposing > and agree with Simon > > that what is needed is for you to state your proposal as > the precise wording > > that you propose to insert and/or change in the current > CIF2 change document > > "5 July 2010: draft of changes to the existing CIF 1.1 > specification > > for public discussion" > > > > If I understand your proposal correctly, the _only_ > thing you are proposing > > that differs in any way from my proposed motion is a > mandate that a > > CIF2 conformant reader must be able to read a UTF8 CIF2 > file, but > > that _no_ CIF application would actually be required to > provide such >
> code, provided there was some mechanism available to > transcode from > > UTF8 to the local encoding, > > which does not seem to be a mandate on the conformant > CIF2 reader at > > all, but a requirement for the provision of a portable > utility to > > do that external transcoding. > > > > If that is the case, wouldn't it make more sense to just > provide that > > utility than to argue about whether my motion requires > somebody to write > > their own? Having the utility in hand would avoid > having multiple, > > conflicting interpretations of this input transcoding > requirement. > > > > If I have read your message correctly, please just write > the utility you > > are proposing. If I have read your message incorrectly, > please > > write the specification changes you propose for the > draft changes > > in place of the changes in my motion. > > > > _This_ is why it was, is, and will remain a good idea to > simply have > > a meeting and talk these things out. > > > > > > > > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 >
M +61 (04) 0249 4148 > > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > From bm at iucr.org Thu Sep 30 09:40:28 2010 From: bm at iucr.org (Brian McMahon) Date: Thu, 30 Sep 2010 09:40:28 +0100 Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010 In-Reply-To: References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> Message-ID: <20100930084028.GC9485@emerald.iucr.org> Dear Herbert, I won't be in the office at the time of the start of this conference. If I get back at a reasonable time, I'll email you with a request to be patched in. Regards Brian On Wed, Sep 29, 2010 at 10:16:24AM -0400, Herbert J. Bernstein wrote: > Dear Colleagues, > > James and I are going to try a Skype conference call at 10:45 pm > (22:45) AEST, 8:45 am EDT on Thursday, 30 September 2010. All are welcome > to join in. > > If you wish to join, please email me your Skype ID today. I will > originate the call, which will be voice only (a Skype constraint on > conference calls). My Skype ID is yayahjb. I'll keep my email open > during the call, so if you only manage to get into Skype at the last minute > either send me a Skype chat and/or an email with your Skype ID and > I'll try to add you to the conference call. If everything else fails, > the land line I will be at is 1-631-286-1339, but that is just to > try to coordinate things.
I cannot cross-connect the landline to the > Skype conference. > > The call will have to end by 9:45 am EDT so I can go teach a class. > > Regards, > Herbert > > For those in England, 8:45 am EDT is 13:45 BST (12:45 GMT) > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding From simonwestrip at btinternet.com Thu Sep 30 10:01:15 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Thu, 30 Sep 2010 09:01:15 +0000 (GMT) Subject: [Cif2-encoding] A new(?) compromise position In-Reply-To: References: Message-ID: <772798.69703.qm@web87009.mail.ird.yahoo.com> "UTF8 without a BOM does not fit your characterization of being self-identifying" I believe this is true, which is why UTF8 has to be specified as the default when the encoding is not self-identifying. With the prescription of their use for 'non-ASCII CIFs' in the spec, UTF8 or 'self-identifying' seems quite satisfactory to me (obviously not worded like this :) - allowing 'ASCII' CIFs to be prepared/used as they always have been, and allowing the encoding of 'non-ASCII CIFs' to be determined with minimal uncertainty. Cheers Simon ________________________________ From: Herbert J. Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Thursday, 30 September, 2010 3:20:06 Subject: Re: [Cif2-encoding] A new(?) compromise position Dear James, You are mistaken: John said the opposite about determining UTF8 from context. The place where he and I differ is that he thinks you can't do it even with a BOM, and I am willing to accept UTF8 with a BOM as sufficiently disambiguated.
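The reading rule that Simon summarises in this message (files in the restricted ASCII range need no declared encoding, a BOM identifies the encoding where present, and UTF8 is the default otherwise) could be sketched roughly as follows. This is only an illustration under stated assumptions, not wording or code agreed by the group: the function name sniff_encoding, the constant CIF1_SAFE and the returned labels are hypothetical, and the "ASCII range" is taken to be Herbert's suggested code points 32-126 plus HT, LF and CR.

```python
import codecs

# Assumption: Herbert's suggested restricted range, i.e. printable
# ASCII (32-126) plus HT (9), LF (10) and CR (13).
CIF1_SAFE = frozenset(range(32, 127)) | {9, 10, 13}


def sniff_encoding(raw: bytes) -> str:
    """Hypothetical helper: pick a Python codec name for a CIF byte stream.

    A BOM is treated as an explicit declaration. Without one, a stream
    restricted to CIF1_SAFE is reported as ASCII, and anything else
    falls back to the UTF-8 default.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM when decoding
    # Note: a UTF-32 little-endian BOM begins with the same two bytes
    # as the UTF-16 one; UTF-32 is outside the scope of this sketch.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to choose byte order
    if all(byte in CIF1_SAFE for byte in raw):
        return "ascii"
    return "utf-8"  # default when nothing identifies the encoding


for sample in (
    b"data_example\n_cell_length_a 5.4\n",
    codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8"),
    "caf\u00e9".encode("utf-16"),
    "caf\u00e9".encode("utf-8"),
):
    name = sniff_encoding(sample)
    print(name, repr(sample.decode(name)))
```

The final fallback is where the disagreement in this thread sits: a file with no BOM that was actually written in a local single-byte encoding either fails to decode as UTF8 or, less often, decodes into the wrong characters.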
The Wikipedia says no such thing about being able to reliably detect non-UTF8 files. Let us use the Wikipedia's very own UTF8 example as a test case: the code point ¢, U+00A2 = 00000000 10100010, is encoded as 11000010 10100010, which is 0xC2 0xA2 as a UTF8 byte string. Now let us look at the Latin-1 code page at http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm, which tells us that 0xC2 is Â in Latin-1 and 0xA2 is ¢. There is no evidence that I have seen anywhere to support your position that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably detected." All the evidence I have seen points the other way. James, please look at the facts -- UTF8 without a BOM does not fit your characterization of being self-identifying. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Thu, 30 Sep 2010, James Hester wrote: > My simple objective (for files containing non-ASCII characters) is that an application is > able to determine the encoding of an incoming file with a high degree of certainty with no > information beyond the CIF standard, the encoding standard, and the file contents. If the > only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding. > Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably > detected(*). This appears to me to be an excellent state of affairs. > > Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding > paragraph, this is simply in an effort to get something workable on the table that we can > move forward with.
I have no particular agenda to limit in future the possible >encodings > for CIF files, provided that those encodings can be reliably identified subject >to the above > restrictions. Indeed, this particular group was formed in part to work out a >system for > including those other encodings. > > I realise my wordsmithing on the new proposal is somewhat lax, but if we are in >agreement on > the principle I hope we are able to polish it up to everybody's satisfaction. > > James. > > (*) John's email has addressed the UTF8 question in his post adequately, and >the wikipedia > entry also contains useful discussion. > > On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein > wrote: > Dear James, > > I know from long and painful experience that files with just a few >accented > characters are very, very difficult to clearly identify, and can look >like valid > UTF8 files. UTF8 is _not_ self-identifying without the BOM. > > The case that really convinced me that there was a problem was a > French document with a lower case e with an accent acute on the E. I >nearly > missed a misencoding of a mac native file that because it was being >misread as > a capital E in a UTF8 file showed the accent as grave. > > There are simply too many cases like that in which a file written in a >non-UTF8 > encoding looks like something reasonable, but wrong, to say that UTF >without the > BOM is self-identifying. > > As for the question of standards and applications, many programming > language standards specify the action of processors of the language. > In our case, to have a meaninful standard, we need to specify what > is a syntactically valid CIF2 file, to specify the semantics for > a compliant CIF2 reader and specify the required actions for > a compliant CIF2 writer. We need to do so in a way that breaks > as few existing applications as possible. > > I believe that applications are highly relevant to what we are trying to >do. 
> In particular, I favor strict rules on writers and liberal rules > on readers, so that files get processed when possible, but tend to get > cleaned up when being processed. > > That same frame of mind is why a lot of text editors invisibly add > a BOM at the start of all UTF8 files, but try to accept UTF8 files > with or without the BOM. > > > Regards, > Herbert > > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Thu, 30 Sep 2010, James Hester wrote: > > Hi Herbert (I should be in bed, but whatever): I do not think it is > appropriate to require the *application* to unambiguously identify the > encoding, as no widely-recognised standard procedure exists to do this. > The > means of identification should rather be based on the international > standard > describing the encoding. Only UTF16 and UTF8 currently meet this > requirement, I believe. I will try to express this better after a > sleep... > > Regarding UTF8: I'm glad to see such vigilance in the cause of correctly > identifying file encoding. A UTF8 file, naturally, can also look like a > file > in a variety of single-byte encodings regardless of a BOM at the front. > However, a file in a non-UTF8 encoding is highly unlikely to be mistaken > for > a UTF8 file. Therefore, providing an input file is first checked for >UTF8 > encoding, I do not see any significant danger of a mistaken encoding. >I'd > be happy to include recommendations to use a UTF8 BOM and to check for > UTF8 > encoding before any others that we may eventually add to the list. > > I'm curious to see what these files are that you have trouble identifying > as > UTF8, as they may represent obscure corner cases. Any chance you could > dig > one or two up? > > James. 
> On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein > wrote: > Dear James, > > I respect the attempt to compromise, but the sentence "At > present only UTF8 and UTF16 are considered to satisfy this > constraint" is not quite > right without some additional work on the spec. UTF16 with a > BOM is > self-identifying. UTF8 with a BOM is also self-identifying. > However, > UTF8 without a BOM and without some other disambiguator (e.g. > the > accented o's), is _not_ self identifying. I know, because my > students > and I hit this problem all the time in working with > multi-linguage, > multi-code-page message catalogs for RasMol. Sometimes the only > way > we can figure out whether a UTF8 file is really a UTF8 file is > to > start translating the actual strings and see if they make sense. > > Another problem is what the "ASCII range" means to various > people. > I suggest being much more restrictive and saying "the printable > ASCII characters, code points 32-126 plus CR, LF and HT" > > Combined the statment I would suggest > > If a CIF2 text stream contains only characters equivalent to the > printable ASCII characters plus HT, LF and CR, i.e. decimal code > points 32-126, 9, 10 and 13, then to ensure compatibility with > CIF1, the CIF2 specification does not require any explicit > specification of the particular encoding used, but recommends > the use of UTF8. If a CIF2 text stream contains any characters > equivalent to Unicode code points not in that range, then for > any encoding other then UTF8 it is the responsibility of any > application writing such a CIF to unambigously specify the > particular encoding used, preferably within the file itself. > UTF16 with a BOM conforms to this requirement. > > Regards, > Herbert > > ===================================================== > Herbert J. 
Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > > On Thu, 30 Sep 2010, James Hester wrote: > > Here is a newish compromise: > > Encoding: The encoding of CIF2 text streams containing > only code points in the ASCII > range is not specified. CIF2 text streams containing any > code points outside the ASCII > range must be encoded such that the encoding can be > reliably identified from the file > contents. At present only UTF8 and UTF16 are considered > to satisfy this constraint. > > Commentary: this is intended to mean that encoding works > 'as for CIF1' (Proposals 1,2) > for files containing only ASCII text, and works as for > Proposal 4 for any other files. > I believe that this allows legacy workflows to operate > smoothly on CIF2 files (legacy > workflows do not process non ASCII text) but also avoids > the tower of Babel effect that > will ensue if non-ASCII codepoints are encoded using local > conventions. > > To explain the thinking further, perhaps I could take > another stab at Herbert's point of > view in my own words. Herbert (I think correctly) > surmises that all currently used CIF > applications do not explicitly specify the encoding of > their input and output files, and > so therefore are conceptually working with CIFs in a > variety of local encodings. > Mandating any encoding for CIF2 would therefore force at > least some and perhaps most of > these applications to change the way they read and write > text, which is disruptive and > obtuse when the system works fine as it is. Proposals 1 > and 2 are aimed at avoiding > this disruption. 
>
> On the other hand, I look at the same situation and see that all this
> software is in fact reading and writing ASCII, because all of these local
> encodings are actually equivalent to ASCII for characters used in CIFs,
> and I further assert that this happy coincidence between encodings is the
> single reason CIF files are easily transferable between different systems.
>
> These two points of view create two different results if the CIF
> character repertoire is extended beyond the ASCII range. If we allow the
> current approach to encoding to continue, the happy coincidence of
> encodings ceases to operate outside the ASCII range and CIF files are no
> longer easily interchangeable. If we make explicit the commonality of
> CIF1 encodings by mandating a common set of identifiable encodings, the
> use of default encodings has to be abandoned with accompanying effort
> from programmers.
>
> I believe that this latest proposal respects Herbert's concerns as well
> as mine, and is eminently workable as a starting point for going forward.
> I'm now off to do a sample change and expect unanimous support from all
> parties when I return in an hour's time :)
>
> On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon wrote:
>
> I think the crux of the issue is as follows:
>
> [But part of our difficulty is that we are all having separate
> epiphanies, and focusing on five different "cruxes". Clarifying the real
> divergence between our views would be a genuine benefit of a Skype
> conference, to which I have no personal objection.]
>
> In the real world, a need may arise to exchange CIFs constructed in
> non-canonical encodings. ("Canonical" probably means UTF-8 and/or
> UTF-16). Such a need would involve some transcoding strategy.
>
> What is the actual likelihood of that need arising?
>
> I would characterise James's position as "not very, and even less if the
> software written to generate CIFs is constrained to use canonical
> encodings within the standard".
>
> I would characterise the position of the rest of us as "reasonable to
> high, so that we wish to formulate the standard in a way that recognises
> non-canonical encodings and helps to establish or at least inform
> appropriate transcoding strategies". There appear to be strong
> disagreements among us, but in fact there's a lot of common ground, and a
> drafting exercise would probably move us towards a consensus.
>
> Do you agree that that is a fair assessment?
>
> If so, we can analyse further: what are the implications of mandating a
> canonical encoding or not if judgement (a) is wrong and if judgement (b)
> is wrong? My feeling is that the world will not end - or even change very
> much - in any case; but it could determine whether we need to formulate
> an optimal transcoding strategy now, or can defer it to a later date.
>
> However, if anyone thinks this is just another diversion, I'll drop this
> line of approach so as not to slow things down even more.
>
> Regards
> Brian
>
> On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote:
> > John,
> >
> > Now I am totally confused about what you are proposing and agree with
> > Simon that what is needed is for you to state your proposal as the
> > precise wording that you propose to insert and/or change in the current
> > CIF2 change document "5 July 2010: draft of changes to the existing
> > CIF 1.1 specification for public discussion"
> >
> > If I understand your proposal correctly, the _only_ thing you are
> > proposing that differs in any way from my proposed motion is a mandate
> > that a CIF2 conformant reader must be able to read a UTF8 CIF2 file,
> > but that _no_ CIF application would actually be required to provide
> > such code, provided there was some mechanism available to transcode
> > from UTF8 to the local encoding, which does not seem to be a mandate on
> > the conformant CIF2 reader at all, but a requirement for the provision
> > of a portable utility to do that external transcoding.
> >
> > If that is the case, wouldn't it make more sense to just provide that
> > utility than to argue about whether my motion requires somebody to
> > write their own? Having the utility in hand would avoid having
> > multiple, conflicting interpretations of this input transcoding
> > requirement.
> >
> > If I have read your message correctly, please just write the utility
> > you are proposing. If I have read your message incorrectly, please
> > write the specification changes you propose for the draft changes in
> > place of the changes in my motion.
> >
> > _This_ is why it was, is, and will remain a good idea to simply have a
> > meeting and talk these things out.
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100930/1d643b85/attachment-0001.html

From yaya at bernstein-plus-sons.com Thu Sep 30 10:40:25 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Thu, 30 Sep 2010 05:40:25 -0400 (EDT)
Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010
In-Reply-To: <20100930084028.GC9485@emerald.iucr.org>
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org>
Message-ID:

OK

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

On Thu, 30 Sep 2010, Brian McMahon wrote:

> Dear Herbert
>
> I won't be in the office at the time of the start of this conference.
> If I get back at a reasonable time, I'll email you with a request to be
> patched in.
>
> Regards
> Brian
>
> On Wed, Sep 29, 2010 at 10:16:24AM -0400, Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>> James and I are going to try a Skype conference call at 10:45 pm
>> (22:45) AEST, 8:45 am EDT on Thursday, 30 September 2010. All are welcome
>> to join in.
>>
>> If you wish to join, please email me your skype ID today. I will
>> originate the call, which will be voice only (a skype constraint on
>> conference calls). My skype id is yayahjb. I'll keep my email open
>> during the call, so if you just manage to get into Skype last minute
>> either send me a Skype chat and/or an email with your skype id and
>> I'll try to add you to the conference call. If everything else fails,
>> the land line I will be at is 1-631-286-1339, but that is just to
>> try to coordinate things. I cannot cross-connect the landline to the
>> skype conference.
>>
>> The call will have to end by 9:45 am EDT so I can go teach a class.
>>
>> Regards,
>> Herbert
>>
>> For those in England, 8:45 am EDT is 13:45 BST (12:45 GMT)
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>> Dowling College, Kramer Science Center, KSC 121
>> Idle Hour Blvd, Oakdale, NY, 11769
>>
>> +1-631-244-3035
>> yaya at dowling.edu
>> =====================================================

From simonwestrip at btinternet.com Thu Sep 30 13:21:40 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Thu, 30 Sep 2010 12:21:40 +0000 (GMT)
Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010
In-Reply-To:
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org>
Message-ID: <629785.55688.qm@web87004.mail.ird.yahoo.com>

Dear Herbert - just to confirm, I cannot join the Skype call

Cheers

Simon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100930/eec267a3/attachment.html

From yaya at bernstein-plus-sons.com Thu Sep 30 13:29:39 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Thu, 30 Sep 2010 08:29:39 -0400
Subject: [Cif2-encoding] Skype conference call 8:45 am EDT, Thursday 30 September 2010
In-Reply-To: <629785.55688.qm@web87004.mail.ird.yahoo.com>
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com>
Message-ID:

Looks like it will just start with James and me and maybe gain Brian
later. Does anyone else wish to be included?

At 12:21 PM +0000 9/30/10, SIMON WESTRIP wrote:
>Dear Herbert - just to confirm, I cannot join the skype call
>
>Cheers
>
>Simon

--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From yaya at bernstein-plus-sons.com Thu Sep 30 14:40:21 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Thu, 30 Sep 2010 09:40:21 -0400
Subject: [Cif2-encoding] Revised Motion
In-Reply-To:
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com>
Message-ID:

Dear Colleagues,

James and I had a good e-meeting and came up with the following revised
wording. If anybody objects to this motion, please speak up now.
We intend to bring this to the DDLm group and then COMCIFS and add
annexes on particular encoding-disambiguation algorithms and signatures
later.

Regards,
Herbert

===============================================================

Proposed position on CIF2 character encodings submitted to COMCIFS for a
vote as an interim agreement on what can be agreed thus far, subject to
extension and refinement in the future.

===============================================================

Reference to character(s) means abstract characters assigned code points
by Unicode. Specific characters are referenced according to Unicode
convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit
hexadecimal representation of the assigned code point. The designated
character encoding for CIF2 is UTF-8 as the preferred concrete
representation of the information in a CIF2 document.

Reference to ASCII characters means characters U+0000 through U+007F,
or, equivalently the first 128 characters of the ISO-8859-1 (LATIN-1)
character set.

Reference to newline or \n means the sequence that conventionally
terminates a line record (which is environment dependent).

Reference to whitespace means the characters ASCII space (U+0020), ASCII
horizontal tab (U+0009) and the newline characters. Without regard to
local convention, the various other characters that Unicode classifies
as whitespace (character categories Zs and Zp) do not constitute
whitespace for the purposes of CIF2.

CIF2 files are standard variable length text files, which for
compatibility with older processing systems will have a maximum line
length of 2048 characters. As discussed above and below, however, there
are some restrictions on the character set for token delimiters,
separators and data names.

References to Unicode and UTF-8 are specifically to identify characters
and a concrete representation of those characters in an established and
widely available standard. It is understood that CIF2 documents may be
constructed and maintained on computers that implement other character
encodings. However, for maximum portability only the clearly identified
equivalents to the Unicode characters identified above and below should
be used and use of UTF-8 for a concrete representation is highly
recommended.

If a CIF2 file contains characters equivalent to Unicode code points
greater than U+007E (126 decimal), then the particular encoding used
must either be UTF8 or be algorithmically identifiable from the CIF2
file itself. UTF16 with a BOM conforms to this requirement. The use of a
BOM for Unicode encodings including UTF8 is recommended. Acceptable
identification algorithms will be published as necessary as annexes to
this standard (see discussion of magic code and encoding-disambiguation
below).

A CIF2 file is uniquely identified by a required magic code at the
beginning of its first line. The code is #\#CIF_2.0 followed immediately
by whitespace. The immediately following space on this line is reserved
for encoding-disambiguation signatures (see above). If there is a BOM
the magic code should follow the BOM.

In keeping with XML restrictions we allow the characters

  U+0009 U+000A U+000D
  U+0020 -- U+007E
  U+00A0 -- U+D7FF
  U+E000 -- U+FDCF
  U+FDF0 -- U+FFFD
  U+10000 -- U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x
is any hexadecimal digit are disallowed. Unicode reserves the code
points E000 - F8FF for private use. The IUCr and only the IUCr may
specify what characters are assigned to these code points in the context
of CIF2.

CIF2 processors are required to treat <U+000A>, <U+000D> and
<U+000D U+000A> as newline characters, by normalising them to <U+000A>
on read. No other characters or character sequences may represent
newline. In particular, CIF2 processors should not interpret the Unicode
characters U+2028 (line separator) or U+2029 (paragraph separator) as
newline.

--
=====================================================
Herbert J. 
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From John.Bollinger at STJUDE.ORG Thu Sep 30 15:01:47 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Thu, 30 Sep 2010 09:01:47 -0500
Subject: [Cif2-encoding] Revised Motion
In-Reply-To:
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local>

On Thursday, September 30, 2010 8:40 AM, Herbert J. Bernstein wrote:

> James and I had a good e-meeting and came up with the following
> revised wording. If anybody objects to this motion, please speak
> up now.

With apologies, I object. This proposal has exactly the same problem
that options (1) and (2) did: it does not define "text file". It is
worse in this case, however, because the problem cannot be fixed merely
by adding Herbert's definition (or mine). In most environments that
definition does not encompass UTF-8 encoded text containing non-ASCII
characters, so the recommendation to use UTF-8 implies some other,
ill-defined definition.

I am quite surprised that the result presented is so different from
James's recent compromise proposal, which seemed poised to serve as the
basis for a consensus result. Perhaps a viable solution would be to
include a definition of "text file" derived from that proposal.

Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From John.Bollinger at STJUDE.ORG Thu Sep 30 15:21:00 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Thu, 30 Sep 2010 09:21:00 -0500
Subject: [Cif2-encoding] Revised Motion
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local>
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDF1@SJMEMXMBS11.stjude.sjcrh.local>

On Thursday, September 30, 2010 9:02 AM, I wrote:

>Perhaps a viable solution would be to include a definition of "text
>file" derived from that proposal.

Specifically, that might look something like this:

Reference to text files means binary representations of sequences of
characters, either in a system-dependent form, provided that the
characters are all drawn from the ASCII set, or alternatively as the
sequence of bytes resulting from encoding the character sequence
according to UTF-8.

Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From John.Bollinger at STJUDE.ORG Thu Sep 30 15:23:18 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Thu, 30 Sep 2010 09:23:18 -0500
Subject: [Cif2-encoding] Revised Motion
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDF1@SJMEMXMBS11.stjude.sjcrh.local>
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF1@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDF2@SJMEMXMBS11.stjude.sjcrh.local>

On Thursday, September 30, 2010 9:21 AM, I wrote:

>Perhaps a viable solution would be to include a definition of "text
>file" derived from that proposal.
>
>Specifically, that might look something like this:
>
>Reference to text files means binary representations of sequences of
>characters, either in a system-dependent form, provided that the
>characters are all drawn from the ASCII set, or alternatively as the
>sequence of bytes resulting from encoding the character sequence
>according to UTF-8.

Add UTF-16 in an analogous way, if desired.

Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. 
Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From simonwestrip at btinternet.com Thu Sep 30 16:13:56 2010
From: simonwestrip at btinternet.com (SIMON WESTRIP)
Date: Thu, 30 Sep 2010 08:13:56 -0700 (PDT)
Subject: [Cif2-encoding] Revised Motion
In-Reply-To:
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com>
Message-ID: <644408.5178.qm@web87003.mail.ird.yahoo.com>

I do not object :-)

Cheers

Simon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100930/44079491/attachment-0001.html

From yaya at bernstein-plus-sons.com Thu Sep 30 19:04:37 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Thu, 30 Sep 2010 14:04:37 -0400 (EDT)
Subject: [Cif2-encoding] Revised Motion
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local>
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

Dear John,

It appears you are proposing to add the words

"Reference to text files means binary representations of sequences of
characters, either in a system-dependent form, provided that the
characters are all drawn from the ASCII set, or alternatively as the
sequence of bytes resulting from encoding the character sequence
according to UTF-8."

This is, unfortunately, inaccurate and confusing and gets us back into
the looping discussion of binary versus text. It opens up exactly the
issues we just tried to get away from, of making it appear that CIF2 is
going to invalidate encodings that happen to be neither ASCII nor UTF8.
I realize that is not what you intend, but that is what your paragraph
seems to imply.

This is not an easy concept to define. I just went through a large
number of text file definitions on the web, and it is amazing how flawed
they are in one way or another. For example, wordiq says, "Text files
(plain text files) are files with generally a one-to-one correspondence
between the bytes and ordinary readable characters such as letters and
digits," but that definition fails to count a UTF8 file as a text file,
because UTF8 maps multiple bytes to a single readable character, and
multiple, very different byte sequences can all map to the same readable
character.
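The point about UTF8 and the "one-to-one" definition can be checked directly. The following is only a minimal illustrative sketch in Python 3, not part of the motion or of any CIF specification; the helper name utf8_byte_lengths is hypothetical, and reading "very different byte sequences" as precomposed versus combining forms of the same letter is an assumption about one case that fits the description.

```python
import unicodedata

def utf8_byte_lengths(text: str) -> list:
    """Return the number of UTF-8 bytes used by each character in text."""
    return [len(ch.encode("utf-8")) for ch in text]

# An ASCII letter, an accented o and the euro sign need different
# numbers of bytes, so bytes and characters are not one-to-one.
lengths = utf8_byte_lengths("a\u00f6\u20ac")

# Two different code point sequences that display as the same accented o.
precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
combining = "o\u0308"    # 'o' followed by COMBINING DIAERESIS

# Their UTF-8 byte sequences differ, yet Unicode normalisation (NFC)
# maps the combining form onto the precomposed one.
bytes_differ = precomposed.encode("utf-8") != combining.encode("utf-8")
same_after_nfc = unicodedata.normalize("NFC", combining) == precomposed
```

Whether a CIF2 reader should normalise such sequences is a separate question that this sketch does not try to answer.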
The W3C definition is even more vague than the CIF non-definition:

"The text Content-Type is intended for sending material which is principally textual in form. It is the default Content-Type. A "charset" parameter may be used to indicate the character set of the body text. The primary subtype of text is "plain". This indicates plain (unformatted) text. The default Content-Type for Internet mail is "text/plain; charset=us-ascii".

Beyond plain text, there are many formats for representing what might be known as "extended text" -- text with embedded formatting and presentation information. An interesting characteristic of many such representations is that they are to some extent readable even without the software that interprets them. It is useful, then, to distinguish them, at the highest level, from such unreadable data as images, audio, or text represented in an unreadable form. In the absence of appropriate interpretation software, it is reasonable to show subtypes of text to the user, while it is not reasonable to do so with most nontextual data. Such formatted textual data should be represented using subtypes of text. Plausible subtypes of text are typically given by the common name of the representation format, e.g., "text/richtext"."

Coming to an acceptable formal resolution on the meaning of "text" would seem likely to take a very, very long time.

We need to move on.

Please recall that what we are discussing is a revision to the existing, larger CIF 1.1 syntax definition to create the CIF2 syntax definition, and are just trying to get a clear enough definition of what users and software developers need to do to cope with the extension of the number of code points past 126.
I would suggest that we go forward with the motion as it stands now and that we all carefully read the CIF 1.1 syntax definition to see if and where it might make sense to insert some clear, agreed definition of a text file at some future time, but I really don't think most users or software developers will have a serious problem in getting started with CIF2 if, with this motion added, we leave any ambiguity about the concept of a text file at the same level it has been under CIF1.

Once we have a clear, agreed understanding of the more metaphysical aspects of what text is, we can then share that with the community. Meanwhile, they hopefully will already be using CIF2.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya at dowling.edu
=====================================================

On Thu, 30 Sep 2010, Bollinger, John C wrote:

>
> On Thursday, September 30, 2010 8:40 AM, Herbert J. Bernstein wrote:
>> James and I had a good e-meeting and came up with the following
>> revised wording. If anybody objects to this motion, please speak
>> up now.
>
> With apologies, I object. This proposal has exactly the same problem
> that options (1) and (2) did: it does not define "text file". It is
> worse in this case, however, because the problem cannot be fixed merely
> by adding Herbert's definition (or mine). In most environments that
> definition does not encompass UTF-8 encoded text containing non-ASCII
> characters, so the recommendation to use UTF-8 implies some other,
> ill-defined definition.
>
> I am quite surprised that the result presented is so different from
> James's recent compromise proposal, which seemed poised to serve as the
> basis for a consensus result. Perhaps a viable solution would be to
> include a definition of "text file" derived from that proposal.
>
>
> Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>

From John.Bollinger at STJUDE.ORG Thu Sep 30 23:12:59 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Thu, 30 Sep 2010 17:12:59 -0500
Subject: [Cif2-encoding] Revised Motion
In-Reply-To:
References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFA@SJMEMXMBS11.stjude.sjcrh.local>

Dear Herbert,

On Thursday, September 30, 2010 1:05 PM, Herbert J. Bernstein wrote:
>It appears you are proposing to add the words
>
>"Reference to text files means binary representations of sequences of
>characters, either in a system-dependent form, provided that the
>characters are all drawn from the ASCII set, or alternatively as the
>sequence of bytes resulting from encoding the character sequence according
>to UTF-8."

Yes. I am open to variations on the wording, but I'm looking for something along those lines to be added to the spec. Am I wrong that yesterday we were close to doing just that via James's proposal?

>Is, unfortunately, inaccurate and confusing and gets us back into the
>looping discussion of binary versus text. It opens up exactly the
>issues we just tried to get away from of making it appear that
>CIF2 is going to invalidate encodings that happen to be neither
>ASCII nor UTF8.
>I realize that is not what you intend, but that
>is what your paragraph seems to imply.

I can accept that the wording may be confusing, and I would welcome constructive criticism on that topic. You cannot sustain a claim that my text is inaccurate, however, without providing at least a partial alternative definition that conflicts. In other words, what's inaccurate about it? This is a highly relevant question, because I find my text to be entirely reasonable, and I might well program according to that interpretation without some guidance otherwise. If I don't use that, then what *do* I use?

As for binary vs. text, I have realized that's a false dichotomy in our context. Every computer file is binary, in the sense that it is a sequence of bytes. Some are a _particular type_ of binary that we call "text" (but can't seem to define). The two are not mutually exclusive. This is quite different from the traditional binary vs. text issue, which relates to questions such as whether to represent the number 12345 in IEEE 32-bit floating-point format or as five decimal digits.

>This is not an easy concept to define. I just went through a large
>number of text file definitions on the web, and it is amazing how
>flawed they are in one way or another.

That is precisely why I am so persistent about putting a definition in the spec. If I choose the definition I think best, and you the one you think best, and James and Simon likewise, then will any of our programs be fully compatible with each other? Simon likes identifiable encodings, so maybe he'll feel free to write UTF-32LE CIFs. Will your programs accept those? Should they? To be prepared to process all conformant CIFs, does my program need to be able to handle KOI8-R and Shift-JIS CIFs? If I use MS Word to create CIFs, and I save them in Rich *Text* Format, then should I be upset when James's software rejects them?

I don't think it's correct to say that the concept is difficult to define.
I could write half a dozen definitions in as many minutes, each appropriate for some particular purpose. It's more accurate to say that there are many alternative definitions in use, none of them completely compatible with the others. There is no reason why we can't choose the one we find most suitable, or write one of our own.

[...]

>Coming to an acceptable formal resolution on the meaning of "text" would seem likely to take a very, very long time.

You already provided a definition that was good enough for me. My proposed text summarizes and abridges it, perhaps too much, as "a system-dependent form". I would be content to replace that phrase with your full text, or with the functionally equivalent text I labeled "local".

> We need to move on.

We need to answer the question. Or COMCIFS does, if we're not up to it. The spec is incomplete and inadequate without an answer.

>Please recall that what we are discussing is a revision to the existing,
>larger CIF 1.1 syntax definition to create the CIF2 syntax definition,
>and are just trying to get a clear enough definition of what users and
>software developers need to do to cope with the extension of the
>number of code points past 126.

And what definition are we then providing? The only clear thing I see is that users and developers are *probably* safe if they write UTF-8. If UTF-8 is the only safe option for CIFs with non-ASCII characters, then how does that differ from my proposal?

>I would suggest that we go forward with the motion as it stands now
>and that we all carefully read CIF 1.1 syntax definition to see if
>and where it might make sense to insert some clear, agreed definition of
>a text file at some future time, but I really don't think most users or
>software developers will have a serious problem in getting started with
>CIF2 leaving any ambiguity about the concept of a text file at the same
>level it has been under CIF1 with this motion added.
This area presents a much greater problem for CIF2 and its expanded character set than it did for CIF1. I quite agree that most users and software developers would get started with CIF2 despite the ambiguity. I cannot see how we would then avoid a slew of problems of the form "X software doesn't handle my CIF" and "Y software produces broken CIFs" and "Z software is incompatible with W software". I do not see how that could be construed as a win for CIF2.

>Once we have a clear, agreed understanding of the more metaphysical
>aspects of what text is, we can then share that with the
>community. Meanwhile, they hopefully will already be using CIF2.

This is not an arcane subject that we need to "understand", it is a question that we have the opportunity to *answer* for the purposes of CIF. We do not need a definition that everyone, everywhere will acclaim as the full and perfect meaning of "text". We just need to be clear what the specification means by the term. If we don't know what the specification means by the term, then we should be embarrassed to advance it.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer
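
To show what programming to the proposed wording might look like, here is a minimal Python sketch of a reader that accepts a file either as UTF-8 or as ASCII-only characters in a system-dependent ("local") encoding. It is an editorial illustration, not code from the thread. The function name `decode_text_file`, the `local_encoding` parameter, and the choice to try UTF-8 first are all assumptions; the proposed definition does not say how a reader should choose between the two alternatives.

```python
def decode_text_file(data: bytes, local_encoding: str) -> str:
    """Decode bytes under one reading of the proposed 'text file' definition.

    Accepts either:
      * any valid UTF-8 byte sequence, or
      * bytes in the local encoding, provided every decoded character is ASCII.
    Raises ValueError otherwise.
    """
    # Alternative 1: the byte sequence is valid UTF-8 (ASCII is a subset).
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Alternative 2: system-dependent form, restricted to the ASCII character set.
    text = data.decode(local_encoding)  # may itself raise UnicodeDecodeError
    if not text.isascii():
        raise ValueError("non-ASCII characters are only accepted as UTF-8")
    return text


# UTF-8 with a non-ASCII character is accepted.
print(decode_text_file("caf\u00e9".encode("utf-8"), "latin-1"))

# ASCII-only content in EBCDIC (code page 037) is accepted as local text.
print(decode_text_file("data_x".encode("cp037"), "cp037"))

# Latin-1 with a non-ASCII character is rejected under this reading.
try:
    decode_text_file("caf\u00e9".encode("latin-1"), "latin-1")
except ValueError as exc:
    print("rejected:", exc)
```

A reader written this way rejects, for example, a Latin-1 or KOI8-R file containing accented or Cyrillic characters. That is the kind of behaviour the specification would need to either sanction or rule out explicitly.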