From John.Bollinger at STJUDE.ORG Tue Aug 3 15:58:10 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 3 Aug 2010 09:58:10 -0500 Subject: [Cif2-encoding] The discussion so far Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local>

Thanks, Brian, for creating this list.

Since no one else has had the combination of time, energy, and inclination to do so, I'll open with a summary of the state of the CIF 2.0 character encoding discussion so far, as it currently stands at its original location on the DDLm group. Specifically, the previous summary and the discussion that proceeded from it can be found at http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00744.html and http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00690.html, though some of the most recent messages seem not to be available at present via the web interface.

The controversy derives from CIF 2.0's expansion of the character set to all of Unicode. It is magnified by CIF 1.0's explicit self-description as an encoding-independent text format, and by the accumulated body of CIF software and author practices that rely on that text orientation. There has been considerable debate about what it would mean for CIF 2.0 to be a text format vs. a binary format, and the relative advantages and disadvantages of each. Among the points covered were:

1) A 'text' format implies that CIF content may comply with local, locale-specific conventions for electronic text representation, including details such as line termination conventions and, especially, character encoding. Such files are suitable input for general-purpose text tools such as text editors, text extraction utilities, and text indexers. Alternatively, a conformant text CIF might be expressed according to some other convention suitable for a particular application or foreign environment. Because conventions differ, correctly archiving text or moving it between environments requires accounting for the text conventions in use, and may involve conversions such as line terminator changes and transcoding. This is the CIF 1.0 position, though CIF1's restricted character set significantly reduces the impact of character encoding considerations relative to CIF2.

2) A 'binary' format is anything else, but in this context, the key characteristic of binary-CIF2 proposals is that they add to the text specification a specification for text serialization to byte-oriented media, such as disks and networks. In particular, one strongly advocated position in the CIF 2.0 standardization discussion is that CIF 2.0 should require serialization of the underlying CIF text according to the UTF-8 character encoding scheme. This would be a text-like binary format, in that some text tools can handle UTF-8 encoded text (sometimes requiring a little persuasion), and therefore could be used to read, modify, and write binary-CIF2 files.

3) The many specific issues and arguments that have been raised mostly fall into one or both of two general areas:

3a) reliability, by which we mean that a CIF consumer should have justifiably high confidence that he is interpreting CIF data in the way the CIF producer intended, and
3b) usability, by which we mean that human authors, and to a lesser extent, software, should be able to manipulate CIF2 files as they are accustomed to manipulating CIF1 files, e.g. using the default configurations of their systems' text editors. In addition to general usability, arguments in this category include some appealing to respect for scientists of many nationalities, and similar ones appealing to freedom / liberty.

Furthermore, a few of the points have appealed to

3c) practicality, by which we mean that CIF stakeholders should be able to use the specification effectively. This aspect is subordinate to reliability and usability, but it cuts across both. For the most part, it relates to the ability to develop software and practices that address the likely real-world usage (and misusage) of the standard.

4) The group as a whole appears to have agreed that UTF-8 is a highly suitable encoding for CIF2. It can encode the entire Unicode code space, it is a superset of ASCII, it can be recognized heuristically with low probability of error, and it is widely implemented. These characteristics yield high reliability at the cost of some usability. The debate is not about whether UTF-8 should be used, but rather about whether the standard should forbid use of other encodings. Essentially, this is a recasting of the text vs. binary debate.

5) Inasmuch as consensus on the issues described above has not yet been reached and does not appear likely, the group has issued a call for comments from a wider group of stakeholders. No results of that call have yet been reported back to the group.

6) In the interim, the discussion has moved toward finding middle ground. In particular, James Hester asked:

>If we consider CIF as text as the overriding priority:
>
>1. How do we then make exchanging and storing files according to text conventions sufficiently reliable for the purposes of CIF? How far are we prepared to compromise?
>
>If we consider reliable exchange of information as the top priority:
>
>2. How do we then make CIFs sufficiently accessible to text-based tools? How far are we prepared to compromise?

A short series of proposed schemes for CIF exchange and storage proceeded from that call:

7) In response to question (2), James offered a scheme (A) that primarily would relax the explicit specification of UTF-8 into a set of characteristics that a CIF encoding would need to satisfy. His characteristics would need to be further pared down or relaxed to permit, in practice, encodings other than UTF-8. This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

8) In response to question (1), John Bollinger offered a scheme (B) that retains the text character of CIF2, and relies on labeling when wanted or needed to convey text conventions, and on hashing to provide verification and reliability. This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

A limited amount of additional discussion proceeded from these proposed exchange and archiving schemes, and that's where we currently stand.

Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital ________________________________ Email Disclaimer: www.stjude.org/emaildisclaimer -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100803/b9979deb/attachment.html

From John.Bollinger at STJUDE.ORG Tue Aug 3 16:31:20 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Tue, 3 Aug 2010 10:31:20 -0500 Subject: [Cif2-encoding] The discussion so far. .
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DED7C@SJMEMXMBS11.stjude.sjcrh.local>

On Tuesday, August 03, 2010 9:58 AM, John Bollinger wrote:
>7) In response to question (2), James offered a scheme (A) [...] This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

Scheme A:
1. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.
2. Any encoding specifying the mapping from bytes to Unicode code points may be used, provided that:
   1. The encoding is specified in an international standard.
   2. The encoding is distinguishable from all other standard encodings at the binary level. This requirement may be satisfied by an initial 'signature', provided that this signature is specified in the relevant international standard as being mandatory.
   3. The encoding is supported across the range of platforms likely to use CIF2. "Support" includes:
      1. Availability of text input and output functions using this encoding across a range of programming languages.
      2. Availability of applications on the platform to manipulate text in this encoding, most importantly text editors but also tools such as search.
   4. The encoding is coincident with US-ASCII for codepoints <= 127. This requirement may be dropped in the future if CIF2 becomes the dominant form of CIF file.

>8) In response to question (1), John Bollinger offered a scheme (B) [...] This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

Note: This is my amended form of scheme B, not the original.
Note 2: James's alternative amended version of scheme B (dubbed "Scheme C") is not presented, as, for reasons explained elsewhere, it does not meet scheme B's objectives.

Scheme B':
1. This scheme provides for reliable archiving and exchange of CIF text. Although it depends in some cases on metadata embedded in the CIF text, presence of such metadata is not a well-formedness constraint on the text itself.
2. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.
3. Any text encoding may be used. If the encoding does not comply with either (5a) or (5b) below, then its name must be given via an encoding tag following the magic code, on the same line. Otherwise, an encoding tag is optional, but if present then it must correctly name the encoding.
4. Archiving or exchange of CIF text complies with this scheme if the CIF text contains a correct content hash:
   a. The hash value is computed by applying the MD5 algorithm to the Unicode code point values of the CIF text, in the order they appear, excluding all code points of CIF comments and all other CIF whitespace appearing outside data values or separating List or Table elements.
   b. The code point stream is converted to a byte stream for input to the hash function by interpreting each code point as a 24-bit integer, appearing on the byte stream in order from most-significant to least-significant byte.
   c. The hash value is expressed in the CIF itself as a structured comment of the form:
      #\#content_hash_md5:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      where the Xs represent the hexadecimal digits of the computed hash value.
   d. The hash comment may appear anywhere in the CIF that a comment may appear, but conventionally it is at the end of the CIF.
5. Archiving or exchange of CIF text that does not contain a content hash complies with this scheme if
   a. the text encoding is specified in an international standard and is distinguishable from all other encodings at the binary level, or
   b. the text encoding is coincident with US-ASCII for all code points appearing in the CIF.
For the purposes of (5a), distinguishing encodings may rely on the characteristics of CIF, such as the allowed character set and the required CIF version comment, and also on the actual CIF text (such as for recognition of UTF-8 by its encoding of non-ASCII characters).

Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital ________________________________ Email Disclaimer: www.stjude.org/emaildisclaimer -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100803/65a1687d/attachment-0001.html

From jamesrhester at gmail.com Thu Aug 5 02:44:13 2010 From: jamesrhester at gmail.com (James Hester) Date: Thu, 5 Aug 2010 11:44:13 +1000 Subject: [Cif2-encoding] The discussion so far In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local> References: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local> Message-ID:

Thanks John for this summary. I think it is a fair description of the current state of our deliberations. I am currently drafting a response to your previous email which I hope to be able to send fairly soon.

-- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100805/8ff3741d/attachment.html
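[Before the detailed discussion that follows, it may help to make the content hash of Scheme B' point 4 concrete. The Python fragment below is purely illustrative and is not part of either scheme: it assumes that a CIF-aware tool has already extracted the relevant code points per 4(a) (that is, with comments and the excluded whitespace removed) and supplies them as an ordinary string, and the function names are invented for the example. The scheme as stated does not fix the letter case of the hexadecimal digits.]

    import hashlib

    def content_hash_md5(code_points):
        # Scheme B' 4(a)-(b): hash the Unicode code point values in order,
        # each written to the digest as a 24-bit (3-byte) big-endian integer.
        md5 = hashlib.md5()
        for ch in code_points:
            md5.update(ord(ch).to_bytes(3, byteorder='big'))
        return md5.hexdigest()

    def hash_comment(code_points):
        # Scheme B' 4(c): express the hash as a structured CIF comment.
        # The backslash is escaped so the string renders as "#\#content_hash_md5:...".
        return "#\\#content_hash_md5:" + content_hash_md5(code_points)

[A receiver would recompute the same digest from its own decoding of the file and compare it with the value in the comment; a mismatch indicates a wrong encoding guess or corruption in transit.]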
From jamesrhester at gmail.com Tue Aug 10 04:20:10 2010 From: jamesrhester at gmail.com (James Hester) Date: Tue, 10 Aug 2010 13:20:10 +1000 Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .. .. .. .. . In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> References: <20100623103310.GD15883@emerald.iucr.org> <381469.52475.qm@web87004.mail.ird.yahoo.com> <984921.99613.qm@web87011.mail.ird.yahoo.com> <826180.50656.qm@web87010.mail.ird.yahoo.com> <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> Message-ID:

I had not fully appreciated that Scheme B is intended to be applied only at the moment of transfer or archiving, and envisions users normally saving files in their preferred encoding with no hash codes or encoding hints required (I will call the inclusion of such hints and hashes 'decoration'). A direct result of allowing undecorated files to reside on disk is that CIF software producers will need to write software that will function with arbitrary encodings with no decoration to help them, as that is the form that users' files will most often be in.
Furthermore, given the ease with which files can be transferred between users (email attachment, saved in shared, network-mounted directory, drag and drop onto USB stick etc.) it is unlikely that Scheme B or anything involving extra effort would be applied unless the recipient demanded it. And given how many times that file might have changed hands across borders and operating systems within a single group collaboration, there would only be a qualified guarantee that the character to binary mapping has not been mangled en route, making any scheme applied subsequently rather pointless. We would thus go from a situation where we had a single, reliable and sometimes slightly inconvenient encoding (UTF8), to one where a CIF processor should be prepared for any given CIF file to be one of a wide range of encodings which need to be guessed. This CIF processor is thus forced to autodetect (possibly unreliably) or interact with the user. This looks a lot like a step backward to me. I would much prefer a scheme which did not compromise reliability in such a significant way. My previous (somewhat clunky) attempts to adjust Scheme B were directed at trying to force any file with the CIF2.0 magic number to be either decorated or UTF-8, meaning that software has a reasonably high confidence in file integrity. An alternative way of thinking about this is that CIF files also act as the mechanism of information transfer between software programs. This may be less pronounced in mmCIF environments, but in small molecule work a large number of programs work with CIFs and pass structural information around using CIF files. Therefore, the act of writing a CIF file is essentially already an act of information transfer: when a separate program is asked to input that CIF, the information has been transferred, even if that software is running on the same computer. Now, moving on to the detailed contours of Scheme B and addressing the particular points that John and I have been discussing. My original criticisms are the ones preceded by numerals. >(1) Information will not be correctly transferred if the hasher uses the > wrong encoding for calculating the hash, and the recipient uses the same > wrong encoding. The recipient is likely to use the encoding suggested by > the creator, so the probability of this type of failure occurring is > essentially the probability of the CIF writer instructing the hash > calculator to use the wrong encoding. Other mistakes by the CIF writer > (forgetting to add a hash, leaving an old hash in the file) are likely to > simply result in rejection, which I don't see as a failure. > > This is a valid criticism, but in practice I think it can be significantly > mitigated by good design of the hashing program (such as reliance by default > on the environmental default encoding, and detecting probable encoding > mismatches). In the case of a CIF-specific editor, there is no need for any > separate step and no chance of encoding mismatch. James has an additional > suggestion in his (ii) below. > > >(2) In order to read the hash value, the encoding of the file needs to be > known (!) > > Yes and no. In many cases, either the encoding can be determined from the > content (even without a correct encoding tag) or it can be determined well > enough to parse the file to find the hash (most ASCII supersets). > Nevertheless, something along the lines of James's (ii) below can do > better. 
> If we restrict the allowed encodings to those for which the ASCII codepoints can be autodetected assuming CIF2 layout (in particular the first line) I think that would be sufficiently robust. > > >(3) The recipient doesn't know if a hash value is present until they have > parsed the entire file > > This is correct. The recipient also cannot *use* the hash without parsing > the entire file, however, so it doesn't make a lot of difference. > Nevertheless, it would be possible to provide a hint at the beginning of > the file, so that parsers that wanted to avoid the overhead of the hash > computation could do so. > The point of having the hash at the front is so that a parsing program can immediately reject an undecorated, non UTF-8 file, or alternatively branch based on how reliable the encoding hint is thought to be. For example, if a hash is present, there is a somewhat stronger guarantee that the encoding hint has been checked or detected by a program rather than manually inserted. > > >(4) Assumption that all recipients will be able to handle all encodings > > There is no such assumption. Rather, there is an acknowledgement that some > systems may be unable to handle some CIFs. That is already the case with > CIF1, and it is not completely resolved by standardizing on UTF-8 (i.e. > scheme A). > There is no such thing as 'optional' for an information interchange standard. A file that conforms to the standard must be readable by parsers written according to the standard. If reading a standard-conformant file might fail or (worse) the file might be misinterpreted, information cannot always reliably be exchanged using this standard, so that optional behaviour needs to be either discarded, or made mandatory. There is thus no point in including optional behaviour in the standard. So: if the standard allows files to be written in encoding XYZ, then all readers should be able to read files written in encoding XYZ. I view the CIF1 stance of allowing any encoding as a mistake, but a benign one, as in the case of CIF1 ASCII was so entrenched that it was the defacto standard for the characters appearing in CIF1 files. In short, we have to specify a limited set of acceptable encodings. > > (5) Potential for intermediate files to be lying around the users' system > which are neither CIF2/UTF-8 nor CIF2/hashed but are in some sense CIF2 > files. > > This is intentional. Scheme B provides for reliable exchange and > archiving; it is not intended to be an integral part of the CIF format. It > would serve more as a gateway protocol, used when people transmit CIF text > or deposit it in a local archive. For all other purposes, there is no need > to make users decorate their CIFs with hashes, nor to prevent them from > treating CIFs as ordinary text files, complying with local conventions. Or > at any rate, that's the direction from which the scheme is proposed. > > I have addressed this above. > >A strong point: > >(6): user must run a CIF-aware program to produce the hash value, so there > is an opportunity to hide complexity inside the program (or just convert to > UTF-8...) > > Just converting to UTF-8 for archiving and exchange would be scheme D. Or > perhaps scheme 0, as it has come up before in a couple of different forms. > It is distinct from scheme A in that it applies only to archiving and > exchange, not generally to the CIF format. 
Note that scheme B does not > require UTF-8 (or UTF-16 or UTF-32) CIFs to carry a hash, so converting to > UTF-8 for storage and exchange in fact is a special case of scheme B. > Indeed. > > >We can reduce the likelihood of (1) by producing interactive CIF-hash > calculators that present the file text to the user in the nominated encoding > scheme for checking before the hash is calculated, with intelligent choice > of file contents to find non-ASCII code points. > > Indeed so. > > >We can reduce the impact of the remaining issues with the following > adjusted Scheme B (Scheme C). I would find something like Scheme C > acceptable. Relevant changes: > > > >(i) mandate putting the hash comment (if necessary) on the very first line > of the file, using ASCII encoding for each character. Most text editors > would find such mixed encoding a challenge, but as hashing must be done > programmatically I don't see an issue. Likewise, before any further text > processing is attempted, the file should be put through a hash checker, > which would output a file ready for the local environment (without a hash > check at the top). Note that the hash comment effectively replaces the > CIF2.0 magic number, reducing potential for confusion. Note that a > non-UTF-8 file without hash comment should not have the CIF2.0 magic number. > This change addresses points (2) and (3) above > > I can't accept that, for two main reasons: > > a) An important aspect of the scheme is that a file that complies with it > can be handled as an ordinary text file, at least in an environment that > correctly autodetects the encoding or that happens to assume the correct > encoding (e.g. because it is the environmental default). Most particularly, > it can still be handled as a text file on the system where it was generated. > > b) Another important aspect of the scheme is that text I carries are > compliant with the CIF2 text specifications, which require the magic code. > > In addition, > > c) For encoding autodetection, it is of great advantage to have a known > character sequence at the beginning of the file. Although having instead > two alternatives would not break autodetection, it would make autodetection > more complicated. Also, > > d) as a practical matter, it is most convenient for a program adding the > hash to write it at the end. A program checking the hash can't be > significantly bothered by such placement because it isn't be ready to use > the hash until it reaches the end. > > > As discussed above, I don't think (2) is a compelling problem in practice. > > If (3) is an issue of significant concern, then I could agree to yield on > (d) by putting the content hash on the same line as the magic code. I don't > see that being much advantage, however, given the need to parse the entire > file anyway before the hash is useful. See also below. > I would be pleased if you would agree to mandate the hash at the beginning. > > >(ii) state the encoding scheme as part of the hash line, inserted as part > of the hash calculation. In this way, at least the hasher's choice of > encoding scheme is known, rather than allowing the further possible errors > arising from hasher and user having different ideas of the encoding. > Addresses point (1) > > Scheme B already provides for an encoding tag as a hint about what encoding > was used (B.3), and I think it reasonable to expect a hasher to create > and/or rewrite that tag as necessary. 
I would be willing to make it a > requirement that if present, the tag must correctly indicate the encoding. > That would also address (3) to some extent by serving as a hint that a > content hash follows (somewhere in the file). > Good. I don't think the presence of the encoding hint by itself is enough to serve as a guarantee that a hash will be found, as there will be users who simply insert that by themselves, being unaware that it is supposed to be machine generated. > > >(iii) restrict possible encodings to internationally recognised ones with > well-specified Unicode mappings. This addresses point (4) > > I don't see the need for this, and to some extent I think it could be > harmful. For example, if Herb sees a use for a scheme of this sort in > conjunction with imgCIF (unknown at this point whether he does), then he > might want to be able to specify an encoding specific to imgCIF, such as one > that provides for multiple text segments, each with its own character > encoding. To the extent that imgCIF is an international standard, perhaps > that could still satisfy the restriction, but I don't think that was the > intended meaning of "internationally recognised". > > As for "well-specified Unicode mappings", I think maybe I'm missing > something. CIF text is already limited to Unicode characters, and any > encoding that can serve for a particular piece of CIF text must map at least > the characters actually present in the text. What encodings or scenarios > would be excluded, then, by that aspect of this suggestion? > My intention was to make sure that not only the particular user who created the file knew this mapping, but that the mapping was publically available. Certainly only Unicode encodable code points will appear, but the recipient needs to be able to recover the mapping from the file bytes to Unicode without relying on e.g. files that will be supplied on request by someone whose email address no longer works. > > > I offer a few additional comments about scheme B: > > A) Because it does not require a hash for UTF-8 (including pure US-ASCII; > see B.5), scheme B is a superset of scheme A. > > B) Scheme B does not use quite the same language as scheme A with respect > to detectable encodings. As a result, it supports (without tagging or > hashing) not just UTF-8, but also all UTF-16 and UTF-32 variants. This is > intentional. > I am concerned that the vast majority of users based in English speaking countries (and many non English speaking countries) will be quite annoyed if they have to deal with UTF-16/32 CIF2 files that are no longer accessible to the simple ASCII-based tools and software that they are used to. Because of this, allowing undecorated UTF16/32 would be far more disruptive than forcing people to use UTF8 only. Thus my stipulation on maintaining compatibility with ASCII for undecorated files. > > C) Scheme B is not aimed at ensuring that every conceivable receiver be > able to interpret every scheme-B-compliant CIF. Instead, it provides > receivers the ability to *judge* whether they can interpret particular CIFs, > and afterwards to *verify* that they have done so correctly. Ensuring that > receivers can interpret CIFs is thus a responsibility of the sender / > archive maintainer, possibly in cooperation with the receiver / retriever. 
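[As an aside on the detection question that runs through this exchange: Scheme B'(5a), point (iii) above, and the treatment of undecorated UTF-8/16/32 files all turn on whether a receiver can recover the byte-to-code-point mapping from the file itself. The sketch below is illustrative only and not part of any of the proposals; it shows the kind of logic being assumed, relying on Unicode byte-order marks and on the known initial characters of a CIF2 file (the version comment, taken here to be "#\#CIF_2.0"). The function name and the particular set of encodings tried are assumptions for the example.]

    import codecs

    MAGIC = "#\\#CIF_2.0"   # CIF2 version comment ("magic code"), rendered as #\#CIF_2.0

    def guess_cif2_encoding(data):
        """Best-effort guess at the encoding of a CIF2 byte stream.
        Returns a codec name, or 'ascii-superset' when the file starts with the
        magic code in an ASCII-compatible form but the exact superset (Latin-1,
        Shift-JIS, ...) cannot be determined without an encoding tag or hash."""
        # Byte-order marks: check the 32-bit forms before the 16-bit ones,
        # because the UTF-16 LE BOM is a prefix of the UTF-32 LE BOM.
        boms = [(codecs.BOM_UTF32_LE, 'utf-32-le'), (codecs.BOM_UTF32_BE, 'utf-32-be'),
                (codecs.BOM_UTF16_LE, 'utf-16-le'), (codecs.BOM_UTF16_BE, 'utf-16-be'),
                (codecs.BOM_UTF8, 'utf-8-sig')]
        for bom, name in boms:
            if data.startswith(bom):
                return name
        # No BOM: look for the magic code expressed in each BOM-less candidate.
        for name in ('utf-32-le', 'utf-32-be', 'utf-16-le', 'utf-16-be'):
            if data.startswith(MAGIC.encode(name)):
                return name
        # ASCII-compatible start: UTF-8 can be confirmed by a trial decode;
        # anything else is some ASCII superset needing further hints.
        if data.startswith(MAGIC.encode('ascii')):
            try:
                data.decode('utf-8')
                return 'utf-8'
            except UnicodeDecodeError:
                return 'ascii-superset'
        raise ValueError('does not begin with a recognizable CIF2 magic code')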
> As I've said before, I don't see the paradigm of live negotiation between senders and receivers as very useful, as it fails to account for CIFs being passed between different software (via reading/writing to a file system), or CIFs where the creator is no longer around, or technically unsophisticated senders where, for example, the software has produced an undecorated CIF in some native encoding and the sender has absolutely no idea why the receiver (if they even have contact with the receiver!) can't read the file properly. I prefer to see the standard that we set as a substitute for live negotiation, so leaving things up to the users is in that sense an abrogation of our responsibility. > > D) I reiterate that the scheme's self-description as being "for archiving > and exchange of CIF text" is intentional and meaningful. It is not intended > to require that CIFs carry hashes or encoding tags when used for other > purposes. The scheme is positioned for use on the front end of an archive > ingestion system or as part of a sending-side agent for CIF exchange, though > it need not be restricted to such scenarios. > > E) Furthermore, scheme B does not interfere with the ability of any party > to transcode CIF text if a different encoding is more suitable for their > purposes. I would expect many archivers to perform such transcoding as a > matter of course, though none are obligated to do so. > > > In light of James' comments and suggestions, then, I offer scheme B'. It > differs from scheme B at point 3, by requiring, in those cases where the > encoding cannot reliably be autodetected, that the correct encoding name be > written in an encoding tag at the beginning of the file. To my knowledge, > all cases of interest either succumb to autodetection or are sufficiently > congruent with US-ASCII (or otherwise sufficiently decodable, given known > initial characters) to allow the encoding tag to be read before the exact > encoding is known. I emphasize that although the encoding tag being > incorrect makes a CIF non-compliant with this scheme, that does not prevent > the correct encoding being discovered via the content hash, by iteration > over all available schemes (provided that the correct scheme is available). > > Scheme B': > 1. This scheme provides for reliable archiving and exchange of CIF text. > Although it depends in some cases on metadata embedded in the CIF text, > presence of such metadata is not a well-formedness constraint on the text > itself. > 2. For the purposes of storage and transfer, CIF files must be treated by > all file handling protocols as streams of bytes. > 3. Any text encoding may be used. If the encoding does not comply with > either (5a) or (5b) below, then its name must be given via an encoding tag > following the magic code, on the same line. Otherwise, an encoding tag is > optional, but if present then it must correctly name the encoding. > 4. Archiving or exchange of CIF text complies with this scheme if the CIF > text contains a correct content hash: > a) The hash value is computed by applying the MD5 algorithm to the > Unicode code point values of the CIF text, in the order they appear, > excluding all code points of CIF comments and all other CIF whitespace > appearing outside data values or separating List or Table elements. > b) The code point stream is converted to a byte stream for input to the > hash function by interpreting each code point as a 24-bit integer, appearing > on the byte stream in order from most-significant to least-significant byte. 
> c) The hash value is expressed in the CIF itself as a structured comment > of the form: > #\#content_hash_md5:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > where the Xs represent the hexadecimal digits of the computed hash > value. > d) The hash comment may appear anywhere in the CIF that a comment may > appear, but conventionally it is at the end of the CIF. > 5. Archiving or exchange of CIF text that does not contain a content hash > complies with this scheme if > a) the text encoding is specified in an international standard and is > distinguishable from all other encodings at the binary level, or > b) the text encoding is coincident with US-ASCII for all code points > appearing in the CIF. > For the purposes of (5a), distinguishing encodings may rely on the > characteristics of CIF, such as the allowed character set and the required > CIF version comment, and also on the actual CIF text (such as for > recognition of UTF-8 by its encoding of non-ASCII characters). > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > ddlm-group mailing list > ddlm-group at iucr.org > http://scripts.iucr.org/mailman/listinfo/ddlm-group > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100810/137e73f0/attachment-0001.html From yaya at bernstein-plus-sons.com Tue Aug 10 11:42:26 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Tue, 10 Aug 2010 06:42:26 -0400 (EDT) Subject: [Cif2-encoding] Fundamental source of disagreement In-Reply-To: References: <826180.50656.qm@web87010.mail.ird.yahoo.com> <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: With all due respect to James and others who adhere to the view that: "There is no such thing as 'optional' for an information interchange standard." I believe this the fundamental source of our disagreement on the the direction for CIF2. Optional features are common in almost all current successful standards for information interchange, including HTML4, XMF and CIF1. As a practical matter, one tries to have strict writers and liberal readers for interchange standards to encourage migration to as common a convention as possible. Even so, if we are too strict in our rules for what is and is not a proper CIF, we will probably encourage the growth of multiple unofficial, unmanaged and non-interchangeable CIF2 dialects. As for John's hashing scheme, I suspect some variation of it will find signficant use in major archives, just as associating MD5 checksums with tarballs does for many software distributors, but that we also will need some easier-to-generate-and-transfer _optional_ encoding hint schemes, such as the accented "o's". One simple way to handle it would be: 1. Put some variant of the accented "o's" into the _optional_ magic number; and 2. 
Adopt the tarball approach to MD5 checksums by having it not in the header but in a separate file, simply generating it from a canonical UTF8 representation of the CIF2 file.

The accented o's are easy to carry along as an encoding hint, and if you get the encoding hint right, then you will easily be able to generate a canonical UTF8 file to validate the MD5 checksum against if you wish for a critical file transfer, e.g. to an archive or a journal.

Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu =====================================================
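[Herbert's detached-checksum suggestion can also be sketched briefly. The fragment below is an illustration only, not something specified in the thread: it assumes the sender already knows (or has correctly guessed) the file's encoding, takes "canonical UTF8" to mean simply the decoded text re-encoded as UTF-8 without a byte-order mark and with line endings left untouched, and invents the function name and the '.md5' sidecar convention for the example.]

    import hashlib
    import pathlib

    def write_detached_md5(cif_path, encoding):
        # Decode with the declared/guessed encoding, re-encode as canonical
        # UTF-8 (assumed here: no BOM, line endings unchanged), and write the
        # MD5 of that byte stream to a sidecar file, md5sum-style.
        path = pathlib.Path(cif_path)
        canonical = path.read_bytes().decode(encoding).encode('utf-8')
        digest = hashlib.md5(canonical).hexdigest()
        sidecar = path.parent / (path.name + '.md5')
        sidecar.write_text('{}  {}\n'.format(digest, path.name))
        return digest

[A recipient who gets the encoding hint right repeats the decode-and-re-encode step and compares digests; a mismatch signals either a wrong hint or mangling in transit.]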
I > emphasize that although the encoding tag being incorrect makes a CIF > non-compliant with this scheme, that does not prevent the correct > encoding being discovered via the content hash, by iteration over all > available schemes (provided that the correct scheme is available). > > Scheme B': > 1. This scheme provides for reliable archiving and exchange of CIF > text. Although it depends in some cases on metadata embedded in the > CIF text, presence of such metadata is not a well-formedness constraint > on the text itself. > 2. For the purposes of storage and transfer, CIF files must be treated > by all file handling protocols as streams of bytes. > 3. Any text encoding may be used. If the encoding does not comply with > either (5a) or (5b) below, then its name must be given via an encoding > tag following the magic code, on the same line. Otherwise, an encoding > tag is optional, but if present then it must correctly name the > encoding. > 4. Archiving or exchange of CIF text complies with this scheme if the > CIF text contains a correct content hash: > a) The hash value is computed by applying the MD5 algorithm to the > Unicode code point values of the CIF text, in the order they appear, > excluding all code points of CIF comments and all other CIF whitespace > appearing outside data values or separating List or Table elements. > b) The code point stream is converted to a byte stream for input to > the hash function by interpreting each code point as a 24-bit integer, > appearing on the byte stream in order from most-significant to > least-significant byte. > c) The hash value is expressed in the CIF itself as a structured > comment of the form: > #\#content_hash_md5:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > where the Xs represent the hexadecimal digits of the computed hash > value. > d) The hash comment may appear anywhere in the CIF that a comment may > appear, but conventionally it is at the end of the CIF. > 5. Archiving or exchange of CIF text that does not contain a content > hash complies with this scheme if > a) the text encoding is specified in an international standard and is > distinguishable from all other encodings at the binary level, or > b) the text encoding is coincident with US-ASCII for all code points > appearing in the CIF. > For the purposes of (5a), distinguishing encodings may rely on the > characteristics of CIF, such as the allowed character set and the > required CIF version comment, and also on the actual CIF text (such as > for recognition of UTF-8 by its encoding of non-ASCII characters). > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > ddlm-group mailing list > ddlm-group at iucr.org > http://scripts.iucr.org/mailman/listinfo/ddlm-group > > > > > > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 > > From John.Bollinger at STJUDE.ORG Wed Aug 11 17:30:28 2010 From: John.Bollinger at STJUDE.ORG (Bollinger, John C) Date: Wed, 11 Aug 2010 11:30:28 -0500 Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To: References: <20100623103310.GD15883@emerald.iucr.org> <381469.52475.qm@web87004.mail.ird.yahoo.com> <984921.99613.qm@web87011.mail.ird.yahoo.com> <826180.50656.qm@web87010.mail.ird.yahoo.com> <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> On Monday, August 09, 2010 10:20 PM, James Hester wrote: >I had not fully appreciated that Scheme B is intended to be applied only at the moment of transfer or archiving, and envisions users normally saving files in their preferred encoding with no hash codes or encoding hints required (I will call the inclusion of such hints and hashes as 'decoration'). "Envisions users normally [...]" is a bit stronger than my position or the intended orientation of Scheme B. "Accommodates" would be my choice of wording. > A direct result of allowing undecorated files to reside on disk is that CIF software producers will need to write software that will function with arbitrary encodings with no decoration to help them, as that is the form that users' files will be most often be in. The standard can do no more to prevent users from storing undecorated CIFs than it can to prevent users from storing CIF text encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding. More generally, all the standard can do is define the characteristics of a conformant CIF -- it can never prevent CIF-like but non-conformant files from being created, used, exchanged, or archived as if they were conformant CIFs. Regardless of the standard's ultimate position on this issue, software authors will have to be guided by practical considerations and by the real-world requirements placed on their programs. In particular, they will have to decide whether to accept "CIF" input that in fact violates the standard in various ways, and / or they will have to decide which optional CIF behaviors they will support. As such, I don't see a significant distinction between the alternatives before us as regards the difficulty, complexity, or requirements of CIF2 software. Furthermore, no formulation of CIF is inherently reliable or unreliable, because reliability (in this sense) is a characteristic of data transfer, not of data themselves. Scheme B targets the activities that require reliability assurance, and disregards those that don't. In a practical sense, this isn't any different from scheme A, because it is only when the encoding is potentially uncertain -- to wit, in the context of data transfer -- that either scheme need be applied (see also below). I suppose I would be willing to make scheme B a general requirement of the CIF format, but I don't see any advantage there over the current formulation. The actual behavior of people and the practical requirements on CIF software would not appreciably change. > Furthermore, given the ease with which files can be transferred between users (email attachment, saved in shared, network-mounted directory, drag and drop onto USB stick etc.) 
it is unlikely that Scheme B or anything involving extra effort would be applied unless the recipient demanded it. For hand-created or hand-edited CIFs, I agree. CIFs manipulated via a CIF2-compliant editor could be relied upon to conform to scheme B, however, provided that is standardized. But the same applies to scheme A, given that few operating environments default to UTF-8 for text. > And given how many times that file might have changed hands across borders and operating systems within a single group collaboration, there would only be a qualified guarantee that the character to binary mapping has not been mangled en route, making any scheme applied subsequently rather pointless. That also does not distinguish among the alternatives before us. I appreciate the desire for an absolute guarantee of reliability, but none is available. Qualified guarantees are the best we can achieve (and that's a technical assessment, not an aphorism). >We would thus go from a situation where we had a single, reliable and sometimes slightly inconvenient encoding (UTF8), to one where a CIF processor should be prepared for any given CIF file to be one of a wide range of encodings which need to be guessed. Under scheme A or the present draft text, we have "a single, reliable [...] encoding" only in the sense that the standard *specifies* that that encoding be used. So far, however, I see little will to produce or use processors that are restricted to UTF-8, and I have every expectation that authors will continue to produce CIFs in various encodings regardless of the standard's ultimate stance. Yes, it might be nice if everyone and every system converged on UTF-8 for text encoding, but CIF2 cannot force that to happen, not even among crystallographers. In practice, then, we really have a situation where the practical / useful CIF2 processor must be prepared to handle a variety of encodings (details dependent on system requirements), which may need to be guessed, with no standard mechanism for helping the processor make that determination or for allowing it to check its guess. Scheme B improves that situation by standardizing a general reliability assurance mechanism, which otherwise would be missing. In view of the practical situation, I see no down side at all. A CIF processor working with scheme B is *more* able, not less. [...] >I would much prefer a scheme which did not compromise reliability in such a significant way. There is no such compromise, because in practice, we're not starting from a reliable position. >My previous (somewhat clunky) attempts to adjust Scheme B were directed at trying to force any file with the CIF2.0 magic number to be either decorated or UTF-8, meaning that software has a reasonably high confidence in file integrity. > >An alternative way of thinking about this is that CIF files also act as the mechanism of information transfer between software programs. [... W]hen a separate program is asked to input that CIF, the information has been transferred, even if that software is running on the same computer. So in that sense, one could argue that Scheme B already applies to all CIFs, its assertion to the contrary notwithstanding. Honestly, though, I don't think debating semantic details of terms such as "data transfer" is useful because in practice, and independent of scheme A, B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to choose what form of reliability assurance to accept or demand, if any.
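For concreteness, here is a minimal sketch (Python; not drawn verbatim from any proposal in this thread) of the content-hash computation that Scheme B/B' relies on for that reliability assurance, following the definition quoted earlier in this message: MD5 over the retained Unicode code points, each fed to the hash as a 24-bit big-endian integer, with the result written as a #\#content_hash_md5: structured comment. The helper that filters out comments and whitespace outside data values is assumed to come from a CIF parser and is named only hypothetically in the comment.

    import hashlib

    def content_hash_md5(code_points):
        # 'code_points' is an iterable of int code point values, already filtered
        # per the scheme (comments and non-value whitespace removed, e.g. by a
        # hypothetical cif_hashable_code_points() helper from a CIF parser).
        md5 = hashlib.md5()
        for cp in code_points:
            md5.update(cp.to_bytes(3, 'big'))   # each code point as a 24-bit big-endian integer
        return md5.hexdigest()

    # Embedding the result as the structured comment the scheme describes, e.g.:
    # hash_line = '#\\#content_hash_md5:' + content_hash_md5(filtered_code_points)

Because the hash is defined over code points rather than bytes, the same value is obtained whatever encoding the file happens to be stored in, which is what makes it usable as a cross-encoding integrity check.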
>Now, moving on to the detailed contours of Scheme B and addressing the particular points that John and I have been discussing. My original criticisms are the ones preceded by numerals. (Quote-indentation levels have been adjusted in what follows.) [...] >>>(2) In order to read the hash value, the encoding of the file needs to be known (!) >> >>Yes and no. In many cases, either the encoding can be determined from the content (even without a correct encoding tag) or it can be determined well enough to parse the file to find the hash (most ASCII supersets). Nevertheless, something along the lines of James's (ii) below can do better. > >If we restrict the allowed encodings to those for which the ASCII codepoints can be autodetected assuming CIF2 layout (in particular the first line) I think that would be sufficiently robust. I'm glad we can agree on something. :-) >>>(3) The recipient doesn't know if a hash value is present until they have parsed the entire file >> >>This is correct. The recipient also cannot *use* the hash without parsing the entire file, however, so it doesn't make a lot of difference. Nevertheless, it would be possible to provide a hint at the beginning of the file, so that parsers that wanted to avoid the overhead of the hash computation could do so. > >The point of having the hash at the front is so that a parsing program can immediately reject an undecorated, non UTF-8 file, or alternatively branch based on how reliable the encoding hint is thought to be. For example, if a hash is present, there is a somewhat stronger guarantee that the encoding hint has been checked or detected by a program rather than manually inserted. I can see some advantage to that, offsetting the added complication for CIF writers. I agree in principle to put the computed hash at the front of the file. >>>(4) Assumption that all recipients will be able to handle all encodings >> >>There is no such assumption. Rather, there is an acknowledgement that some systems may be unable to handle some CIFs. That is already the case with CIF1, and it is not completely resolved by standardizing on UTF-8 (i.e. scheme A). > >There is no such thing as 'optional' for an information interchange standard. A file that conforms to the standard must be readable by parsers written according to the standard. If reading a standard-conformant file might fail or (worse) the file might be misinterpreted, information cannot always reliably be exchanged using this standard, so that optional behaviour needs to be either discarded, or made mandatory. There is thus no point in including optional behaviour in the standard. So: if the standard allows files to be written in encoding XYZ, then all readers should be able to read files written in encoding XYZ. I view the CIF1 stance of allowing any encoding as a mistake, but a benign one, as in the case of CIF1, ASCII was so entrenched that it was the de facto standard for the characters appearing in CIF1 files. In short, we have to specify a limited set of acceptable encodings. As Herb astutely observed, those assertions reflect a fundamental source of our disagreement. I think we can all agree that a standard that permits conforming software to misinterpret conforming data is undesirable. Surely we can also agree that an information interchange standard does not serve its purpose if it does not support information being successfully interchanged.
It does not follow, however, that the artifacts by which any two parties realize an information interchange must be interpretable by all other conceivable parties, nor does it follow that that would be a supremely advantageous characteristic if it were achievable. It also does not follow that recognizable failure of any particular attempt at interchange must at all costs be avoided, or that a data interchange standard must take no account of its usage context. Optional and alternative behaviors are not fundamentally incompatible with a data interchange standard, as XML and HTML demonstrate. Or consider the extreme variability of CIF text content: whether a particular CIF is suitable for a particular purpose depends intimately on exactly which data are present in it, and even to some extent on which data names are used to present them, even though ALL are optional as far as the format is concerned. If I say 'This CIF is unsuitable for my present purpose because it does not contain _symmetry_space_group_name_H-M', that does not mean the CIF standard is broken. Yet, it is not qualitatively different for me to say 'This CIF is unsuitable because it is encoded in CCSID 500' despite CIF2 (hypothetically) permitting arbitrary encodings. [...] >>>(iii) restrict possible encodings to internationally recognised ones with well-specified Unicode mappings. This addresses point (4) >> >>I don't see the need for this, and to some extent I think it could be harmful. For example, if Herb sees a use for a scheme of this sort in conjunction with imgCIF (unknown at this point whether he does), then he might want to be able to specify an encoding specific to imgCIF, such as one that provides for multiple text segments, each with its own character encoding. To the extent that imgCIF is an international standard, perhaps that could still satisfy the restriction, but I don't think that was the intended meaning of "internationally recognised". >> >>As for "well-specified Unicode mappings", I think maybe I'm missing something. CIF text is already limited to Unicode characters, and any encoding that can serve for a particular piece of CIF text must map at least the characters actually present in the text. What encodings or scenarios would be excluded, then, by that aspect of this suggestion? > >My intention was to make sure that not only the particular user who created the file knew this mapping, but that the mapping was publicly available. Certainly only Unicode encodable code points will appear, but the recipient needs to be able to recover the mapping from the file bytes to Unicode without relying on e.g. files that will be supplied on request by someone whose email address no longer works. This issue is relevant only to the parties among whom a particular CIF is exchanged. The standard would not particularly assist those parties by restricting the permitted encodings, because they can safely ignore such restrictions if they mutually agree to do so (whether statically or dynamically), and they (specifically, the CIF originator) must anyway comply with them if no such agreement is implicit or can be reached. >>I offer a few additional comments about scheme B: [...] >>B) Scheme B does not use quite the same language as scheme A with respect to detectable encodings. As a result, it supports (without tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32 variants. This is intentional.
> >I am concerned that the vast majority of users based in English speaking countries (and many non English speaking countries) will be quite annoyed if they have to deal with UTF-16/32 CIF2 files that are no longer accessible to the simple ASCII-based tools and software that they are used to. Because of this, allowing undecorated UTF16/32 would be far more disruptive than forcing people to use UTF8 only. Thus my stipulation on maintaining compatibility with ASCII for undecorated files. Supporting UTF-16/32 without tagging or hashing is not a key provision of scheme B, and I could live without it, but I don't think that would significantly change the likelihood of a user unexpectedly encountering undecorated UTF-16/32 CIFs. It would change only whether such files were technically CIF-conformant, which doesn't much matter to the user on the spot. In any case, it is not the lack of decoration that is the basic problem here. >>C) Scheme B is not aimed at ensuring that every conceivable receiver be able to interpret every scheme-B-compliant CIF. Instead, it provides receivers the ability to *judge* whether they can interpret particular CIFs, and afterwards to *verify* that they have done so correctly. Ensuring that receivers can interpret CIFs is thus a responsibility of the sender / archive maintainer, possibly in cooperation with the receiver / retriever. > >As I've said before, I don't see the paradigm of live negotiation between senders and receivers as very useful, as it fails to account for CIFs being passed between different software (via reading/writing to a file system), or CIFs where the creator is no longer around, or technically unsophisticated senders where, for example, the software has produced an undecorated CIF in some native encoding and the sender has absolutely no idea why the receiver (if they even have contact with the receiver!) can't read the file properly. I prefer to see the standard that we set as a substitute for live negotiation, so leaving things up to the users is in that sense an abrogation of our responsibility. That scenario will undoubtedly occur occasionally regardless of the outcome of this discussion. If it is our responsibility to avoid it at all costs then we are doomed to fail in that regard. Software *will* under some circumstances produce undecorated, non-UTF-8 "CIFs" because that is sometimes convenient, efficient, and appropriate for the program's purpose. I think, though, those comments reflect a bit of a misconception. The overall purpose of CIF supporting multiple encodings would be to allow specific CIFs to be better adapted for specific purposes. Such purposes include, but are not limited to:
() exchanging data with general-purpose program(s) on the same system
() exchanging data with crystallography program(s) on the same system
() supporting performance or storage objectives of specific programs or systems
() efficiently supporting problem or data domains in which Latin text is a minority of the content (e.g. imgCIF)
() storing data in a personal archive
() exchanging data with known third parties
() publishing data to a general audience
*Few, if any, of those uses would be likely to involve live negotiation.* That's why I assigned primary responsibility for selecting encodings to the entity providing the CIF. I probably should not even have mentioned cooperation of the receiver; I did so more because it is conceivable than because it is likely. Under any scheme I can imagine, some CIFs will not be well suited to some purposes.
I want to avoid the situation that *no* conformant CIF can be well suited to some reasonable purposes. I am willing to forgo the result that *every* conformant CIF is suited to certain other, also reasonable purposes. Regards, John -- John C. Bollinger, Ph.D. Department of Structural Biology St. Jude Children's Research Hospital Email Disclaimer: www.stjude.org/emaildisclaimer From jamesrhester at gmail.com Mon Aug 16 07:53:25 2010 From: jamesrhester at gmail.com (James Hester) Date: Mon, 16 Aug 2010 16:53:25 +1000 Subject: [Cif2-encoding] Fundamental source of disagreement In-Reply-To: References: <826180.50656.qm@web87010.mail.ird.yahoo.com> <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: I'm not sure that you have identified the fundamental source of disagreement, but if we disagree on our approaches to optional behaviour we will have trouble finalising the standard, so I have addressed Herb's comments below. On Tue, Aug 10, 2010 at 8:42 PM, Herbert J. Bernstein wrote: > > With all due respect to James and others who adhere to the view that: > > "There is no such thing as 'optional' for an information interchange standard." > > I believe this is the fundamental source of our disagreement on the > direction for CIF2. > > Optional features are common in almost all current successful standards > for information interchange, including HTML4, XML and CIF1. As a > practical matter, one tries to have strict writers and liberal readers > for interchange standards to encourage migration to as common a > convention as possible. Even so, if we are too strict in our rules > for what is and is not a proper CIF, we will probably encourage > the growth of multiple unofficial, unmanaged and non-interchangeable > CIF2 dialects. I dispute all three unsupported statements in the above paragraph. Taking the first one, where HTML, XML and CIF1 are put forward as successful standards for information interchange that have optional features: (1) HTML is no more an information interchange standard than Rich Text Format. It is primarily a standard for marking up documents for presentation to the human reader. If you wish to argue by analogy with HTML, you will need to draw a much tighter parallel to the goals of CIF. (2) I agree that the goals of XML are similar to those of CIF, and I would be pleased if we adopted their approach to optional behaviour. The fifth of the 10 design goals for XML was (see the XML 1.1 standard at http://www.w3.org/TR/2006/REC-xml11-20060816): "5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero." So, if XML is to be our guiding light, then we should avoid optional behaviour. (3) As for CIF1 having optional behaviours, what might those be? I would assert that regardless of the wording of the standard, those optional features are either never supported, or else always supported, or else irrelevant to the core use of CIF. So: I don't think that appealing to HTML or XML proves that optional behaviour is a good thing, and lacking supporting argument, the appeal to CIF1 does not prove it either.
Moving on to your second assertion about liberal readers and strict writers: while that philosophy has its adherents, an alternative philosophy also exists: readers should exit gracefully on standards violations. I quote a recent Linux Weekly News article which bears on this discussion: "The notion that one should be liberal in what one accepts while being conservative in what one sends is often expressed in the networking field, but it shows up in a number of other areas as well. Often, though, it can make more sense to be conservative on the accepting side; the condition of many web pages would have been far better had early browsers not been so forgiving of bad HTML." (http://lwn.net/Articles/394175/) So: which approach you adopt to writing standards-conformant readers requires some thought, particularly given the possibility that liberal readers will encourage liberal writers. The final assertion about the consequences of being too strict might in theory be true, but it will require clear use-cases to support it rather than simply asserting it as a truism. I would suggest that we are nowhere near the point of forcing incompatible dialects to emerge, given that the addition of UTF8 to the standard does not meaningfully restrict the choices offered to CIF users, and any other restrictions that we have introduced into CIF2 relative to CIF1 are very minor. Based on this observation, my expectation is that CIF2 will no more produce incompatible dialects than CIF1, *provided we have no optional behaviour*. I will address what I believe is the real source of disagreement on this point, which is my statement that "Standards-conformant readers must be able to read all files produced by standards-conformant writers", in an answer to John B's other post. > > As for John's hashing scheme, I suspect some variation of it will find significant use in major archives, just as associating MD5 checksums > with tarballs does for many software distributors, but that we also > will need some easier-to-generate-and-transfer _optional_ encoding > hint schemes, such as the accented "o's". One simple way to handle > it would be: > > 1. Put some variant of the accented "o's" into the _optional_ > magic number; and > 2. Adopt the tarball approach to MD5 checksums by having it not > in the header but in a separate file, simply generating it from > a canonical UTF8 representation of the CIF2 file. > > The accented o's are easy to carry along as an encoding hint, and > if you get the encoding hint right, then you will easily be able > to generate a canonical UTF8 file to validate the MD5 checksum against > if you wish for a critical file transfer, e.g. to an archive or a journal. > > Regards, > Herbert > > > > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Mon Aug 16 12:29:25 2010 From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Mon, 16 Aug 2010 07:29:25 -0400 (EDT) Subject: [Cif2-encoding] Fundamental source of disagreement In-Reply-To: References: <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear Colleagues, James presents an interesting view, but one that, with respect to text encodings, is without support in fact. HTML may have begun as a markup language, but it has become one of the most important standards for interchange of documents we have and is, in the current XHTML form, an important dialect of XML. We all seem to agree that XML is an interchange standard, and that CIF1 is an interchange standard. Now let us look at what is and is not, in fact, optional in all three, beginning with encodings and beginning with XML and its pious hope to avoid having optional features: For our reference on XML let us confine our attention to the XML 1.0 (Fifth Edition) document at http://www.w3.org/TR/REC-xml/ and ignore any deviations from that standard that have arisen in practice. The point at issue is, of course, having optional alternatives in character encodings, on which the document says: "The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities." and let us count a few of the optional features in that document:
Section 3: "optional white space"
Section 3.2: "At user option, an XML processor MAY issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error."
Section 3.2.1: "An element type has element content when elements of that type MUST contain only child elements (no character data), optionally separated by white space"
Section 4.6: "the double escaping here is OPTIONAL but harmless"
There are many more in the document. Yes, the document expresses the design goal (i.e. pious hope) that "The number of optional features in XML is to be kept to the absolute minimum, ideally zero." but the fact is that XML has many options, including the option of being expressed in a wide range of character encodings. Lest anyone think this an aberration, please note the highly complex optional behavior of XML with respect to DTDs and the various "non-normative" practices at the end of the document. For XML, the standard specifies many options. For HTML we are all aware of how many optional features and encodings there are, even more than with XML. There is an effort to tighten this up in XHTML with strict conformance. As for CIF1, I recommend reading the specification, which was written to allow writing of CIFs as text documents on a wide range of computers with a wide range of character encodings (including even EBCDIC). Just to pick one obvious example of optional behavior -- look at the stripping of optional blanks at the ends of lines.
Once again, with all due respect to the adherents to the goal of not having any options -- the interchange specifications that are working and in common use have options (including for encodings) and the well-established practice of liberal readers and strict writers. Those are the facts, and, while it is fine for each of us to form our own opinions, it is not viable for each of us to create our own facts. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Mon, 16 Aug 2010, James Hester wrote: > I'm not sure that you have identified the fundamental source of > disagreement, but if we disagree on our approaches to optional > behaviour we will have trouble finalising the standard, so I have > addressed Herb's comments below. > > On Tue, Aug 10, 2010 at 8:42 PM, Herbert J. Bernstein > wrote: >> >> With all due respect to James and others who adhere to the view that: >> >> "There is no such thing as 'optional' for an information interchange standard." >> >> I believe this is the fundamental source of our disagreement on the >> direction for CIF2. >> >> Optional features are common in almost all current successful standards >> for information interchange, including HTML4, XML and CIF1. As a >> practical matter, one tries to have strict writers and liberal readers >> for interchange standards to encourage migration to as common a >> convention as possible. Even so, if we are too strict in our rules >> for what is and is not a proper CIF, we will probably encourage >> the growth of multiple unofficial, unmanaged and non-interchangeable >> CIF2 dialects. > > I dispute all three unsupported statements in the above paragraph. > Taking the first one, where HTML, XML and CIF1 are put forward as > successful standards for information interchange that have optional > features: > (1) HTML is no more an information interchange standard than Rich Text > Format. It is primarily a standard for marking up documents for > presentation to the human reader. If you wish to argue by analogy > with HTML, you will need to draw a much tighter parallel to the goals > of CIF. > > (2) I agree that the goals of XML are similar to those of CIF, and I > would be pleased if we adopted their approach to optional behaviour. > The fifth of the 10 design goals for XML was (see the XML 1.1 standard > at http://www.w3.org/TR/2006/REC-xml11-20060816): > "5. The number of optional features in XML is to be kept to the > absolute minimum, ideally zero." > > So, if XML is to be our guiding light, then we should avoid optional behaviour. > > (3) As for CIF1 having optional behaviours, what might those be? I > would assert that regardless of the wording of the standard, those > optional features are either never supported, or else always > supported, or else irrelevant to the core use of CIF. > > So: I don't think that appealing to HTML or XML proves that optional > behaviour is a good thing, and lacking supporting argument, the appeal > to CIF1 does not prove it either. > > Moving on to your second assertion about liberal readers and strict > writers: while that philosophy has its adherents, an alternative > philosophy also exists: readers should exit gracefully on standards > violations.
I quote a recent Linux Weekly News article which bears on > this discussion: > "The notion that one should be liberal in what one accepts while being > conservative in what one sends is often expressed in the networking > field, but it shows up in a number of other areas as well. Often, > though, it can make more sense to be conservative on the accepting > side; the condition of many web pages would have been far better had > early browsers not been so forgiving of bad HTML." > (http://lwn.net/Articles/394175/) > > So: which approach you adopt to writing standards-conformant readers > requires some thought, particularly given the possibility that liberal > readers will encourage liberal writers. > > The final assertion about the consequences of being too strict might > in theory be true, but it will require clear use-cases to support it > rather than simply asserting it as a truism. I would suggest that we > are nowhere near the point of forcing incompatible dialects to emerge, > given that the addition of UTF8 to the standard does not meaningfully > restrict the choices offered to CIF users, and any other restrictions > that we have introduced into CIF2 relative to CIF1 are very minor. > Based on this observation, my expectation is that CIF2 will no more > produce incompatible dialects than CIF1, *provided we have no optional > behaviour*. > > I will address what I believe is the real source of disagreement on > this point, which is my statement that "Standards-conformant readers > must be able to read all files produced by standards-conformant > writers", in an answer to John B's other post. > >> >> As for John's hashing scheme, I suspect some variation of it will find significant use in major archives, just as associating MD5 checksums >> with tarballs does for many software distributors, but that we also >> will need some easier-to-generate-and-transfer _optional_ encoding >> hint schemes, such as the accented "o's". One simple way to handle >> it would be: >> >> 1. Put some variant of the accented "o's" into the _optional_ >> magic number; and >> 2. Adopt the tarball approach to MD5 checksums by having it not >> in the header but in a separate file, simply generating it from >> a canonical UTF8 representation of the CIF2 file. >> >> The accented o's are easy to carry along as an encoding hint, and >> if you get the encoding hint right, then you will easily be able >> to generate a canonical UTF8 file to validate the MD5 checksum against >> if you wish for a critical file transfer, e.g. to an archive or a journal. >> >> Regards, >> Herbert >> >> >> >> > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > From jamesrhester at gmail.com Tue Aug 24 04:38:27 2010 From: jamesrhester at gmail.com (James Hester) Date: Tue, 24 Aug 2010 13:38:27 +1000 Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> References: <20100623103310.GD15883@emerald.iucr.org> <381469.52475.qm@web87004.mail.ird.yahoo.com> <984921.99613.qm@web87011.mail.ird.yahoo.com> <826180.50656.qm@web87010.mail.ird.yahoo.com> <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Thanks John for a detailed response. At the top of this email I will address this whole issue of optional behaviour. I was clearly too telegraphic in previous posts, as Herbert thinks that optional whitespace counts as an optional feature, so I go into some detail below. By "optional features" I mean those aspects of the standard that are not mandatory for both readers and writers, and in addition I am not concerned with features that do not relate directly to the information transferred, e.g. optional warnings. For example, unless "optional whitespace" means that the *reader* may throw a syntax error when whitespace is encountered at some particular point where whitespace is optional, I do not view optional whitespace as an optional feature - it is only optional for the writer. With this definition of "optional feature" it follows logically that, if a standard has such "optional features", not all standard-conformant files will be readable by all standard-conformant readers. This is as true of HTML, XML and CIF1 as it is of CIF2. Whatever the relevance of HTML and XML to CIF, the existence of successful standards with optional features proves only that a standard can achieve widespread acceptance while having optional features - whether these optional features are a help or a hindrance would require some detailed analysis. So: any standard containing optional features requires the addition of external information in order to resolve the choice of optional features before successful information interchange can take place. Into this situation we place software developers. These are the people who play a big role in deciding which optional parts of the standard are used, as they are the ones that write the software that attempts to read and write the files. Developers will typically choose to support optional features based on how likely they are to be used, which depends in part on how likely they are perceived to be implemented in other software. This is a recursive, potentially unstable situation, which will eventually resolve itself in one of three ways: (1) A "standard" subset of optional features develops and is approximately always implemented in readers. Special cases: (a) No optional features are implemented (b) All optional features are implemented (2) A variety of "standard" subsets develop, dividing users into different communities. These communities can't always read each other's files without additional conversion software, but there is little impetus to write this software, because if there were, the developers would have included support for the missing options in the first place. 
The most obvious example of such communities would be those based on options relating to natural languages, if those communities do not care about accessibility of their files to non-users of their language and encoding. (3) A truly chaotic situation develops, with no discernable resolution and a plethora of incompatible files and software. Outcome 1 is the most desirable, as all files are now readable by all readers, meaning no additional negotiation is necessary, just as if we had mandated that set of optional features. Outcome 2 is less desirable, as more software needs to be written and the standard by itself is not necessarily enough information to read a given file. Outcome 3 is obviously pretty unwelcome, but unlikely as it would require a lot of competing influences, which would eventually change and allow resolution into (1) or (2). Think HTML and Microsoft. Now let us apply the above analysis to CIF: some are advocating not exhaustively listing or mandating the possible CIF2 encodings (CIF1 did not list or mandate encoding either), leading to a range of "optional features" as I have defined it above (where support for any given encoding is a single "optional feature"). For CIF1, we had a type 1 outcome (only ASCII encoding was supported and produced). So: my understanding of the previous discussion is that, while we agree that it would be ideal if everyone used only UTF8, some perceive that the desire to use a different encoding will be sufficiently strong that mandating UTF8 will be ineffective and/or inconvenient. So, while I personally would advocate mandating UTF8, the other point of view would have us allowing non UTF8 encoding but hoping that everyone will eventually move to UTF8. In which case I would like to suggest that we use network effects to influence the recursive feedback loop experienced by programmers described above, so that the community settles on UTF8 in the same way as it has settled on ASCII for CIF1. That is, we "load the dice" so that other encodings are disfavoured. Here are some ways to "load the dice":
(1) Mandate UTF8 only.
(2) Make support for UTF8 mandatory in CIF processors
(3) Force non UTF8 files to jump through extra hoops (which I think is necessary anyway)
(4) Educate programmers on the drawbacks of non UTF8 encodings and strongly urge them not to support reading non UTF8 CIF files
(5) Strongly recommend that the IUCr, wwPDB, and other centralised repositories reject non-UTF8-encoded CIF files
(6) Make available hyperlinked information on system tools for dealing with UTF8 files on popular platforms, which could be used in error messages produced by programs (see (4))
I would be interested in hearing comments on the acceptability of these options from the rest of the group (I think we know how we all feel about (1)!). Now, returning to John's email: I will answer each of the points inline, at the same time attempting to get all the attributions correct. (James) I had not fully appreciated that Scheme B is intended to be applied only at the moment of transfer or archiving, and envisions users normally saving files in their preferred encoding with no hash codes or encoding hints required (I will call the inclusion of such hints and hashes as 'decoration'). (John) "Envisions users normally [...]" is a bit stronger than my position or the intended orientation of Scheme B. "Accommodates" would be my choice of wording.
(James now) No problem with that wording, my point is that such undecorated files will be called CIF2 files and so are a target for CIF2 software developers, thus "unloading" the dice away from UTF8 and closer to encoding chaos. (James) A direct result of allowing undecorated files to reside on disk is that CIF software producers will need to write software that will function with arbitrary encodings with no decoration to help them, as that is the form that users' files will most often be in. (John) The standard can do no more to prevent users from storing undecorated CIFs than it can to prevent users from storing CIF text encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding. More generally, all the standard can do is define the characteristics of a conformant CIF -- it can never prevent CIF-like but non-conformant files from being created, used, exchanged, or archived as if they were conformant CIFs. Regardless of the standard's ultimate position on this issue, software authors will have to be guided by practical considerations and by the real-world requirements placed on their programs. In particular, they will have to decide whether to accept "CIF" input that in fact violates the standard in various ways, and / or they will have to decide which optional CIF behaviors they will support. As such, I don't see a significant distinction between the alternatives before us as regards the difficulty, complexity, or requirements of CIF2 software. (James now) I have described the way the standard works to restrict encodings in the discussion at the top of this email. Briefly, CIF software developers develop programs that conform with the CIF2 standard. If that standard says 'UTF8', they program for UTF8. If you want to work in ISO-8859-15 etc, you have to do extra work. Working in favour of such extra work would be a compelling use case, which I have yet to see (I note that the 'UTF8 only' standard posted to ccp4-bb and pdb-l produced no comments). My strong perception is that any need for other encodings is overwhelmed by the utility of settling on a single encoding, but that perception would need confirmation from a proper survey of non-ASCII users. So, no, we can't stop people saving CIF-like files in other encodings, but we can discourage it by creating significant barriers in terms of software availability. Just like we can't stop CIF1 users saving files in JIS X 0208, but that doesn't happen at any level that causes problems (if it happens at all, which I doubt). (John) Furthermore, no formulation of CIF is inherently reliable or unreliable, because reliability (in this sense) is a characteristic of data transfer, not of data themselves. Scheme B targets the activities that require reliability assurance, and disregards those that don't. In a practical sense, this isn't any different from scheme A, because it is only when the encoding is potentially uncertain -- to wit, in the context of data transfer -- that either scheme need be applied (see also below). I suppose I would be willing to make scheme B a general requirement of the CIF format, but I don't see any advantage there over the current formulation. The actual behavior of people and the practical requirements on CIF software would not appreciably change. (James now) I would suggest that Scheme B does not target all activities requiring reliability assurance, as it does not address the situation where people use a mix of CIF-aware software and text tools in a single encoding environment.
The real, significant change that occurs when you accept Scheme B is that CIF files can now be in any encoding and undecorated. Programmers are then likely to provide programs that might or might not work with various encodings, and users feel justifiably that their undecorated files should be supported. The software barrier that was encouraging UTF8-only has been removed, and the problem of mismatched encodings that we have been trying to avoid becomes that much more likely to occur. Scheme B has very few teeth to enforce decoration at the point of transfer, as the software at either end is now probably happy with an undecorated file. Requiring decoration as a condition of being a CIF2 file means that software will tend to reject undecorated files, thereby limiting the damage that would be caused by open slather encoding. (James) Furthermore, given the ease with which files can be transferred between users (email attachment, saved in shared, network-mounted directory, drag and drop onto USB stick etc.) it is unlikely that Scheme B or anything involving extra effort would be applied unless the recipient demanded it. (John) For hand-created or hand-edited CIFs, I agree. CIFs manipulated via a CIF2-compliant editor could be relied upon to conform to scheme B, however, provided that is standardized. But the same applies to scheme A, given that few operating environments default to UTF-8 for text. (James now) That is my goal: that any CIF that passes through a CIF-compliant program must be decorated before input and output (if not UTF8). What hand-edited, hand-created CIFs actually have in the way of decoration doesn't bother me much, as these are very rare and of no use unless they can be read into a CIF program, at which point they should be rejected until properly decorated. And I reiterate, the process of applying decoration can be done interactively to minimise the chances of incorrect assignment of encoding. (James) And given how many times that file might have changed hands across borders and operating systems within a single group collaboration, there would only be a qualified guarantee that the character to binary mapping has not been mangled en route, making any scheme applied subsequently rather pointless. (John) That also does not distinguish among the alternatives before us. I appreciate the desire for an absolute guarantee of reliability, but none is available. Qualified guarantees are the best we can achieve (and that's a technical assessment, not an aphorism). (James now) Oh, but I believe it does distinguish, because if CIF software reads only UTF8 (because that is what the standard says), then the file will tend to be in UTF8 at all points in time, with reduced possibilities for encoding errors. I think it highly likely that each group that handles a CIF will at some stage run it through CIF-aware software, which means encoding mistakes are likely to be caught much earlier. (James) We would thus go from a situation where we had a single, reliable and sometimes slightly inconvenient encoding (UTF8), to one where a CIF processor should be prepared for any given CIF file to be one of a wide range of encodings which need to be guessed. (John) Under scheme A or the present draft text, we have "a single, reliable [...] encoding" only in the sense that the standard *specifies* that that encoding be used.
So far, however, I see little will to produce or use processors that are restricted to UTF-8, and I have every expectation that authors will continue to produce CIFs in various encodings regardless of the standard's ultimate stance. Yes, it might be nice if everyone and every system converged on UTF-8 for text encoding, but CIF2 cannot force that to happen, not even among crystallographers. (James now) You see little will to do this: but as far as I can tell, there is even less will not to do it. Authors will not "continue" to produce CIFs in various encodings, as they haven't started doing so yet. As I've said above, CIF2 can certainly, if not force, encourage UTF8 adoption. What's more, non-ASCII characters are only gradually going to find their way into CIF2 files, as the dictionaries and large scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters in names, and the users gradually adapt to this new way of doing things. I have no sense that CIF users will feel a strong desire to use non UTF8 schemes, when they have been happy in an ASCII-only regime up until now. But I'm curious: on what basis are you saying that there is little will to use processors that are restricted to UTF8? (John) In practice, then, we really have a situation where the practical / useful CIF2 processor must be prepared to handle a variety of encodings (details dependent on system requirements), which may need to be guessed, with no standard mechanism for helping the processor make that determination or for allowing it to check its guess. Scheme B improves that situation by standardizing a general reliability assurance mechanism, which otherwise would be missing. In view of the practical situation, I see no down side at all. A CIF processor working with scheme B is *more* able, not less. (James) I would much prefer a scheme which did not compromise reliability in such a significant way. (John) There is no such compromise, because in practice, we're not starting from a reliable position. (James now) I think your statement that our current position is not reliable arises out of a perception that users are likely to use a variety of encodings regardless of what the standard says. I think this danger is way overstated, but I'd like to see you expand on why you think there is such a likelihood of multiple encodings being used. (James) My previous (somewhat clunky) attempts to adjust Scheme B were directed at trying to force any file with the CIF2.0 magic number to be either decorated or UTF-8, meaning that software has a reasonably high confidence in file integrity. An alternative way of thinking about this is that CIF files also act as the mechanism of information transfer between software programs. [... W]hen a separate program is asked to input that CIF, the information has been transferred, even if that software is running on the same computer. (John) So in that sense, one could argue that Scheme B already applies to all CIFs, its assertion to the contrary notwithstanding. Honestly, though, I don't think debating semantic details of terms such as "data transfer" is useful because in practice, and independent of scheme A, B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to choose what form of reliability assurance to accept or demand, if any.
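To make the "decorated or UTF-8" position James describes above concrete, here is a minimal reader-side sketch (Python; purely illustrative): undecorated input is accepted only if it decodes as UTF-8, and anything else must carry an encoding tag before it is treated as CIF2. The #\#encoding: tag syntax is hypothetical (no such form has been agreed in this thread), BOM-bearing UTF-16/32 input (which the earlier autodetection sketch would catch) is ignored here for brevity, and a content hash, where present, would still need to be verified afterwards.

    def accept_cif2(data: bytes) -> str:
        # Undecorated UTF-8 (which includes plain ASCII) is always acceptable.
        try:
            return data.decode('utf-8')
        except UnicodeDecodeError:
            pass
        # Otherwise insist on an encoding tag readable as ASCII on the first line.
        first_line = data.split(b'\n', 1)[0]
        parts = first_line.split(b'#\\#encoding:', 1)   # hypothetical tag form
        if len(parts) < 2 or not parts[1].split():
            raise ValueError('undecorated non-UTF-8 file: rejected')
        encoding_name = parts[1].split()[0].decode('ascii')
        return data.decode(encoding_name)

The design point being illustrated is the one argued in this exchange: a reader written this way quietly accepts the common case (UTF-8) while refusing to guess for anything else, which is the "extra hoop" intended to steer writers toward either UTF-8 or explicit decoration.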
(James now) I was only debating semantic details in order to expose the fact that data transfer occurs between programs, not just between systems, and that therefore Scheme B should apply within a single system, so all CIF2 files should be decorated. As for who should be demanding reliability assurance, the receiver may not be in a position to demand some level of reliability if the file creator is not in direct contact. Again, we can build this reliability into the standard and save the extra negotiation or loss of information that is otherwise involved. (James) Now, moving on to the detailed contours of Scheme B and addressing the particular points that John and I have been discussing. My original criticisms are the ones preceded by numerals. [(James now) I've deleted those points where we have reached agreement. Those points are: (1) Restrict encodings to those for which the first line of a CIF file provides unambiguous encoding for ASCII codepoints (2) Put the hash value on the first line] (James a long time ago) (4) Assumption that all recipients will be able to handle all encodings (John) There is no such assumption. Rather, there is an acknowledgement that some systems may be unable to handle some CIFs. That is already the case with CIF1, and it is not completely resolved by standardizing on UTF-8 (i.e. scheme A). (James) There is no such thing as 'optional' for an information interchange standard. A file that conforms to the standard must be readable by parsers written according to the standard. If reading a standard-conformant file might fail or (worse) the file might be misinterpreted, information cannot always reliably be exchanged using this standard, so that optional behaviour needs to be either discarded, or made mandatory. There is thus no point in including optional behaviour in the standard. So: if the standard allows files to be written in encoding XYZ, then all readers should be able to read files written in encoding XYZ. I view the CIF1 stance of allowing any encoding as a mistake, but a benign one, as in the case of CIF1, ASCII was so entrenched that it was the de facto standard for the characters appearing in CIF1 files. In short, we have to specify a limited set of acceptable encodings. (John) As Herb astutely observed, those assertions reflect a fundamental source of our disagreement. I think we can all agree that a standard that permits conforming software to misinterpret conforming data is undesirable. Surely we can also agree that an information interchange standard does not serve its purpose if it does not support information being successfully interchanged. It does not follow, however, that the artifacts by which any two parties realize an information interchange must be interpretable by all other conceivable parties, nor does it follow that that would be a supremely advantageous characteristic if it were achievable. It also does not follow that recognizable failure of any particular attempt at interchange must at all costs be avoided, or that a data interchange standard must take no account of its usage context. (James now) This is where we must make a policy decision: is a CIF2 file to be a universally understandable file? I agree that excluding optional behaviour is not an absolute requirement, but I also consider that optional behaviour should not be introduced without solid justification, given the real cost in interoperability and portability of the standard.
You refer to two parties who wish to exchange information: those parties are always free to agree on private enhancements to the CIF2 standard (or to create their very own protocol), if they are in contact. I do not see why this use case need concern us here. Herbert can say to John 'I'm emailing you a CIF2 file but encoded in UTF16'. John has his extremely excellent software which handles UTF16 and these two parties are happy. John mentions a 'usage context'. If the standard is to include some account of usage context, then that context has to be specified sufficiently for a CIF2 programmer to understand what aspects of that context to consider, and not left open to misinterpretation. Perhaps you could enlarge on what particular context should be included?

(John) Optional and alternative behaviors are not fundamentally incompatible with a data interchange standard, as XML and HTML demonstrate. Or consider the extreme variability of CIF text content: whether a particular CIF is suitable for a particular purpose depends intimately on exactly which data are present in it, and even to some extent on which data names are used to present them, even though ALL are optional as far as the format is concerned. If I say 'This CIF is unsuitable for my present purpose because it does not contain _symmetry_space_group_name_H-M', that does not mean the CIF standard is broken. Yet, it is not qualitatively different for me to say 'This CIF is unsuitable because it is encoded in CCSID 500' despite CIF2 (hypothetically) permitting arbitrary encodings.

(James now) The difference is quantitative and qualitative. Quantitative, because the number of CIF2 files that are unsuitable because of missing tags will always be less than or equal to the number of CIF2 files that are unsuitable because of a missing tag and unknown encoding. Thus, by reducing ambiguity at the lower levels of the standard, we improve the utility at the higher levels. The difference is also qualitative, in that (a) if we have tags with non-ASCII characters, they could conceivably be confused with other tags if the encoding is not correct, and so you will have a situation where a file that is not suitable actually appears suitable, because the desired tag appears; and (b) likewise, the value taken by a tag may be wrong.

(James a long time ago) (iii) restrict possible encodings to internationally recognised ones with well-specified Unicode mappings. This addresses point (4).

(John) I don't see the need for this, and to some extent I think it could be harmful. For example, if Herb sees a use for a scheme of this sort in conjunction with imgCIF (unknown at this point whether he does), then he might want to be able to specify an encoding specific to imgCIF, such as one that provides for multiple text segments, each with its own character encoding. To the extent that imgCIF is an international standard, perhaps that could still satisfy the restriction, but I don't think that was the intended meaning of "internationally recognised".

(James now) Indeed. My intent with this specification was to ensure that third parties would be able to recover the encoding. If imgCIF is going to cause us to make such an open-ended specification, it is probably a sign that imgCIF needs to be addressed separately. For example, should we think about redefining it as a container format, with a CIF header and UTF16 body (but still part of the "Crystallographic Information Framework")?

(John) As for "well-specified Unicode mappings", I think maybe I'm missing something.
CIF text is already limited to Unicode characters, and any encoding that can serve for a particular piece of CIF text must map at least the characters actually present in the text. What encodings or scenarios would be excluded, then, by that aspect of this suggestion?

(James) My intention was to make sure that not only the particular user who created the file knew this mapping, but that the mapping was publicly available. Certainly only Unicode-encodable code points will appear, but the recipient needs to be able to recover the mapping from the file bytes to Unicode without relying on e.g. files that will be supplied on request by someone whose email address no longer works.

(John) This issue is relevant only to the parties among whom a particular CIF is exchanged. The standard would not particularly assist those parties by restricting the permitted encodings, because they can safely ignore such restrictions if they mutually agree to do so (whether statically or dynamically), and they (specifically, the CIF originator) must anyway comply with them if no such agreement is implicit or can be reached.

(James) Again, any two parties in current contact can send each other files in whatever format and encoding they wish. My concern is that CIF software writers are not drawn into supporting obscure or ad hoc encodings.

(John) B) Scheme B does not use quite the same language as scheme A with respect to detectable encodings. As a result, it supports (without tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32 variants. This is intentional.

(James) I am concerned that the vast majority of users based in English-speaking countries (and many non-English-speaking countries) will be quite annoyed if they have to deal with UTF-16/32 CIF2 files that are no longer accessible to the simple ASCII-based tools and software that they are used to. Because of this, allowing undecorated UTF16/32 would be far more disruptive than forcing people to use UTF8 only. Thus my stipulation on maintaining compatibility with ASCII for undecorated files.

(John) Supporting UTF-16/32 without tagging or hashing is not a key provision of scheme B, and I could live without it, but I don't think that would significantly change the likelihood of a user unexpectedly encountering undecorated UTF-16/32 CIFs. It would change only whether such files were technically CIF-conformant, which doesn't much matter to the user on the spot. In any case, it is not the lack of decoration that is the basic problem here.

(James now) Yes, that is true. A decorated UTF16 file is just as unreadable as an undecorated one in ASCII tools. However, per my comments at the start of this email, I think an extra bit of hoop-jumping for non-UTF8-encoded files has the desirable property of encouraging UTF8 use.

(John) C) Scheme B is not aimed at ensuring that every conceivable receiver be able to interpret every scheme-B-compliant CIF. Instead, it provides receivers the ability to *judge* whether they can interpret particular CIFs, and afterwards to *verify* that they have done so correctly. Ensuring that receivers can interpret CIFs is thus a responsibility of the sender / archive maintainer, possibly in cooperation with the receiver / retriever.
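(To make the judge-then-verify idea concrete, here is a minimal Python sketch. The first-line decoration syntax -- an "encoding=... hash=..." comment -- and the use of SHA-1 over the decoded text are placeholders invented for this sketch rather than the actual Scheme B wording, and an ASCII-compatible encoding is assumed so that the first line can be read before the full encoding is known.)

# Illustrative only: the judge-then-verify idea.  The decoration syntax and
# the hash algorithm are invented placeholders, not the Scheme B text itself.
import hashlib
import re

_DECORATION = re.compile(r"encoding=(?P<enc>[A-Za-z0-9_.:-]+)\s+hash=(?P<hash>[0-9a-fA-F]+)")

def read_decorated(raw: bytes) -> str:
    first, _, rest = raw.partition(b"\n")
    m = _DECORATION.search(first.decode("ascii", errors="replace"))
    if m is None:
        raise ValueError("undecorated file: encoding must be agreed some other way")
    text = rest.decode(m.group("enc"))                   # judge: can we decode at all?
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest != m.group("hash").lower():                # verify: was the decoding correct?
        raise ValueError("hash mismatch: wrong encoding guessed or corrupted file")
    return text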
(James) As I've said before, I don't see the paradigm of live negotiation between senders and receivers as very useful, as it fails to account for CIFs being passed between different software (via reading/writing to a file system), or CIFs where the creator is no longer around, or technically unsophisticated senders where, for example, the software has produced an undecorated CIF in some native encoding and the sender has absolutely no idea why the receiver (if they even have contact with the receiver!) can't read the file properly. I prefer to see the standard that we set as a substitute for live negotiation, so leaving things up to the users is in that sense an abrogation of our responsibility.

(John) That scenario will undoubtedly occur occasionally regardless of the outcome of this discussion. If it is our responsibility to avoid it at all costs then we are doomed to fail in that regard. Software *will* under some circumstances produce undecorated, non-UTF-8 "CIFs" because that is sometimes convenient, efficient, and appropriate for the program's purpose. I think, though, those comments reflect a bit of a misconception. The overall purpose of CIF supporting multiple encodings would be to allow specific CIFs to be better adapted for specific purposes. Such purposes include, but are not limited to:
() exchanging data with general-purpose program(s) on the same system
() exchanging data with crystallography program(s) on the same system
() supporting performance or storage objectives of specific programs or systems
() efficiently supporting problem or data domains in which Latin text is a minority of the content (e.g. imgCIF)
() storing data in a personal archive
() exchanging data with known third parties
() publishing data to a general audience
*Few, if any, of those uses would be likely to involve live negotiation.* That's why I assigned primary responsibility for selecting encodings to the entity providing the CIF. I probably should not even have mentioned cooperation of the receiver; I did so more because it is conceivable than because it is likely.

(James now) OK, fair enough. My issue then with the paradigm of provider-based encoding selection is that it only works where the provider is capable of making this choice, and it puts that responsibility on all providers, large and small. Of course, I am keen to construct a CIF ecology where providers always automatically choose UTF8 as the "safe" choice.

(John) Under any scheme I can imagine, some CIFs will not be well suited to some purposes. I want to avoid the situation that *no* conformant CIF can be well suited to some reasonable purposes. I am willing to forgo the result that *every* conformant CIF is suited to certain other, also reasonable purposes.

(James now) Fair enough. However, so far the only reasonable purpose that I can see for which a UTF8 file would not be suitable is exchanging data with general-purpose programs that do not cope with UTF8, and it may well be that with a bit of research the list of such programs would turn out to be rather short.

-- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Tue Aug 24 11:56:10 2010 From: yaya at bernstein-plus-sons.com (Herbert J.
Bernstein) Date: Tue, 24 Aug 2010 06:56:10 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics In-Reply-To: References: <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID:

Dear Colleagues,

James' and John's last interchange is so voluminous, I doubt any of us has been able to fully appreciate the rich complexity of ideas contained therein. For example, one of the suggestions far down in the text is:

(James now) Indeed. My intent with this specification was to ensure that third parties would be able to recover the encoding. If imgCIF is going to cause us to make such an open-ended specification, it is probably a sign that imgCIF needs to be addressed separately. For example, should we think about redefining it as a container format, with a CIF header and UTF16 body (but still part of the "Crystallographic Information Framework")?

The idea of an imgCIF "header" in CIF format and an image in another is an old, well-established, thoroughly discussed, and mistaken idea, rejected in 1998. The handling of multiple images in a single file (e.g. a jpeg thumbnail and crystal image and a full-size diffraction image) requires the ability to switch among encodings within the file -- something handled by the current DDL2 and MIME-based imgCIF format and which would be a serious problem in CIF2 as currently proposed, increasing the chances that we will have to move imgCIF entirely into HDF5 and abandon the CIF representation entirely, sharing only the dictionary and not the framework.

If you look carefully, you will see a similar trend with mmCIF, in which an XML representation sharing the dictionary plays a much more important role than the CIF format.

Is it really desirable to make the new CIF format so rigid and unadaptable that major portions of macromolecular crystallography end up migrating to very different formats, as they already are doing? Yes, there is great value in having a common dictionary, but would there not be additional value in having a sufficiently flexible common format to allow for more software sharing than we now have? Is it really desirable for us to continue in the direction of a single macromolecular experiment having to deal with HDF5 and CIF/DDL2/MIME representations of the image data during collection, CCP4-style CIF representations during processing and deposition, and legacy PDB and PDBML representations in subsequent community use? If we could be a little bit more flexible, we might be able to reduce the data interchange software burdens a little. Right now, this discussion seems headed in the direction of simply adding yet another data representation (DDLm/CIF2) to the mix, increasing the chances of mistranslation and confusion, rather than reducing them.

Please, step back a bit from the detailed discussion of UTF8 and look at the work-flow of doing and publishing crystallographic experiments and let us try to make a contribution that simplifies it, not one that makes it more complex than it needs to be.

I suggest we need to meet and talk, either face-to-face, or by Skype.
Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== From jamesrhester at gmail.com Tue Aug 24 14:01:02 2010 From: jamesrhester at gmail.com (James Hester) Date: Tue, 24 Aug 2010 23:01:02 +1000 Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics In-Reply-To: References: <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Hi Herbert: regarding imgCIF, I agree that splitting it off is not a desirable outcome. I would like to get an idea of how well imgCIF can be accommodated under the various encoding proposals currently floating around, as you have been rather reticent to bring it up. My naive take on things is that a UTF8-only encoding scheme for CIF2 would not pose significant issues for imgCIF, and a decorated UTF16 encoding in the style of Scheme B would be even better, and quite adequate, so imgCIF is not actually presenting any problems and so was a red herring. I'm not sure that face-to-face or Skype discussions are necessarily going to be more productive. Writing things down, while slower, allows me at least to collect my thoughts and those of other participants, and hopefully make a reasoned contribution (my apologies if I am too long-winded) and as an added bonus those thoughts are recorded for later reference. For example, where would I now find the background on why a container format for imgCIF is such a bad idea? Presumably that was all thrashed out in face to face discussions, and no record now remains. On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein wrote: > Dear Colleagues, > > ? James' and John's last interchange is so voluminous, I doubt any of > us has been able to fully appreciate the rich complexity of ideas > contained therein. ?For example, one of the suggestions far down in > the text is: > > (James now) ?Indeed. ?My intent with this specification was to ensure > that third parties would be able to recover the encoding. If imgCIF is > going to cause us to make such an open-ended specification, it is > probably a sign that imgCIF needs to be addressed separately. ?For > example, should we think about redefining it as a container format, > with a CIF header and UTF16 body (but still part of the > "Crystallographic Information Framework")? > > The idea of an imgCIF "header" in CIF format and a image in another is an > old, well-established, thoroughly discussed, and mistaken idea, rejected > in 1998. ?The handling of multiple images in a single file (e.g. 
> a jpeg thumbnail and crystal image and a full-size diffraction image) > requires the ability to switch among encodings within the file -- > something handled by the current DDL2 and MIME-based imgCIF format and > which would be a serious problem in CIF2 has currently proposed, > increasing the chances that we will have to move imgCIF entirely into > HDF5 and abandon the CIF representation entirely, sharing only > the dictionary and not the framework. > > If you look carefully, you will see a similar trend with mmCIF, in which > and XML representation sharing the dictionary plays a much more > important role than the CIF format. > > Is it really desirable to make the new CIF format so rigid and > unadaptable that major portions of macromolecular crysallography > end up migrating to very different formats, as they already are > doing? ?Yes, there is great value in having a common dictionary, > but would there not be additional value in having a sufficiently > flexible common format to allow for more software sharing than > we now have? ?It is really desirable for us to continue in the > direction of a single macromolecular experiment having to > deal with HDF5 and CIF/DDL2/MIME representations of the image data > during collection, CCP4-style CIF representations during processing > and deposition and legacy PDB and PDBML representations in subsequent > community use? ?If we could be a little bit more flexible, we might be > able to reduce the data interchange software burdens a little. > Right now, this discussion seems headed in the direction of simply > adding yet another data representation (DDLm/CIF2) to the mix, > increasing the chances of mistranslation and confusion, rather > that reducing them. > > Please, step back a bit from the detailed discussion of UTF8 and > look at the work-flow of doing and publishing crystallographic > experiments and let us try to make a contribution that simplifies > it, not one that makes it more complex than it needs to be. > > I suggest we need to meet and talk, either face-to-face, or by skype. > > Regards, > ? Herbert > > ===================================================== > ?Herbert J. Bernstein, Professor of Computer Science > ? ?Dowling College, Kramer Science Center, KSC 121 > ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 > > ? ? ? ? ? ? ? ? ?+1-631-244-3035 > ? ? ? ? ? ? ? ? ?yaya at dowling.edu > ===================================================== > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 From yaya at bernstein-plus-sons.com Tue Aug 24 14:31:51 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Tue, 24 Aug 2010 09:31:51 -0400 (EDT) Subject: [Cif2-encoding] Splitting of imgCIF and other sub-topics In-Reply-To: References: <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear James, I have not been at all reticent -- imgCIF will be very poorly supported by CIF2 as currently proposed. 
Of necessity, imgCIF changes encodings internally -- that is why it uses MIME -- same problem as email with images, same solution. Any purely text version has at least a 7% overhead as compared to pure binary. Restricting to UTF-8 increases the overhead to at least 50%. We may get away with the 7% (UTF-16). The 50% version (UTF-8) will be ignored by the community as unworkable. The most likely to be used version will be the current DDL2-based version with embedded compressed binaries that I am augmenting with DDLm-like features and merging in with HDF5. As I noted many months ago, the unfortunate reality is that the current CIF2 effort will not merge well with imgCIF. If avoiding a split is important -- we need a meeting. I would suggest involving Bob Sweet and holding it at BNL in conjunction with something relevant to NSLS-II.

Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu =====================================================

On Tue, 24 Aug 2010, James Hester wrote: > Hi Herbert: regarding imgCIF, I agree that splitting it off is not a > desirable outcome. I would like to get an idea of how well imgCIF can > be accommodated under the various encoding proposals currently > floating around, as you have been rather reticent to bring it up. My > naive take on things is that a UTF8-only encoding scheme for CIF2 > would not pose significant issues for imgCIF, and a decorated UTF16 > encoding in the style of Scheme B would be even better, and quite > adequate, so imgCIF is not actually presenting any problems and so was > a red herring. > > I'm not sure that face-to-face or Skype discussions are necessarily > going to be more productive. Writing things down, while slower, > allows me at least to collect my thoughts and those of other > participants, and hopefully make a reasoned contribution (my apologies > if I am too long-winded) and as an added bonus those thoughts are > recorded for later reference. For example, where would I now find the > background on why a container format for imgCIF is such a bad idea? > Presumably that was all thrashed out in face to face discussions, and > no record now remains. > > On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein > wrote: >> Dear Colleagues, >> >> ? James' and John's last interchange is so voluminous, I doubt any of >> us has been able to fully appreciate the rich complexity of ideas >> contained therein. ?For example, one of the suggestions far down in >> the text is: >> >> (James now) ?Indeed. ?My intent with this specification was to ensure >> that third parties would be able to recover the encoding. If imgCIF is >> going to cause us to make such an open-ended specification, it is >> probably a sign that imgCIF needs to be addressed separately. ?For >> example, should we think about redefining it as a container format, >> with a CIF header and UTF16 body (but still part of the >> "Crystallographic Information Framework")? >> >> The idea of an imgCIF "header" in CIF format and a image in another is an >> old, well-established, thoroughly discussed, and mistaken idea, rejected >> in 1998. ?The handling of multiple images in a single file (e.g.
>> a jpeg thumbnail and crystal image and a full-size diffraction image) >> requires the ability to switch among encodings within the file -- >> something handled by the current DDL2 and MIME-based imgCIF format and >> which would be a serious problem in CIF2 has currently proposed, >> increasing the chances that we will have to move imgCIF entirely into >> HDF5 and abandon the CIF representation entirely, sharing only >> the dictionary and not the framework. >> >> If you look carefully, you will see a similar trend with mmCIF, in which >> and XML representation sharing the dictionary plays a much more >> important role than the CIF format. >> >> Is it really desirable to make the new CIF format so rigid and >> unadaptable that major portions of macromolecular crysallography >> end up migrating to very different formats, as they already are >> doing? ?Yes, there is great value in having a common dictionary, >> but would there not be additional value in having a sufficiently >> flexible common format to allow for more software sharing than >> we now have? ?It is really desirable for us to continue in the >> direction of a single macromolecular experiment having to >> deal with HDF5 and CIF/DDL2/MIME representations of the image data >> during collection, CCP4-style CIF representations during processing >> and deposition and legacy PDB and PDBML representations in subsequent >> community use? ?If we could be a little bit more flexible, we might be >> able to reduce the data interchange software burdens a little. >> Right now, this discussion seems headed in the direction of simply >> adding yet another data representation (DDLm/CIF2) to the mix, >> increasing the chances of mistranslation and confusion, rather >> that reducing them. >> >> Please, step back a bit from the detailed discussion of UTF8 and >> look at the work-flow of doing and publishing crystallographic >> experiments and let us try to make a contribution that simplifies >> it, not one that makes it more complex than it needs to be. >> >> I suggest we need to meet and talk, either face-to-face, or by skype. >> >> Regards, >> ? Herbert >> >> ===================================================== >> ?Herbert J. Bernstein, Professor of Computer Science >> ? ?Dowling College, Kramer Science Center, KSC 121 >> ? ? ? ? Idle Hour Blvd, Oakdale, NY, 11769 >> >> ? ? ? ? ? ? ? ? ?+1-631-244-3035 >> ? ? ? ? ? ? ? ? ?yaya at dowling.edu >> ===================================================== >> >> _______________________________________________ >> cif2-encoding mailing list >> cif2-encoding at iucr.org >> http://scripts.iucr.org/mailman/listinfo/cif2-encoding >> > > > > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > From simonwestrip at btinternet.com Thu Aug 26 00:08:14 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Wed, 25 Aug 2010 23:08:14 +0000 (GMT) Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. . 
In-Reply-To: References: <20100623103310.GD15883@emerald.iucr.org> <381469.52475.qm@web87004.mail.ird.yahoo.com> <984921.99613.qm@web87011.mail.ird.yahoo.com> <826180.50656.qm@web87010.mail.ird.yahoo.com> <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: <639601.73559.qm@web87008.mail.ird.yahoo.com>

Dear all

Recent contributions have stimulated me to revisit some of the fundamental issues of the possible changes in CIF2 with respect to CIF1, in particular, the impact on current practice (as I perceive it, based on my experience). The following is a summary of my thoughts, trying to look at this from two perspectives (forgive me if I repeat earlier opinions):

1) User perspective

To date, in the 'core' CIF world (i.e. single-crystal and its extensions), users treat CIFs as text files, and expect to be able to read them as such using plain-text editors, and indeed edit them if necessary for e.g. publication purposes. Furthermore, they expect them to be readable by applications that claim that ability (e.g. graphics software). The situation is slightly different with mmCIF (and the pdb variants), where users tend to treat these CIFs as data sources that can be read by applications without any need to examine the raw CIF themselves, let alone edit them. Although the above statements only encompass two user groups and are based on my personal experience, I believe these groups are the largest when talking about CIF users?

So what is the impact on such users of introducing the use of non-ASCII text and thus raising the text encoding issue? In the latter case, probably minimal, inasmuch as the users don't interact directly with the raw CIF and rely on CIF processing software to manage the data. In the former case, it is quite possible that a user will no longer be able to edit the raw CIF using the same plain-text editor they have always used for such purposes. For example, if a user receives a CIF that has been encoded in UTF16 by some remote CIF processing system, and opens it in a non-UTF16-aware plain-text editor, they will not be presented with what they would expect, even if the character set in that particular CIF doesn't extend beyond ASCII; furthermore, even 'advanced' text editors would struggle if the encoding were e.g. UTF16BE (i.e. has no BOM) (see the short byte-level illustration below). Granted, this example is equally applicable to CIF1, but by 'opening up' multiple encodings, the probability of their usage increases? So as soon as we move beyond ASCII, we have to accept that a large group of CIF users will, at the very least, have to be aware that CIF is no longer the 'text' format that they once understood it to be?

2) Developer perspective

I believe that developers presented with a documented standard will follow that standard and prefer to work with no uncertainties, especially if they are unfamiliar with the format (perhaps just need to be able to read a CIF to extract data relevant to their application/database...?)
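(The short byte-level illustration referred to above: run in Python, it shows how a purely ASCII CIF fragment looks once it is stored as UTF-16, and why the BOM-less UTF-16BE form gives an editor nothing to go on. The fragment and data name are arbitrary examples, and nothing here is specific to any proposed CIF2 scheme.)

# Illustrative only: even a purely ASCII CIF fragment becomes opaque to
# ASCII-oriented tools once stored as UTF-16, and the BOM-less UTF-16BE
# form carries no signature at all to tip an editor off.
fragment = "data_example\n_cell_length_a   5.4321\n"

print(fragment.encode("utf-8"))     # byte-for-byte the same as the ASCII text
print(fragment.encode("utf-16"))    # a BOM followed by NUL-interleaved bytes
print(fragment.encode("utf-16-be")) # no BOM: nothing signals the encoding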
Taking the example of XML, in my experience developers seem to follow the standard quite strictly. Most everyday applications that process XML are intolerant of violations of the standard. Fortunately, it is largely only developers that work with raw XML, so the standard works well.

In contrast to XML, with HTML/JavaScript the approach to the 'standard' is far more tolerant. Though these languages are standardized, in order to compete, the leading application developers have had to adopt flexibility (e.g. browsers accept 'dirty' HTML, are remarkably forgiving of syntax violations in JavaScript, and alter the standard to achieve their own ends or facilitate user requirements). I suspect this results largely from the evolution of the languages: just as in the early days of CIF, encouragement of use and the end results were more important than adherence to the documented standard? Note that these same applications that are so tolerant of HTML/JavaScript violations are far less forgiving of malformed XML. So is the lesson here that developers expect new standards to be unambiguous and will code accordingly (especially if the new standard was partly designed to address the shortcomings of its ancestors)?

Again, forgive me if this all sounds familiar - however, before arguing one way or the other with regard to specifics, perhaps the wider group would like to confirm or otherwise the main points I'm trying to assert, in particular, with respect to *user* practice:

1) CIF2 will require users to change the way they view CIF - i.e. they may be forced to use CIF2-compliant text editors/application software, and abandon their current practice.

With respect to developers, recent coverage has been very insightful, but just out of interest, would I be wrong in stating that:

2) Developers, especially those that don't specialize in CIF, are likely to want a clear-cut universal standard that does not require any heuristic interpretation.

Cheers

Simon

________________________________ From: James Hester To: Group for discussing encoding and content validation schemes for CIF2 Sent: Tuesday, 24 August, 2010 4:38:27 Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .

Thanks, John, for a detailed response. At the top of this email I will address this whole issue of optional behaviour. I was clearly too telegraphic in previous posts, as Herbert thinks that optional whitespace counts as an optional feature, so I go into some detail below.

By "optional features" I mean those aspects of the standard that are not mandatory for both readers and writers, and in addition I am not concerned with features that do not relate directly to the information transferred, e.g. optional warnings. For example, unless "optional whitespace" means that the *reader* may throw a syntax error when whitespace is encountered at some particular point where whitespace is optional, I do not view optional whitespace as an optional feature - it is only optional for the writer. With this definition of "optional feature" it follows logically that, if a standard has such "optional features", not all standard-conformant files will be readable by all standard-conformant readers. This is as true of HTML, XML and CIF1 as it is of CIF2.
Whatever the relevance of HTML and XML to CIF, the existence of successful standards with optional features proves only that a standard can achieve widespread acceptance while having optional features - whether these optional features are a help or a hindrance would require some detailed analysis. So: any standard containing optional features requires the addition of external information in order to resolve the choice of optional features before successful information interchange can take place.

Into this situation we place software developers. These are the people who play a big role in deciding which optional parts of the standard are used, as they are the ones that write the software that attempts to read and write the files. Developers will typically choose to support optional features based on how likely they are to be used, which depends in part on how likely they are perceived to be implemented in other software. This is a recursive, potentially unstable situation, which will eventually resolve itself in one of three ways:

(1) A "standard" subset of optional features develops and is approximately always implemented in readers. Special cases: (a) no optional features are implemented; (b) all optional features are implemented.

(2) A variety of "standard" subsets develop, dividing users into different communities. These communities can't always read each other's files without additional conversion software, but there is little impetus to write this software, because if there were, the developers would have included support for the missing options in the first place. The most obvious example of such communities would be those based on options relating to natural languages, if those communities do not care about accessibility of their files to non-users of their language and encoding.

(3) A truly chaotic situation develops, with no discernible resolution and a plethora of incompatible files and software.

Outcome 1 is the most desirable, as all files are now readable by all readers, meaning no additional negotiation is necessary, just as if we had mandated that set of optional features. Outcome 2 is less desirable, as more software needs to be written and the standard by itself is not necessarily enough information to read a given file. Outcome 3 is obviously pretty unwelcome, but unlikely, as it would require a lot of competing influences, which would eventually change and allow resolution into (1) or (2). Think HTML and Microsoft.

Now let us apply the above analysis to CIF: some are advocating not exhaustively listing or mandating the possible CIF2 encodings (CIF1 did not list or mandate encoding either), leading to a range of "optional features" as I have defined it above (where support for any given encoding is a single "optional feature"). For CIF1, we had a type 1 outcome (only ASCII encoding was supported and produced).

So: my understanding of the previous discussion is that, while we agree that it would be ideal if everyone used only UTF8, some perceive that the desire to use a different encoding will be sufficiently strong that mandating UTF8 will be ineffective and/or inconvenient. So, while I personally would advocate mandating UTF8, the other point of view would have us allowing non-UTF8 encoding but hoping that everyone will eventually move to UTF8.
In which case I would like to suggest that we use network effects to influence the recursive feedback loop experienced by programmers described above, so that the community settles on UTF8 in the same way as it has settled on ASCII for CIF1. That is, we "load the dice" so that other encodings are disfavoured. Here are some ways to "load the dice":

(1) Mandate UTF8 only.
(2) Make support for UTF8 mandatory in CIF processors.
(3) Force non-UTF8 files to jump through extra hoops (which I think is necessary anyway).
(4) Educate programmers on the drawbacks of non-UTF8 encodings and strongly urge them not to support reading non-UTF8 CIF files.
(5) Strongly recommend that the IUCr, wwPDB, and other centralised repositories reject non-UTF8-encoded CIF files.
(6) Make available hyperlinked information on system tools for dealing with UTF8 files on popular platforms, which could be used in error messages produced by programs (see (4)).

I would be interested in hearing comments on the acceptability of these options from the rest of the group (I think we know how we all feel about (1)!).

Now, returning to John's email: I will answer each of the points inline, at the same time attempting to get all the attributions correct.

(James) I had not fully appreciated that Scheme B is intended to be applied only at the moment of transfer or archiving, and envisions users normally saving files in their preferred encoding with no hash codes or encoding hints required (I will refer to the inclusion of such hints and hashes as 'decoration').

(John) "Envisions users normally [...]" is a bit stronger than my position or the intended orientation of Scheme B. "Accommodates" would be my choice of wording.

(James now) No problem with that wording; my point is that such undecorated files will be called CIF2 files and so are a target for CIF2 software developers, thus "unloading" the dice away from UTF8 and closer to encoding chaos.

(James) A direct result of allowing undecorated files to reside on disk is that CIF software producers will need to write software that will function with arbitrary encodings with no decoration to help them, as that is the form that users' files will most often be in.

(John) The standard can do no more to prevent users from storing undecorated CIFs than it can to prevent users from storing CIF text encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding. More generally, all the standard can do is define the characteristics of a conformant CIF -- it can never prevent CIF-like but non-conformant files from being created, used, exchanged, or archived as if they were conformant CIFs. Regardless of the standard's ultimate position on this issue, software authors will have to be guided by practical considerations and by the real-world requirements placed on their programs. In particular, they will have to decide whether to accept "CIF" input that in fact violates the standard in various ways, and / or they will have to decide which optional CIF behaviors they will support. As such, I don't see a significant distinction between the alternatives before us as regards the difficulty, complexity, or requirements of CIF2 software.

(James now) I have described the way the standard works to restrict encodings in the discussion at the top of this email. Briefly, CIF software developers develop programs that conform with the CIF2 standard. If that standard says 'UTF8', they program for UTF8. If you want to work in ISO-8859-15 etc., you have to do extra work.
Working in favour of such extra work would be a compelling use case, which I have yet to see (I note that the 'UTF8 only' standard posted to ccp4-bb and pdb-l produced no comments). My strong perception is that any need for other encodings is overwhelmed by the utility of settling on a single encoding, but that perception would need confirmation from a proper survey of non-ASCII users. So, no, we can't stop people saving CIF-like files in other encodings, but we can discourage it by creating significant barriers in terms of software availability. Similarly, we can't stop CIF1 users saving files in JIS X 0208, but that doesn't happen at any level that causes problems (if it happens at all, which I doubt).

(John) Furthermore, no formulation of CIF is inherently reliable or unreliable, because reliability (in this sense) is a characteristic of data transfer, not of data themselves. Scheme B targets the activities that require reliability assurance, and disregards those that don't. In a practical sense, this isn't any different from scheme A, because it is only when the encoding is potentially uncertain -- to wit, in the context of data transfer -- that either scheme need be applied (see also below). I suppose I would be willing to make scheme B a general requirement of the CIF format, but I don't see any advantage there over the current formulation. The actual behavior of people and the practical requirements on CIF software would not appreciably change.

(James now) I would suggest that Scheme B does not target all activities requiring reliability assurance, as it does not address the situation where people use a mix of CIF-aware software and text tools in a single encoding environment. The real, significant change that occurs when you accept Scheme B is that CIF files can now be in any encoding and undecorated. Programmers are then likely to provide programs that might or might not work with various encodings, and users justifiably feel that their undecorated files should be supported. The software barrier that was encouraging UTF8-only has been removed, and the problem of mismatched encodings that we have been trying to avoid becomes that much more likely to occur. Scheme B has very few teeth to enforce decoration at the point of transfer, as the software at either end is now probably happy with an undecorated file. Requiring decoration as a condition of being a CIF2 file means that software will tend to reject undecorated files, thereby limiting the damage that would be caused by open slather encoding.

(James) Furthermore, given the ease with which files can be transferred between users (email attachment, saved in a shared, network-mounted directory, drag and drop onto a USB stick, etc.) it is unlikely that Scheme B or anything involving extra effort would be applied unless the recipient demanded it.

(John) For hand-created or hand-edited CIFs, I agree. CIFs manipulated via a CIF2-compliant editor could be relied upon to conform to scheme B, however, provided that is standardized. But the same applies to scheme A, given that few operating environments default to UTF-8 for text.

(James now) That is my goal: that any CIF that passes through a CIF-compliant program must be decorated before input and output (if not UTF8). What hand-edited, hand-created CIFs actually have in the way of decoration doesn't bother me much, as these are very rare and of no use unless they can be read into a CIF program, at which point they should be rejected until properly decorated.
And I reiterate, the process of applying decoration can be done interactively to minimise the chances of incorrect assignment of encoding. (James) And given how many times that file might have changed hands across borders and operating systems within a single group collaboration, there would only be a qualified guarantee that the character to binary mapping has not been mangled en route, making any scheme applied subsequently rather pointless. (John) That also does not distinguish among the alternatives before us. I appreciate the desire for an absolute guarantee of reliability, but none is available. Qualified guarantees are the best we can achieve (and that's a technical assessment, not an aphorism). (James now) Oh, but I believe it does distinguish, because if CIF software reads only UTF8 (because that is what the standard says), then the file will tend to be in UTF8 at all points in time, with reduced possibilities for encoding errors. I think it highly likely that each group that handles a CIF will at some stage run it through CIF-aware software, which means encoding mistakes are likely to be caught much earlier. (James) We would thus go from a situation where we had a single, reliable and sometimes slightly inconvenient encoding (UTF8), to one where a CIF processor should be prepared for any given CIF file to be one of a wide range of encodings which need to be guessed. (John) Under scheme A or the present draft text, we have "a single, reliable [...] encoding" only in the sense that the standard *specifies* that that encoding be used. So far, however, I see little will to produce or use processors that are restricted to UTF-8, and I have every expectation that authors will continue to produce CIFs in various encodings regardless of the standard's ultimate stance. Yes, it might be nice if everyone and every system converged on UTF-8 for text encoding, but CIF2 cannot force that to happen, not even among crystallographers. (James now) You see little will to do this: but as far as I can tell, there is even less will not to do it. Authors will not "continue" to produce CIFs in various encodings, as they haven't started doing so yet. As I've said above, CIF2 can certainly, if not force, encourage UTF8 adoption. What's more, non-ASCII characters are only gradually going to find their way into CIF2 files, as the dictionaries and large scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters in names, and the users gradually adapt to this new way of doing things. I have no sense that CIF users will feel a strong desire to use non UTF8 schemes, when they have been happy in an ASCII-only regime up until now. But I'm curious: on what basis are you saying that there is little will to use processors that are restricted to UTF8? (John) In practice, then, we really have a situation where the practical / useful CIF2 processor must be prepared to handle a variety of encodings (details dependent on system requirements), which may need to be guessed, with no standard mechanism for helping the processor make that determination or for allowing it to check its guess. Scheme B improves that situation by standardizing a general reliability assurance mechanism, which otherwise would be missing. In view of the practical situation, I see no down side at all. A CIF processor working with scheme B is *more* able, not less. (James) I would much prefer a scheme which did not compromise reliability in such a significant way. 
(John) There is no such compromise, because in practice, we're not starting from a reliable position. (James now) I think your statement that our current position is not reliable arises out of a perception that users are likely to use a variety of encodings regardless of what the standard says. I think this danger is way overstated, but I'd like to see you expand on why you think there is such a likelihood of multiple encodings being used (James) My previous (somewhat clunky) attempts to adjust Scheme B were directed at trying to force any file with the CIF2.0 magic number to be either decorated or UTF-8, meaning that software has a reasonably high confidence in file integrity. An alternative way of thinking about this is that CIF files also act as the mechanism of information transfer between software programs. [... W]hen a separate program is asked to input that CIF, the information has been transferred, even if that software is running on the same computer. (John) So in that sense, one could argue that Scheme B already applies to all CIFs, its assertion to the contrary notwithstanding. Honestly, though, I don't think debating semantic details of terms such as "data transfer" is useful because in practice, and independent of scheme A, B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to choose what form of reliability assurance to accept or demand, if any. (James now) I was only debating semantic details in order to expose the fact that data transfer occurs between programs, not just between systems, and that therefore Scheme B should apply within a single system, so therefore, all CIF2 files should be decorated. As for who should be demanding reliability assurance, the receiver may not be in a position to demand some level of reliability if the file creator is not in direct contact. Again, we can build this reliability into the standard and save the extra negotiation or loss of information that is otherwise involved. (James) Now, moving on to the detailed contours of Scheme B and addressing the particular points that John and I have been discussing. My original criticisms are the ones preceded by numerals. [(James now) I've deleted those points where we have reached agreement. Those points are: (1) Restrict encodings to those for which the first line of a CIF file provides unambiguous encoding for ASCII codepoints (2) Put the hash value on the first line] (James a long time ago) (4) Assumption that all recipients will be able to handle all encodings (John) There is no such assumption. Rather, there is an acknowledgement that some systems may be unable to handle some CIFs. That is already the case with CIF1, and it is not completely resolved by standardizing on UTF-8 (i.e. scheme A). (James) There is no such thing as 'optional' for an information interchange standard. A file that conforms to the standard must be readable by parsers written according to the standard. If reading a standard-conformant file might fail or (worse) the file might be misinterpreted, information cannot always reliably be exchanged using this standard, so that optional behaviour needs to be either discarded, or made mandatory. There is thus no point in including optional behaviour in the standard. So: if the standard allows files to be written in encoding XYZ, then all readers should be able to read files written in encoding XYZ. 
I view the CIF1 stance of allowing any encoding as a mistake, but a benign one, as in the case of CIF1 ASCII was so entrenched that it was the defacto standard for the characters appearing in CIF1 files. In short, we have to specify a limited set of acceptable encodings. (John) As Herb astutely observed, those assertions reflect a fundamental source of our disagreement. I think we can all agree that a standard that permits conforming software to misinterpret conforming data is undesirable. Surely we can also agree that an information interchange standard does not serve its purpose if it does not support information being successfully interchanged. It does not follow, however, that the artifacts by which any two parties realize an information interchange must be interpretable by all other conceivable parties, nor does it follow that that would be a supremely advantageous characteristic if it were achievable. It also does not follow that recognizable failure of any particular attempt at interchange must at all costs be avoided, or that a data interchange standard must take no account of its usage context. (James now) This is where we must make a policy decision: is a CIF2 file to be a universally understandable file? I agree that excluding optional behaviour is not an absolute requirement, but I also consider that optional behaviour should not be introduced without solid justification, given the real cost in interoperability and portability of the standard. You refer to two parties who wish to exchange information: those parties are always free to agree on private enhancements to the CIF2 standard (or to create their very own protocol), if they are in contact. I do not see why this use case need concern us here. Herbert can say to John 'I'm emailing you a CIF2 file but encoded in UTF16'. John has his extremely excellent software which handles UTF16 and these two parties are happy. John mentions a 'usage context'. If the standard is to include some account of usage context, then that context has to be specified sufficiently for a CIF2 programmer to understand what aspects of that context to consider, and not left open to misinterpretation. Perhaps you could enlarge on what particular context should be included? (John) Optional and alternative behaviors are not fundamentally incompatible with a data interchange standard, as XML and HTML demonstrate. Or consider the extreme variability of CIF text content: whether a particular CIF is suitable for a particular purpose depends intimately on exactly which data are present in it, and even to some extent on which data names are used to present them, even though ALL are optional as far as the format is concerned. If I say 'This CIF is unsuitable for my present purpose because it does not contain _symmetry_space_group_name_H-M', that does not mean the CIF standard is broken. Yet, it is not qualitatively different for me to say 'This CIF is unsuitable because it is encoded in CCSID 500' despite CIF2 (hypothetically) permitting arbitrary encodings. (James now) The difference is quantitative and qualitative. Quantitative, because the number of CIF2 files that are unsuitable because of missing tags will always be less than or equal to the number of CIF2 files that are unsuitable because of a missing tag and unknown encoding. Thus, by reducing ambiguity at the lower levels of the standard, we improve the utility at the higher levels. 
The difference is also qualitative, in that (a) if we have tags with non-ASCII characters, they could conceivably be confused with other tags if the encoding is not correct and so you will have a situation where a file that is not suitable actually appears suitable, because the desired tag appears. Likewise, the value taken by a tag may be wrong. (James a long time ago) (iii) restrict possible encodings to internationally recognised ones with well-specified Unicode mappings. This addresses point (4) (John) I don't see the need for this, and to some extent I think it could be harmful. For example, if Herb sees a use for a scheme of this sort in conjunction with imgCIF (unknown at this point whether he does), then he might want to be able to specify an encoding specific to imgCIF, such as one that provides for multiple text segments, each with its own character encoding. To the extent that imgCIF is an international standard, perhaps that could still satisfy the restriction, but I don't think that was the intended meaning of "internationally recognised". (James now) Indeed. My intent with this specification was to ensure that third parties would be able to recover the encoding. If imgCIF is going to cause us to make such an open-ended specification, it is probably a sign that imgCIF needs to be addressed separately. For example, should we think about redefining it as a container format, with a CIF header and UTF16 body (but still part of the "Crystallographic Information Framework")? (John) As for "well-specified Unicode mappings", I think maybe I'm missing something. CIF text is already limited to Unicode characters, and any encoding that can serve for a particular piece of CIF text must map at least the characters actually present in the text. What encodings or scenarios would be excluded, then, by that aspect of this suggestion? (James) My intention was to make sure that not only the particular user who created the file knew this mapping, but that the mapping was publically available. Certainly only Unicode encodable code points will appear, but the recipient needs to be able to recover the mapping from the file bytes to Unicode without relying on e.g. files that will be supplied on request by someone whose email address no longer works. (John) This issue is relevant only to the parties among whom a particular CIF is exchanged. The standard would not particularly assist those parties by restricting the permitted encodings, because they can safely ignore such restrictions if they mutually agree to do so (whether statically or dynamically), and they (specifically, the CIF originator) must anyway comply with them if no such agreement is implicit or can be reached. (James) Again, any two parties in current contact can send each other files in whatever format and encoding they wish. My concern is that CIF software writers are not drawn into supporting obscure or adhoc encodings. (John) B) Scheme B does not use quite the same language as scheme A with respect to detectable encodings. As a result, it supports (without tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32 variants. This is intentional. (James) I am concerned that the vast majority of users based in English speaking countries (and many non English speaking countries) will be quite annoyed if they have to deal with UTF-16/32 CIF2 files that are no longer accessible to the simple ASCII-based tools and software that they are used to. 
Because of this, allowing undecorated UTF16/32 would be far more disruptive than forcing people to use UTF8 only. Thus my stipulation on maintaining compatibility with ASCII for undecorated files. (John) Supporting UTF-16/32 without tagging or hashing is not a key provision of scheme B, and I could live without it, but I don't think that would significantly change the likelihood of a user unexpectedly encountering undecorated UTF-16/32 CIFs. It would change only whether such files were technically CIF-conformant, which doesn't much matter to the user on the spot. In any case, it is not the lack of decoration that is the basic problem here. (James now) Yes, that is true. A decorated UTF16 file is just as unreadable as an undecorated one in ASCII tools. However, per my comments at the start of this email, I think an extra bit of hoop jumping for non UTF8 encoded files has the desirable property of encouraging UTF8 use. (John) C) Scheme B is not aimed at ensuring that every conceivable receiver be able to interpret every scheme-B-compliant CIF. Instead, it provides receivers the ability to *judge* whether they can interpret particular CIFs, and afterwards to *verify* that they have done so correctly. Ensuring that receivers can interpret CIFs is thus a responsibility of the sender / archive maintainer, possibly in cooperation with the receiver / retriever. (James) As I've said before, I don't see the paradigm of live negotiation between senders and receivers as very useful, as it fails to account for CIFs being passed between different software (via reading/writing to a file system), or CIFs where the creator is no longer around, or technically unsophisticated senders where, for example, the software has produced an undecorated CIF in some native encoding and the sender has absolutely no idea why the receiver (if they even have contact with the receiver!) can't read the file properly. I prefer to see the standard that we set as a substitute for live negotiation, so leaving things up to the users is in that sense an abrogation of our responsibility. (John) That scenario will undoubtedly occur occasionally regardless of the outcome of this discussion. If it is our responsibility to avoid it at all costs then we are doomed to fail in that regard. Software *will* under some circumstances produce undecorated, non-UTF-8 "CIFs" because that is sometimes convenient, efficient, and appropriate for the program's purpose. I think, though, those comments reflect a bit of a misconception. The overall purpose of CIF supporting multiple encodings would be to allow specific CIFs to be better adapted for specific purposes. Such purposes include, but are not limited to () exchanging data with general-purpose program(s) on the same system () exchanging data with crystallography program(s) on the same system () supporting performance or storage objectives of specific programs or systems () efficiently supporting problem or data domains in which Latin text is a minority of the content (e.g. imgCIF) () storing data in a personal archive () exchanging data with known third parties () publishing data to a general audience *Few, if any, of those uses would be likely to involve live negotiation.* That's why I assigned primary responsibility for selecting encodings to the entity providing the CIF. I probably should not even have mentioned cooperation of the receiver; I did so more because it is conceivable than because it is likely. (James now) OK, fair enough. 
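As an illustrative aside on the "judge, then verify" idea in Scheme B above, the short Python sketch below shows one way a receiver could check a guessed encoding against a digest recorded by the producer. The choice of SHA-256, and the decision to hash the UTF-8 re-encoding of the decoded text, are assumptions made for this example only; Scheme B as posted does not fix those details.

import hashlib

def text_digest(text):
    # Digest of the decoded Unicode text, independent of the byte encoding
    # actually used on disk or in transit.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify_decoding(raw_bytes, guessed_encoding, declared_digest):
    # Decode with the guessed encoding and confirm that the result matches
    # the digest carried in the file's decoration.
    try:
        text = raw_bytes.decode(guessed_encoding, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return False
    return text_digest(text) == declared_digest

A receiver whose guess is wrong will almost certainly fail the digest comparison rather than silently misread the file, which is the "verify" half of the scheme.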
My issues then with the paradigm of provider-based encoding selection are that it only works where the provider is capable of making this choice, and that it puts that responsibility on all providers, large and small. Of course, I am keen to construct a CIF ecology where providers always automatically choose UTF8 as the "safe" choice. (John) Under any scheme I can imagine, some CIFs will not be well suited to some purposes. I want to avoid the situation that *no* conformant CIF can be well suited to some reasonable purposes. I am willing to forgo the result that *every* conformant CIF is suited to certain other, also reasonable purposes. (James now) Fair enough. However, so far the only reasonable purpose that I can see for which a UTF8 file would not be suitable is exchanging data with general-purpose programs that do not cope with UTF8, and it may well be that with a bit of research the list of such programs would turn out to be rather short. -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 _______________________________________________ cif2-encoding mailing list cif2-encoding at iucr.org http://scripts.iucr.org/mailman/listinfo/cif2-encoding -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20100825/b6f460c1/attachment-0001.html From yaya at bernstein-plus-sons.com Thu Aug 26 00:57:44 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Wed, 25 Aug 2010 19:57:44 -0400 (EDT) Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. . In-Reply-To: <639601.73559.qm@web87008.mail.ird.yahoo.com> References: <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F7791 3624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com> Message-ID: While I disagree with these estimates of how various communities will react, the best way to find out is not for us to debate among ourselves, but to present the various ideas to the community in the form of a completed standard with supporting software and see if they accept it. In the case of core CIF, that community has accepted what they were offered. In the case of mmCIF, that community has essentially rejected what they were offered. So, after all these years of effort on CIF2, isn't it past time to finish something, put it out there and see if it flies? As for my own views: I remind you that XML is the end result of the essentially failed SGML effort followed by the highly successful HTML effort. XML saved the SGML effort by adopting a large part of the simplicity and flexibility of HTML. Please bear that in mind. ===================================================== Herbert J.
Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Wed, 25 Aug 2010, SIMON WESTRIP wrote: > Dear all > > Recent contributions have stimulated me to revisit some of the fundamental > issues of the possible changes in CIF2 with respect to CIF1, > in particular, the impact on current practice (as I perceive it, based on my > experience). The following is a summary of my thoughts, trying to > look at this from two perspectives (forgive me if I repeat earlier > opinions): > > 1) User perspective > > To date, in the 'core' CIF world (i.e. single-crystal and its extensions), > users treat CIFs as text files, and expect to be able to read them as such > using > plain-text editors, and indeed edit them if necessary for e.g. publication > purposes. Furthermore, they expect them to be readable by applications that > claim that > ability (e.g. graphics software). > > The situation is slightly different with mmCIF (and the pdb variants), where > users tend to treat these CIFs as data sources that can be read by > applications without > any need to examine the raw CIF themselves, let alone edit them. > > Although the above statements only encompass two user groups and are based > on my personal experience, I believe these groups are the largest when > talking about CIF users? > > So what is the impact on such users of introducing the use of non-ASCII text > and thus raising the text encoding issue? > > In the latter case, probably minimal, inasmuch as the users don't interact > directly with the raw CIF and rely on CIF processing software to manage the > data. > > In the former case, it is quite possible that a user will no longer be able > to edit the raw CIF using the same plain-text editor they have always used > for such purposes. > For example, if a user receives a CIF that has been encoded in UTF16 by some > remote CIF processing system, and opens it in a non-UTF16-aware plain-text > editor, > they will not be presented with what they would expect, even if the > character set in that particular CIF doesn't extend beyond ASCII; > furthermore, even 'advanced' text editors would struggle if the encoding > were e.g. UTF16BE (i.e. has no BOM). Granted, this example is equally > applicable to CIF1, but by 'opening up' multiple encodings, the probability > of their usage increases? > > So as soon as we move beyond ASCII, we have to accept that a large group of > CIF users will, at the very least, have to be aware that CIF is no longer > the 'text' format > that they once understood it to be? > > 2) Developer perspective > > I believe that developers presented with a documented standard will follow > that standard and prefer to work with no uncertainties, especially if they > are > unfamiliar with the format (perhaps just need to be able to read a CIF to > extract data relevant to their application/database...?) > > Taking the example of XML, in my experience developers seem to follow the > standard quite strictly. Most everyday applications that process XML are > intolerant of > violations of the standard. Fortunately, it is largely only developers that > work with raw XML, so the standard works well. > > In contrast to XML, with HTML/javascript the approach to the 'standard' is > far more tolerant. Though these languages are standardized, in order to > compete, the leading application > developers have had to adopt flexibility (e.g.
browsers accept 'dirty' HTML, > are remarkably forgiving of syntax violations in javascript, and alter the > standard to > achieve their own ends or facilitate user requirements). I suspect this > results largely from the evolution of the languages: just as in the early > days of CIF, encouragement of > use and the end results were more important than adherence to the documented > standard? > > Note that these same applications that are so tolerant of HTML/javascript > violations are far less forgiving of malformed XML. So is the lesson here > that developers expect > new standards to be unambiguous and will code accordingly (especially if the > new standard was partly designed to address the shortcomings of its > ancestors)? > > > Again, forgive me if these all sounds familiar - however, before arguing one > way or the other with regard to specifics, perhaps the wider group would > like to confirm or otherwise the main points I'm trying to assert, in > particular, with respect to *user* practice: > > 1) CIF2 will require users to change the way they view CIF - i.e. they may > be forced to use CIF2-compliant text editors/application software, and > abandon their current practice. > > With respect to developers, recent coverage has been very insightful, but > just out of interest, would I be wrong in stating that: > > 2) Developers, especially those that don't specialize in CIF, are likely to > want a clear-cut universal standard that does not require any heuristic > interpretatation. > > Cheers > > Simon > > > > ____________________________________________________________________________ > From: James Hester > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Tuesday, 24 August, 2010 4:38:27 > Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line > . .. .. .. .. .. .. .. .. .. .. .. .. .. . > > Thanks John for a detailed response. > > At the top of this email I will address this whole issue of optional > behaviour.? I was clearly too telegraphic in previous posts, as > Herbert thinks that optional whitespace counts as an optional feature, > so I go into some detail below. > > By "optional features" I mean those aspects of the standard that are > not mandatory for both readers and writers, and in addition I am not > concerned with features that do not relate directly to the information > transferred, e.g. optional warnings.? For example, unless "optional > whitespace" means that the *reader* may throw a syntax error when > whitespace is encountered at some particular point where whitespace is > optional, I do not view optional whitespace as an optional feature - > it is only optional for the writer.? With this definition of "optional > feature" it follows logically that, if a standard has such "optional > features", not all standard-conformant files will be readable by all > standard-conformant readers.? This is as true of HTML, XML and CIF1 as > it is of CIF2.? Whatever the relevance of HTML and XML to CIF, the > existence of successful standards with optional features proves only > that a standard can achieve widespread acceptance while having > optional features - whether these optional features are a help or a > hindrance would require some detailed analysis. > > So: any standard containing optional features requires the addition of > external information in order to resolve the choice of optional > features before successful information interchange can take place. > > Into this situation we place software developers.? 
These are the > people who play a big role in deciding which optional parts of the > standard are used, as they are the ones that write the software that > attempts to read and write the files.? Developers will typically > choose to support optional features based on how likely they are to be > used, which depends in part on how likely they are perceived to be > implemented in other software.? This is a recursive, potentially > unstable situation, which will eventually resolve itself in one of > three ways: > > (1) A "standard" subset of optional features develops and is > approximately always implemented in readers.? Special cases: > ? (a) No optional features are implemented > ? (b) All optional features are implemented > (2) A variety of "standard" subsets develop, dividing users into > different communities. These communities can't always read each > other's files without additional conversion software, but there is > little impetus to write this software, because if there were, the > developers would have included support for the missing options in the > first place.? The most obvious example of such communities would be > thosed based on options relating to natural languages, if those > communities do not care about accessibility of their files to > non-users of their language and encoding. > (3) A truly chaotic situation develops, with no discernable resolution > and a plethora of incompatible files and software. > > Outcome 1 is the most desirable, as all files are now readable by all > readers, meaning no additional negotiation is necessary, just as if we > had mandated that set of optional features.? Outcome 2 is less > desirable, as more software needs to be written and the standard by > itself is not necessarily enough information to read a given file. > Outcome 3 is obviously pretty unwelcome, but unlikely as it would > require a lot of competing influences, which would eventually change > and allow resolution into (1) or (2).? Think HTML and Microsoft. > > Now let us apply the above analysis to CIF: some are advocating not > exhaustively listing or mandating the possible CIF2 encodings (CIF1 > did not list or mandate encoding either), leading to a range of > "optional features" as I have defined it above (where support for any > given encoding is a single "optional feature").? For CIF1, we had a > type 1 outcome (only ASCII encoding was supported and produced). > > So: my understanding of the previous discussion is that, while we > agree that it would be ideal if everyone used only UTF8, some perceive > that the desire to use a different encoding will be sufficiently > strong that mandating UTF8 will be ineffective and/or inconvenient. > So, while I personally would advocate mandating UTF8, the other point > of view would have us allowing non UTF8 encoding but hoping that > everyone will eventually move to UTF8. > > In which case I would like to suggest that we use network effects to > influence the recursive feedback loop experienced by programmers > described above, so that the community settles on UTF8 in the same way > as it has settled on ASCII for CIF1.? That is, we "load the dice" so > that other encodings are disfavoured.? Here are some ways to "load the > dice": > > (1) Mandate UTF8 only. 
> (2) Make support for UTF8 mandatory in CIF processors > (3) Force non UTF8 files to jump through extra hoops (which I think is > necessary anyway) > (4) Educate programmers on the drawbacks of non UTF8 encodings and > strongly urge them not to support reading non UTF8 CIF files > (5) Strongly recommend that the IUCr, wwPDB, and other centralised > repositories reject non-UTF8-encoded CIF files > (6) Make available hyperlinked information on system tools for dealing > with UTF8 files on popular platforms, which could be used in error > messages produced by programs (see (4)) > > I would be interested in hearing comments on the acceptability of > these options from the rest of the group (I think we know how we all > feel about (1)!). > > Now, returning to John's email: I will answer each of the points > inline, at the same time attempting to get all the attributions > correct. > > (James) I had not fully appreciated that Scheme B is intended to be > applied only at the moment of transfer or archiving, and envisions > users normally saving files in their preferred encoding with no hash > codes or encoding hints required (I will call the inclusion of such > hints and hashes as 'decoration'). > > (John) "Envisions users normally [...]" is a bit stronger than my > position or the intended orientation of Scheme B. ?"Accommodates" > would be my choice of wording. > > (James now) No problem with that wording, my point is that such > undecorated files will be called CIF2 files and so are a target for > CIF2 software developers, thus "unloading" the dice away from UTF8 and > closer to encoding chaos. > > (James) ?A direct result of allowing undecorated files to reside on > disk is that CIF software producers will need to write software that > will function with arbitrary encodings with no decoration to help > them, as that is the form that users' files will be most often be in. > > (John) The standard can do no more to prevent users from storing > undecorated CIFs than it can to prevent users from storing CIF text > encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding. > More generally, all the standard can do is define the characteristics > of a conformant CIF -- it can never prevent CIF-like but > non-conformant files from being created, used, exchanged, or archived > as if they were conformant CIFs. ?Regardless of the standard's > ultimate position on this issue, software authors will have to be > guided by practical considerations and by the real-world requirements > placed on their programs. ?In particular, they will have to decide > whether to accept "CIF" input that in fact violates the standard in > various ways, and / or they will have to decide which optional CIF > behaviors they will support. ?As such, I don't see a significant > distinction between the alternatives before us as regards the > difficulty, complexity, or requirements of CIF2 software. > > (James now) I have described the way the standard works to restrict > encodings in the discussion at the top of this email.? Briefly, CIF > software developers develop programs that conform with the CIF2 > standard.? If that standard says 'UTF8', they program for UTF8.? If > you want to work in ISO-8859-15 etc, you have to do extra work. > > Working in favour of such extra work would be a compelling use case, > which I have yet to see (I note that the 'UTF8 only' standard posted > to ccp4-bb and pdb-l produced no comments).? 
My strong perception is > that any need for other encodings is overwhelmed by the utility of > settling on a single encoding, but that perception would need > confirmation from a proper survey of non-ASCII users. > > So, no we can't stop people saving CIF-like files in other encodings, > but we can discourage it by creating significant barriers in terms of > software availability.? Just like we can't stop CIF1 users saving > files in JIS X 0208, but that doesn't happen at any level that causes > problems (if it happens at all, which I doubt). > > (John) Furthermore, no formulation of CIF is inherently reliable or > unreliable, because reliability (in this sense) is a characteristic of > data transfer, not of data themselves. ?Scheme B targets the > activities that require reliability assurance, and disregards those > that don't. ?In a practical sense, this isn't any different from > scheme A, because it is only when the encoding is potentially > uncertain -- to wit, in the context of data transfer -- that either > scheme need be applied (see also below). ?I suppose I would be willing > to make scheme B a general requirement of the CIF format, but I don't > see any advantage there over the current formulation. ?The actual > behavior of people and the practical requirements on CIF software > would not appreciably change. > > (James now) I would suggest that Scheme B does not target all > activites requiring reliability assurance, as it does not address the > situation where people use a mix of CIF-aware software and text tools > in a single encoding environment. > > The real, significant change that occurs when you accept Scheme B is > that CIF files can now be in any encoding and undecorated. > Programmers are then likely to provide programs that might or might > not work with various encodings, and users feel justifiably that their > undecorated files should be supported.? The software barrier that was > encouraging UTF8-only has been removed, and the problem of mismatched > encodings that we have been trying to avoid becomes that much more > likely to occur.? Scheme B has very few teeth to enforce decoration at > the point of transfer, as the software at either end is now probably > happy with an undecorated file.? Requiring decoration as a condition > of being a CIF2 file means that software will tend to reject > undecorated files, thereby limiting the damage that would be caused by > open slather encoding. > > (James) ?Furthermore, given the ease with which files can be > transferred between users (email attachment, saved in shared, > network-mounted directory, drag and drop onto USB stick etc.) it is > unlikely that Scheme B or anything involving extra effort would be > applied unless the recipient demanded it. > > (John) For hand-created or hand-edited CIFs, I agree. ?CIFs > manipulated via a CIF2-compliant editor could be relied upon to > conform to scheme B, however, provided that is standardized. ?But the > same applies to scheme A, given that few operating environments > default to UTF-8 for text. > > (James now) That is my goal: that any CIF that passes through a > CIF-compliant program must be decorated before input and output (if > not UTF8).? What hand-edited, hand-created CIFs actually have in the > way of decoration doesn't bother me much, as these are very rare and > of no use unless they can be read into a CIF program, at which point > they should be rejected until properly decorated.? 
And I reiterate, > the process of applying decoration can be done interactively to > minimise the chances of incorrect assignment of encoding. > > (James) ?And given how many times that file might have changed hands > across borders and operating systems within a single group > collaboration, there would only be a qualified guarantee that the > character to binary mapping has not been mangled en route, making any > scheme applied subsequently rather pointless. > > (John) That also does not distinguish among the alternatives before > us. ?I appreciate the desire for an absolute guarantee of reliability, > but none is available. ?Qualified guarantees are the best we can > achieve (and that's a technical assessment, not an aphorism). > > (James now) Oh, but I believe it does distinguish, because if CIF > software reads only UTF8 (because that is what the standard says), > then the file will tend to be in UTF8 at all points in time, with > reduced possibilities for encoding errors.? I think it highly likely > that each group that handles a CIF will at some stage run it through > CIF-aware software, which means encoding mistakes are likely to be > caught much earlier. > > (James) We would thus go from a situation where we had a single, > reliable and sometimes slightly inconvenient encoding (UTF8), to one > where a CIF processor should be prepared for any given CIF file to be > one of a wide range of encodings which need to be guessed. > > (John) Under scheme A or the present draft text, we have "a single, > reliable [...] encoding" only in the sense that the standard > *specifies* that that encoding be used. ?So far, however, I see little > will to produce or use processors that are restricted to UTF-8, and I > have every expectation that authors will continue to produce CIFs in > various encodings regardless of the standard's ultimate stance. ?Yes, > it might be nice if everyone and every system converged on UTF-8 for > text encoding, but CIF2 cannot force that to happen, not even among > crystallographers. > > (James now) You see little will to do this: but as far as I can tell, > there is even less will not to do it.? Authors will not "continue" to > produce CIFs in various encodings, as they haven't started doing so > yet.? As I've said above, CIF2 can certainly, if not force, encourage > UTF8 adoption.? What's more, non-ASCII characters are only gradually > going to find their way into CIF2 files, as the dictionaries and large > scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters > in names, and the users gradually adapt to this new way of doing > things.? I have no sense that CIF users will feel a strong desire to > use non UTF8 schemes, when they have been happy in an ASCII-only > regime up until now.? But I'm curious: on what basis are you saying > that there is little will to use processors that are restricted to > UTF8? > > (John) In practice, then, we really have a situation where the > practical / useful CIF2 processor must be prepared to handle a variety > of encodings (details dependent on system requirements), which may > need to be guessed, with no standard mechanism for helping the > processor make that determination or for allowing it to check its > guess. ?Scheme B improves that situation by standardizing a general > reliability assurance mechanism, which otherwise would be missing. ?In > view of the practical situation, I see no down side at all. ?A CIF > processor working with scheme B is *more* able, not less. 
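As an illustration of the "guessing" John describes just above, a processor might combine a byte-order-mark check with a trial strict UTF-8 decode. This Python sketch is purely indicative; the function name, the fallback behaviour, and the decision to give up on anything that is neither BOM-marked nor valid UTF-8 are assumptions for the example, not requirements of any CIF2 draft or of Scheme A/B.

import codecs

_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # test 32-bit BOMs before the 16-bit ones
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def guess_encoding(raw_bytes):
    # Honour a byte-order mark if one is present.
    for bom, name in _BOMS:
        if raw_bytes.startswith(bom):
            return name
    # UTF-8 is largely self-validating: arbitrary non-UTF-8 byte sequences
    # rarely decode cleanly, so a successful strict decode is a strong hint.
    try:
        raw_bytes.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return None  # undecorated and not UTF-8: no safe guess is possible

The final branch is exactly the gap under discussion: an undecorated file in a legacy 8-bit encoding gives the reader nothing to check a guess against, which is where the hash-based decoration of Scheme B would come in.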
> > (James) I would much prefer a scheme which did not compromise > reliability in such a significant way. > > (John) There is no such compromise, because in practice, we're not > starting from a reliable position. > > (James now) I think your statement that our current position is not > reliable arises out of a perception that users are likely to use a > variety of encodings regardless of what the standard says.? I think > this danger is way overstated, but I'd like to see you expand on why > you think there is such a likelihood of multiple encodings being used > > (James) My previous (somewhat clunky) attempts to adjust Scheme B were > directed at trying to force any file with the CIF2.0 magic number to > be either decorated or UTF-8, meaning that software has a reasonably > high confidence in file integrity. > > An alternative way of thinking about this is that CIF files also act > as the mechanism of information transfer between software programs. > [... W]hen a separate program is asked to input that CIF, the > information has been transferred, even if that software is running on > the same computer. > > (John) So in that sense, one could argue that Scheme B already applies > to all CIFs, its assertion to the contrary notwithstanding. ?Honestly, > though, I don't think debating semantic details of terms such as "data > transfer" is useful because in practice, and independent of scheme A, > B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to > choose what form of reliability assurance to accept or demand, if any. > > (James now) I was only debating semantic details in order to expose > the fact that data transfer occurs between programs, not just between > systems, and that therefore Scheme B should apply within a single > system, so therefore, all CIF2 files should be decorated.? As for who > should be demanding reliability assurance, the receiver may not be in > a position to demand some level of reliability if the file creator is > not in direct contact.? Again, we can build this reliability into the > standard and save the extra negotiation or loss of information that is > otherwise involved. > > (James) Now, moving on to the detailed contours of Scheme B and > addressing the particular points that John and I have been discussing. > ?My original criticisms are the ones preceded by numerals. > > [(James now) I've deleted those points where we have reached > agreement.? Those points are: > (1) Restrict encodings to those for which the first line of a CIF file > provides unambiguous encoding for ASCII codepoints > (2) Put the hash value on the first line] > > (James a long time ago) (4) Assumption that all recipients will be > able to handle all encodings > > (John) There is no such assumption. ?Rather, there is an > acknowledgement that some systems may be unable to handle some CIFs. > That is already the case with CIF1, and it is not completely resolved > by standardizing on UTF-8 (i.e. scheme A). > > (James) There is no such thing as 'optional' for an information > interchange standard. ?A file that conforms to the standard must be > readable by parsers written according to the standard. If reading a > standard-conformant file might fail or (worse) the file might be > misinterpreted, information cannot always reliably be exchanged using > this standard, so that optional behaviour needs to be either > discarded, or made mandatory. There is thus no point in including > optional behaviour in the standard. 
So: if the standard allows files > to be written in encoding XYZ, then all readers should be able to read > files written in encoding XYZ. ?I view the CIF1 stance of allowing any > encoding as a mistake, but a benign one, as in the case of CIF1 ASCII > was so entrenched that it was the defacto standard for the characters > appearing in CIF1 files. ?In short, we have to specify a limited set > of acceptable encodings. > > (John) As Herb astutely observed, those assertions reflect a > fundamental source of our disagreement. ?I think we can all agree that > a standard that permits conforming software to misinterpret conforming > data is undesirable. > > Surely we can also agree that an information interchange standard does > not serve its purpose if it does not support information being > successfully interchanged. ?It does not follow, however, that the > artifacts by which any two parties realize an information interchange > must be interpretable by all other conceivable parties, nor does it > follow that that would be a supremely advantageous characteristic if > it were achievable. ?It also does not follow that recognizable failure > of any particular attempt at interchange must at all costs be avoided, > or that a data interchange standard must take no account of its usage > context. > > (James now) This is where we must make a policy decision: is a CIF2 > file to be a universally understandable file?? I agree that excluding > optional behaviour is not an absolute requirement, but I also consider > that optional behaviour should not be introduced without solid > justification, given the real cost in interoperability and portability > of the standard.? You refer to two parties who wish to exchange > information: those parties are always free to agree on private > enhancements to the CIF2 standard (or to create their very own > protocol), if they are in contact.? I do not see why this use case > need concern us here.? Herbert can say to John 'I'm emailing you a > CIF2 file but encoded in UTF16'.? John has his extremely excellent > software which handles UTF16 and these two parties are happy. > > John mentions a 'usage context'.? If the standard is to include some > account of usage context, then that context has to be specified > sufficiently for a CIF2 programmer to understand what aspects of that > context to consider, and not left open to misinterpretation.? Perhaps > you could enlarge on what particular context should be included? > > (John) Optional and alternative behaviors are not fundamentally > incompatible with a data interchange standard, as XML and HTML > demonstrate. ?Or consider the extreme variability of CIF text content: > whether a particular CIF is suitable for a particular purpose depends > intimately on exactly which data are present in it, and even to some > extent on which data names are used to present them, even though ALL > are optional as far as the format is concerned. ?If I say 'This CIF is > unsuitable for my present purpose because it does not contain > _symmetry_space_group_name_H-M', that does not mean the CIF standard > is broken. ?Yet, it is not qualitatively different for me to say 'This > CIF is unsuitable because it is encoded in CCSID 500' despite CIF2 > (hypothetically) permitting arbitrary encodings. > > (James now)? The difference is quantitative and qualitative. 
> Quantitative, because the number of CIF2 files that are unsuitable > because of missing tags will always be less than or equal to the > number of CIF2 files that are unsuitable because of a missing tag and > unknown encoding.? Thus, by reducing ambiguity at the lower levels of > the standard, we improve the utility at the higher levels.? The > difference is also qualitative, in that (a) if we have tags with > non-ASCII characters, they could conceivably be confused with other > tags if the encoding is not correct and so you will have a situation > where a file that is not suitable actually appears suitable, because > the desired tag appears. Likewise, the value taken by a tag may be > wrong. > > (James a long time ago) (iii) restrict possible encodings to > internationally recognised ones with well-specified Unicode mappings. > This addresses point (4) > > (John) I don't see the need for this, and to some extent I think it > could be harmful. ?For example, if Herb sees a use for a scheme of > this sort in conjunction with imgCIF (unknown at this point whether he > does), then he might want to be able to specify an encoding specific > to imgCIF, such as one that provides for multiple text segments, each > with its own character encoding. ?To the extent that imgCIF is an > international standard, perhaps that could still satisfy the > restriction, but I don't think that was the intended meaning of > "internationally recognised". > > (James now)? Indeed.? My intent with this specification was to ensure > that third parties would be able to recover the encoding. If imgCIF is > going to cause us to make such an open-ended specification, it is > probably a sign that imgCIF needs to be addressed separately.? For > example, should we think about redefining it as a container format, > with a CIF header and UTF16 body (but still part of the > "Crystallographic Information Framework")? > > (John) As for "well-specified Unicode mappings", I think maybe I'm > missing something. ?CIF text is already limited to Unicode characters, > and any encoding that can serve for a particular piece of CIF text > must map at least the characters actually present in the text. ?What > encodings or scenarios would be excluded, then, by that aspect of this > suggestion? > > (James) My intention was to make sure that not only the particular > user who created the file knew this mapping, but that the mapping was > publically available. ?Certainly only Unicode encodable code points > will appear, but the recipient needs to be able to recover the mapping > from the file bytes to Unicode without relying on e.g. files that will > be supplied on request by someone whose email address no longer works. > > (John) This issue is relevant only to the parties among whom a > particular CIF is exchanged. ?The standard would not particularly > assist those parties by restricting the permitted encodings, because > they can safely ignore such restrictions if they mutually agree to do > so (whether statically or dynamically), and they (specifically, the > CIF originator) must anyway comply with them if no such agreement is > implicit or can be reached. > > (James) Again, any two parties in current contact can send each other > files in whatever format and encoding they wish.? My concern is that > CIF software writers are not drawn into supporting obscure or adhoc > encodings. > > (John) B) Scheme B does not use quite the same language as scheme A > with respect to detectable encodings. 
?As a result, it supports > (without tagging or hashing) not just UTF-8, but also all UTF-16 and > UTF-32 variants. ?This is intentional. > > (James) I am concerned that the vast majority of users based in > English speaking countries (and many non English speaking countries) > will be quite annoyed if they have to deal with UTF-16/32 CIF2 files > that are no longer accessible to the simple ASCII-based tools and > software that they are used to. ?Because of this, allowing undecorated > UTF16/32 would be far more disruptive than forcing people to use UTF8 > only. Thus my stipulation on maintaining compatibility with ASCII for > undecorated files. > > (John) Supporting UTF-16/32 without tagging or hashing is not a key > provision of scheme B, and I could live without it, but I don't think > that would significantly change the likelihood of a user unexpectedly > encountering undecorated UTF-16/32 CIFs. ?It would change only whether > such files were technically CIF-conformant, which doesn't much matter > to the user on the spot. ?In any case, it is not the lack of > decoration that is the basic problem here. > > (James now)? Yes, that is true.? A decorated UTF16 file is just as > unreadable as an undecorated one in ASCII tools.? However, per my > comments at the start of this email, I think an extra bit of hoop > jumping for non UTF8 encoded files has the desirable property of > encouraging UTF8 use. > > (John) C) Scheme B is not aimed at ensuring that every conceivable > receiver be able to interpret every scheme-B-compliant CIF. ?Instead, > it provides receivers the ability to *judge* whether they can > interpret particular CIFs, and afterwards to *verify* that they have > done so correctly. ?Ensuring that receivers can interpret CIFs is thus > a responsibility of the sender / archive maintainer, possibly in > cooperation with the receiver / retriever. > > (James) As I've said before, I don't see the paradigm of live > negotiation between senders and receivers as very useful, as it fails > to account for CIFs being passed between different software (via > reading/writing to a file system), or CIFs where the creator is no > longer around, or technically unsophisticated senders where, for > example, the software has produced an undecorated CIF in some native > encoding and the sender has absolutely no idea why the receiver (if > they even have contact with the receiver!) can't read the file > properly. ? I prefer to see the standard that we set as a substitute > for live negotiation, so leaving things up to the users is in that > sense an abrogation of our responsibility. > > (John) That scenario will undoubtedly occur occasionally regardless of > the outcome of this discussion. ?If it is our responsibility to avoid > it at all costs then we are doomed to fail in that regard. ?Software > *will* under some circumstances produce undecorated, non-UTF-8 "CIFs" > because that is sometimes convenient, efficient, and appropriate for > the program's purpose. > > I think, though, those comments reflect a bit of a misconception. ?The > overall purpose of CIF supporting multiple encodings would be to allow > specific CIFs to be better adapted for specific purposes. 
?Such > purposes include, but are not limited to > > () exchanging data with general-purpose program(s) on the same system > () exchanging data with crystallography program(s) on the same system > () supporting performance or storage objectives of specific programs or > systems > () efficiently supporting problem or data domains in which Latin text > is a minority of the content (e.g. imgCIF) > () storing data in a personal archive > () exchanging data with known third parties > () publishing data to a general audience > > *Few, if any, of those uses would be likely to involve live > negotiation.* ?That's why I assigned primary responsibility for > selecting encodings to the entity providing the CIF. ?I probably > should not even have mentioned cooperation of the receiver; I did so > more because it is conceivable than because it is likely. > > (James now) OK, fair enough. My issues then with the paradigm of > provider-based encoding selection is that it only works where the > provider is capable of making this choice, and it puts that > responsibility on all providers, large and small.? Of course, I am > keen to construct a CIF ecology where providers always automatically > choose UTF8 as the "safe" choice. > > (John) Under any scheme I can imagine, some CIFs will not be well > suited to some purposes. ?I want to avoid the situation that *no* > conformant CIF can be well suited to some reasonable purposes. ?I am > willing to forgo the result that *every* conformant CIF is suited to > certain other, also reasonable purposes. > > (James now) Fair enough.? However, so far the only reasonable purpose > that I can see for which a UTF8 file would not be suitable is > exchanging data with general-purpose programs that do not cope with > UTF8, and it may well be that with a bit of research the list of such > programs would turn out to be rather short. > > > > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > From simonwestrip at btinternet.com Thu Aug 26 01:16:40 2010 From: simonwestrip at btinternet.com (SIMON WESTRIP) Date: Wed, 25 Aug 2010 17:16:40 -0700 (PDT) Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. . In-Reply-To: References: <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F7791 3624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com> Message-ID: <902931.65953.qm@web87003.mail.ird.yahoo.com> "to present the various ideas to the community in the form of a completed standard with supporting software and see if they accept it" I tend to agree - the stumbling block is the "completed standard" (at least w.r.t. encoding?) :-) ________________________________ From: Herbert J. Bernstein To: Group for discussing encoding and content validation schemes for CIF2 Sent: Thursday, 26 August, 2010 0:57:44 Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. . 
While I disagree with these estimates of how various communities will react, the best way to find out is not for us to debate among ourselves, but to present the various ideas to the community in the form of a completed standard with supporting software and see if they accept it. In the case of core CIF, that community has accepted what they were offered. In the case of mmCIF, that community has essentially rejected what they were offered. So, after all these years of effort on CIF2, isn't it past time to finish something, put it out there and see if it flies. As for my own views: I remind you that XML is the end result of the essentially failed SGML effort followed by the highly successful HTML effort. XML saved the SGML effort by adopting a large part of the simplicity and flexibility of HTML. Please bear that in mind. ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Wed, 25 Aug 2010, SIMON WESTRIP wrote: > Dear all > > Recent contributions have stimulated me to revisit some of the fundamental > issues of the possible changes in CIF2 with respect to CIF1, > in particular, the impact on current practice (as I perceive it, based on my > experience). The following is a summary of my thoughts, trying to > look at this from two perspectives (forgive me if I repeat earlier > opinions): > > 1) User perspective > > To date, in the 'core' CIF world (i.e. single-crystal and its extensions), > users treat CIFs as text files, and expect to be able to read them as such > using > plain-text editors, and indeed edit them if necessary for e.g. publication > purposes. Furthermore, they expect them to be readable by applications that > claim that > ability (e.g. graphics software). > > The situation is slghtly different with mmCIF (and the pdb variants), where > users tend to treat these CIFs as data sources that can be read by > applications without > any need to examine the raw CIF themselves, let alone edit them. > > Although the above statements only encompass two user groups and are based > on my personal experience, I believe these groups are the largest when > talking about CIF users? > > So what is the impact on such users of introducing the use of non-ASCII text > and thus raising the text encoding issue? > > In the latter case, probably minimal, inasmuch as the users dont interact > directly with the raw CIF and rely on CIF processing software to manage the > data. > > In the former case, it is quite possible that a user will no longer be able > to edit the raw CIF using the same plain-text editor they have always used > for such purposes. > For example, if a user receives a CIF that has been encoded in UTF16 by some > remote CIF processing system, and opens it in a non-UTF16-aware plain-text > editor, > they will not be presented with what they would expect, even if the > character set in that particular CIF doesnt extend beyond ASCII; > furthermore, even 'advanced' test editors would struggle if the encoding > were e.g. UTF16BE (i.e. has no BOM). Granted, this example is equally > applicable to CIF1, but by 'opening up' multiple encodings, the probability > of their usage increases? 
> > So as soon as we move beyond ASCII, we have to accept that a large group of > CIF users will, at the very least, have to be aware that CIF is no longer > the 'text' format > that they once understood it to be? > > 2) Developer perspective > > I beleive that developers presented with a documented standard will follow > that standard and prefer to work with no uncertainties, especially if they > are > unfamiliar with the format (perhaps just need to be able to read a CIF to > extract data relevant to their application/database...?) > > Taking the example of XML, in my experience developers seem to follow the > standard quite strictly. Most everyday applications that process XML are > intolerant of > violations of the standard. Fortunately, it is largely only developers that > work with raw XML, so the standard works well. > > In contrast to XML, with HTML/javascript the approach to the 'standard' is > far more tolerant. Though these languages are standardized, in order to > compete, the leading application > developers have had to adopt flexibility (e.g. browsers accept 'dirty' HTML, > are remarkably forgiving of syntax violations in javascript, and alter the > standard to > achieve their own ends or facilitate user requirements). I suspect this > results largely from the evolution of the languages: just as in the early > days of CIF, encouragement of > use and the end results were more important than adherence to the documented > standard? > > Note that these same applications that are so tolerant of HTML/javascript > violations are far less forgiving of malformed XML. So is the lesson here > that developers expect > new standards to be unambiguous and will code accordingly (especially if the > new standard was partly designed to address the shortcomings of its > ancestors)? > > > Again, forgive me if these all sounds familiar - however, before arguing one > way or the other with regard to specifics, perhaps the wider group would > like to confirm or otherwise the main points I'm trying to assert, in > particular, with respect to *user* practice: > > 1) CIF2 will require users to change the way they view CIF - i.e. they may > be forced to use CIF2-compliant text editors/application software, and > abandon their current practice. > > With respect to developers, recent coverage has been very insightful, but > just out of interest, would I be wrong in stating that: > > 2) Developers, especially those that don't specialize in CIF, are likely to > want a clear-cut universal standard that does not require any heuristic > interpretatation. > > Cheers > > Simon > > > > ____________________________________________________________________________ > From: James Hester > To: Group for discussing encoding and content validation schemes for CIF2 > > Sent: Tuesday, 24 August, 2010 4:38:27 > Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line > . .. .. .. .. .. .. .. .. .. .. .. .. .. . > > Thanks John for a detailed response. > > At the top of this email I will address this whole issue of optional > behaviour. I was clearly too telegraphic in previous posts, as > Herbert thinks that optional whitespace counts as an optional feature, > so I go into some detail below. > > By "optional features" I mean those aspects of the standard that are > not mandatory for both readers and writers, and in addition I am not > concerned with features that do not relate directly to the information > transferred, e.g. optional warnings. 
For example, unless "optional > whitespace" means that the *reader* may throw a syntax error when > whitespace is encountered at some particular point where whitespace is > optional, I do not view optional whitespace as an optional feature - > it is only optional for the writer. With this definition of "optional > feature" it follows logically that, if a standard has such "optional > features", not all standard-conformant files will be readable by all > standard-conformant readers. This is as true of HTML, XML and CIF1 as > it is of CIF2. Whatever the relevance of HTML and XML to CIF, the > existence of successful standards with optional features proves only > that a standard can achieve widespread acceptance while having > optional features - whether these optional features are a help or a > hindrance would require some detailed analysis. > > So: any standard containing optional features requires the addition of > external information in order to resolve the choice of optional > features before successful information interchange can take place. > > Into this situation we place software developers. These are the > people who play a big role in deciding which optional parts of the > standard are used, as they are the ones that write the software that > attempts to read and write the files. Developers will typically > choose to support optional features based on how likely they are to be > used, which depends in part on how likely they are perceived to be > implemented in other software. This is a recursive, potentially > unstable situation, which will eventually resolve itself in one of > three ways: > > (1) A "standard" subset of optional features develops and is > approximately always implemented in readers. Special cases: > (a) No optional features are implemented > (b) All optional features are implemented > (2) A variety of "standard" subsets develop, dividing users into > different communities. These communities can't always read each > other's files without additional conversion software, but there is > little impetus to write this software, because if there were, the > developers would have included support for the missing options in the > first place. The most obvious example of such communities would be > thosed based on options relating to natural languages, if those > communities do not care about accessibility of their files to > non-users of their language and encoding. > (3) A truly chaotic situation develops, with no discernable resolution > and a plethora of incompatible files and software. > > Outcome 1 is the most desirable, as all files are now readable by all > readers, meaning no additional negotiation is necessary, just as if we > had mandated that set of optional features. Outcome 2 is less > desirable, as more software needs to be written and the standard by > itself is not necessarily enough information to read a given file. > Outcome 3 is obviously pretty unwelcome, but unlikely as it would > require a lot of competing influences, which would eventually change > and allow resolution into (1) or (2). Think HTML and Microsoft. > > Now let us apply the above analysis to CIF: some are advocating not > exhaustively listing or mandating the possible CIF2 encodings (CIF1 > did not list or mandate encoding either), leading to a range of > "optional features" as I have defined it above (where support for any > given encoding is a single "optional feature"). For CIF1, we had a > type 1 outcome (only ASCII encoding was supported and produced). 
> > So: my understanding of the previous discussion is that, while we > agree that it would be ideal if everyone used only UTF8, some perceive > that the desire to use a different encoding will be sufficiently > strong that mandating UTF8 will be ineffective and/or inconvenient. > So, while I personally would advocate mandating UTF8, the other point > of view would have us allowing non UTF8 encoding but hoping that > everyone will eventually move to UTF8. > > In which case I would like to suggest that we use network effects to > influence the recursive feedback loop experienced by programmers > described above, so that the community settles on UTF8 in the same way > as it has settled on ASCII for CIF1. That is, we "load the dice" so > that other encodings are disfavoured. Here are some ways to "load the > dice": > > (1) Mandate UTF8 only. > (2) Make support for UTF8 mandatory in CIF processors > (3) Force non UTF8 files to jump through extra hoops (which I think is > necessary anyway) > (4) Educate programmers on the drawbacks of non UTF8 encodings and > strongly urge them not to support reading non UTF8 CIF files > (5) Strongly recommend that the IUCr, wwPDB, and other centralised > repositories reject non-UTF8-encoded CIF files > (6) Make available hyperlinked information on system tools for dealing > with UTF8 files on popular platforms, which could be used in error > messages produced by programs (see (4)) > > I would be interested in hearing comments on the acceptability of > these options from the rest of the group (I think we know how we all > feel about (1)!). > > Now, returning to John's email: I will answer each of the points > inline, at the same time attempting to get all the attributions > correct. > > (James) I had not fully appreciated that Scheme B is intended to be > applied only at the moment of transfer or archiving, and envisions > users normally saving files in their preferred encoding with no hash > codes or encoding hints required (I will call the inclusion of such > hints and hashes as 'decoration'). > > (John) "Envisions users normally [...]" is a bit stronger than my > position or the intended orientation of Scheme B. "Accommodates" > would be my choice of wording. > > (James now) No problem with that wording, my point is that such > undecorated files will be called CIF2 files and so are a target for > CIF2 software developers, thus "unloading" the dice away from UTF8 and > closer to encoding chaos. > > (James) A direct result of allowing undecorated files to reside on > disk is that CIF software producers will need to write software that > will function with arbitrary encodings with no decoration to help > them, as that is the form that users' files will be most often be in. > > (John) The standard can do no more to prevent users from storing > undecorated CIFs than it can to prevent users from storing CIF text > encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding. > More generally, all the standard can do is define the characteristics > of a conformant CIF -- it can never prevent CIF-like but > non-conformant files from being created, used, exchanged, or archived > as if they were conformant CIFs. Regardless of the standard's > ultimate position on this issue, software authors will have to be > guided by practical considerations and by the real-world requirements > placed on their programs. 
In particular, they will have to decide whether to accept "CIF" input that in fact violates the standard in various ways, and / or they will have to decide which optional CIF behaviors they will support. As such, I don't see a significant distinction between the alternatives before us as regards the difficulty, complexity, or requirements of CIF2 software.
> (James now) I have described the way the standard works to restrict encodings in the discussion at the top of this email. Briefly, CIF software developers develop programs that conform with the CIF2 standard. If that standard says 'UTF8', they program for UTF8. If you want to work in ISO-8859-15 etc, you have to do extra work.
> Working in favour of such extra work would be a compelling use case, which I have yet to see (I note that the 'UTF8 only' standard posted to ccp4-bb and pdb-l produced no comments). My strong perception is that any need for other encodings is overwhelmed by the utility of settling on a single encoding, but that perception would need confirmation from a proper survey of non-ASCII users.
> So, no, we can't stop people saving CIF-like files in other encodings, but we can discourage it by creating significant barriers in terms of software availability. Just like we can't stop CIF1 users saving files in JIS X 0208, but that doesn't happen at any level that causes problems (if it happens at all, which I doubt).
> (John) Furthermore, no formulation of CIF is inherently reliable or unreliable, because reliability (in this sense) is a characteristic of data transfer, not of data themselves. Scheme B targets the activities that require reliability assurance, and disregards those that don't. In a practical sense, this isn't any different from scheme A, because it is only when the encoding is potentially uncertain -- to wit, in the context of data transfer -- that either scheme need be applied (see also below). I suppose I would be willing to make scheme B a general requirement of the CIF format, but I don't see any advantage there over the current formulation. The actual behavior of people and the practical requirements on CIF software would not appreciably change.
> (James now) I would suggest that Scheme B does not target all activities requiring reliability assurance, as it does not address the situation where people use a mix of CIF-aware software and text tools in a single encoding environment.
> The real, significant change that occurs when you accept Scheme B is that CIF files can now be in any encoding and undecorated. Programmers are then likely to provide programs that might or might not work with various encodings, and users justifiably feel that their undecorated files should be supported. The software barrier that was encouraging UTF8-only has been removed, and the problem of mismatched encodings that we have been trying to avoid becomes that much more likely to occur. Scheme B has very few teeth to enforce decoration at the point of transfer, as the software at either end is now probably happy with an undecorated file. Requiring decoration as a condition of being a CIF2 file means that software will tend to reject undecorated files, thereby limiting the damage that would be caused by open slather encoding.
> (James) Furthermore, given the ease with which files can be transferred between users (email attachment, saved in shared, network-mounted directory, drag and drop onto USB stick etc.)
it is > unlikely that Scheme B or anything involving extra effort would be > applied unless the recipient demanded it. > > (John) For hand-created or hand-edited CIFs, I agree. CIFs > manipulated via a CIF2-compliant editor could be relied upon to > conform to scheme B, however, provided that is standardized. But the > same applies to scheme A, given that few operating environments > default to UTF-8 for text. > > (James now) That is my goal: that any CIF that passes through a > CIF-compliant program must be decorated before input and output (if > not UTF8). What hand-edited, hand-created CIFs actually have in the > way of decoration doesn't bother me much, as these are very rare and > of no use unless they can be read into a CIF program, at which point > they should be rejected until properly decorated. And I reiterate, > the process of applying decoration can be done interactively to > minimise the chances of incorrect assignment of encoding. > > (James) And given how many times that file might have changed hands > across borders and operating systems within a single group > collaboration, there would only be a qualified guarantee that the > character to binary mapping has not been mangled en route, making any > scheme applied subsequently rather pointless. > > (John) That also does not distinguish among the alternatives before > us. I appreciate the desire for an absolute guarantee of reliability, > but none is available. Qualified guarantees are the best we can > achieve (and that's a technical assessment, not an aphorism). > > (James now) Oh, but I believe it does distinguish, because if CIF > software reads only UTF8 (because that is what the standard says), > then the file will tend to be in UTF8 at all points in time, with > reduced possibilities for encoding errors. I think it highly likely > that each group that handles a CIF will at some stage run it through > CIF-aware software, which means encoding mistakes are likely to be > caught much earlier. > > (James) We would thus go from a situation where we had a single, > reliable and sometimes slightly inconvenient encoding (UTF8), to one > where a CIF processor should be prepared for any given CIF file to be > one of a wide range of encodings which need to be guessed. > > (John) Under scheme A or the present draft text, we have "a single, > reliable [...] encoding" only in the sense that the standard > *specifies* that that encoding be used. So far, however, I see little > will to produce or use processors that are restricted to UTF-8, and I > have every expectation that authors will continue to produce CIFs in > various encodings regardless of the standard's ultimate stance. Yes, > it might be nice if everyone and every system converged on UTF-8 for > text encoding, but CIF2 cannot force that to happen, not even among > crystallographers. > > (James now) You see little will to do this: but as far as I can tell, > there is even less will not to do it. Authors will not "continue" to > produce CIFs in various encodings, as they haven't started doing so > yet. As I've said above, CIF2 can certainly, if not force, encourage > UTF8 adoption. What's more, non-ASCII characters are only gradually > going to find their way into CIF2 files, as the dictionaries and large > scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters > in names, and the users gradually adapt to this new way of doing > things. 
I have no sense that CIF users will feel a strong desire to use non UTF8 schemes, when they have been happy in an ASCII-only regime up until now. But I'm curious: on what basis are you saying that there is little will to use processors that are restricted to UTF8?
> (John) In practice, then, we really have a situation where the practical / useful CIF2 processor must be prepared to handle a variety of encodings (details dependent on system requirements), which may need to be guessed, with no standard mechanism for helping the processor make that determination or for allowing it to check its guess. Scheme B improves that situation by standardizing a general reliability assurance mechanism, which otherwise would be missing. In view of the practical situation, I see no down side at all. A CIF processor working with scheme B is *more* able, not less.
> (James) I would much prefer a scheme which did not compromise reliability in such a significant way.
> (John) There is no such compromise, because in practice, we're not starting from a reliable position.
> (James now) I think your statement that our current position is not reliable arises out of a perception that users are likely to use a variety of encodings regardless of what the standard says. I think this danger is way overstated, but I'd like to see you expand on why you think there is such a likelihood of multiple encodings being used.
> (James) My previous (somewhat clunky) attempts to adjust Scheme B were directed at trying to force any file with the CIF2.0 magic number to be either decorated or UTF-8, meaning that software has a reasonably high confidence in file integrity.
> An alternative way of thinking about this is that CIF files also act as the mechanism of information transfer between software programs. [... W]hen a separate program is asked to input that CIF, the information has been transferred, even if that software is running on the same computer.
> (John) So in that sense, one could argue that Scheme B already applies to all CIFs, its assertion to the contrary notwithstanding. Honestly, though, I don't think debating semantic details of terms such as "data transfer" is useful because in practice, and independent of scheme A, B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to choose what form of reliability assurance to accept or demand, if any.
> (James now) I was only debating semantic details in order to expose the fact that data transfer occurs between programs, not just between systems, and that therefore Scheme B should apply within a single system, and therefore all CIF2 files should be decorated. As for who should be demanding reliability assurance, the receiver may not be in a position to demand some level of reliability if the file creator is not in direct contact. Again, we can build this reliability into the standard and save the extra negotiation or loss of information that is otherwise involved.
> (James) Now, moving on to the detailed contours of Scheme B and addressing the particular points that John and I have been discussing. My original criticisms are the ones preceded by numerals.
> [(James now) I've deleted those points where we have reached agreement.
Those points are:
> (1) Restrict encodings to those for which the first line of a CIF file provides unambiguous encoding for ASCII codepoints
> (2) Put the hash value on the first line]
> (James a long time ago) (4) Assumption that all recipients will be able to handle all encodings
> (John) There is no such assumption. Rather, there is an acknowledgement that some systems may be unable to handle some CIFs. That is already the case with CIF1, and it is not completely resolved by standardizing on UTF-8 (i.e. scheme A).
> (James) There is no such thing as 'optional' for an information interchange standard. A file that conforms to the standard must be readable by parsers written according to the standard. If reading a standard-conformant file might fail or (worse) the file might be misinterpreted, information cannot always reliably be exchanged using this standard, so that optional behaviour needs to be either discarded, or made mandatory. There is thus no point in including optional behaviour in the standard. So: if the standard allows files to be written in encoding XYZ, then all readers should be able to read files written in encoding XYZ. I view the CIF1 stance of allowing any encoding as a mistake, but a benign one, as in the case of CIF1, ASCII was so entrenched that it was the de facto standard for the characters appearing in CIF1 files. In short, we have to specify a limited set of acceptable encodings.
> (John) As Herb astutely observed, those assertions reflect a fundamental source of our disagreement. I think we can all agree that a standard that permits conforming software to misinterpret conforming data is undesirable.
> Surely we can also agree that an information interchange standard does not serve its purpose if it does not support information being successfully interchanged. It does not follow, however, that the artifacts by which any two parties realize an information interchange must be interpretable by all other conceivable parties, nor does it follow that that would be a supremely advantageous characteristic if it were achievable. It also does not follow that recognizable failure of any particular attempt at interchange must at all costs be avoided, or that a data interchange standard must take no account of its usage context.
> (James now) This is where we must make a policy decision: is a CIF2 file to be a universally understandable file? I agree that excluding optional behaviour is not an absolute requirement, but I also consider that optional behaviour should not be introduced without solid justification, given the real cost in interoperability and portability of the standard. You refer to two parties who wish to exchange information: those parties are always free to agree on private enhancements to the CIF2 standard (or to create their very own protocol), if they are in contact. I do not see why this use case need concern us here. Herbert can say to John 'I'm emailing you a CIF2 file but encoded in UTF16'. John has his extremely excellent software which handles UTF16 and these two parties are happy.
> John mentions a 'usage context'. If the standard is to include some account of usage context, then that context has to be specified sufficiently for a CIF2 programmer to understand what aspects of that context to consider, and not left open to misinterpretation. Perhaps you could enlarge on what particular context should be included?
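[Editorial aside: the two agreed points quoted above can be illustrated concretely. The sketch below assumes Python, and assumes that a CIF2 file begins with an ASCII magic line (the "#\#CIF_2.0" identifier referred to elsewhere in the thread as the CIF2.0 magic number); the convention for the hash on the first line is not settled here, so only the encoding sniffing is shown.]

    import codecs

    def sniff_encoding(first_bytes):
        """Guess the encoding family from the leading bytes of a CIF2 file.

        This works because the first line is restricted to ASCII codepoints,
        so the byte pattern of the leading '#' is unambiguous for each UTF family.
        """
        boms = [(codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF32_BE, "utf-32-be"),
                (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF16_BE, "utf-16-be"),
                (codecs.BOM_UTF8, "utf-8")]
        for bom, name in boms:                     # explicit byte-order marks first
            if first_bytes.startswith(bom):
                return name
        if first_bytes.startswith(b"#\x00\x00\x00"):   # '#' of the magic line, no BOM
            return "utf-32-le"
        if first_bytes.startswith(b"\x00\x00\x00#"):
            return "utf-32-be"
        if first_bytes.startswith(b"#\x00"):
            return "utf-16-le"
        if first_bytes.startswith(b"\x00#"):
            return "utf-16-be"
        return "utf-8"                             # ASCII-compatible default; anything else needs decoration

Once the first line can be decoded this way, a hash value placed on it (point (2)) could then be checked against the decoded content of the rest of the file.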
> (John) Optional and alternative behaviors are not fundamentally incompatible with a data interchange standard, as XML and HTML demonstrate. Or consider the extreme variability of CIF text content: whether a particular CIF is suitable for a particular purpose depends intimately on exactly which data are present in it, and even to some extent on which data names are used to present them, even though ALL are optional as far as the format is concerned. If I say 'This CIF is unsuitable for my present purpose because it does not contain _symmetry_space_group_name_H-M', that does not mean the CIF standard is broken. Yet, it is not qualitatively different for me to say 'This CIF is unsuitable because it is encoded in CCSID 500' despite CIF2 (hypothetically) permitting arbitrary encodings.
> (James now) The difference is quantitative and qualitative. Quantitative, because the number of CIF2 files that are unsuitable because of missing tags will always be less than or equal to the number of CIF2 files that are unsuitable because of a missing tag and unknown encoding. Thus, by reducing ambiguity at the lower levels of the standard, we improve the utility at the higher levels. The difference is also qualitative, in that if we have tags with non-ASCII characters, they could conceivably be confused with other tags if the encoding is not correct, and so you will have a situation where a file that is not suitable actually appears suitable, because the desired tag appears. Likewise, the value taken by a tag may be wrong.
> (James a long time ago) (iii) restrict possible encodings to internationally recognised ones with well-specified Unicode mappings. This addresses point (4)
> (John) I don't see the need for this, and to some extent I think it could be harmful. For example, if Herb sees a use for a scheme of this sort in conjunction with imgCIF (unknown at this point whether he does), then he might want to be able to specify an encoding specific to imgCIF, such as one that provides for multiple text segments, each with its own character encoding. To the extent that imgCIF is an international standard, perhaps that could still satisfy the restriction, but I don't think that was the intended meaning of "internationally recognised".
> (James now) Indeed. My intent with this specification was to ensure that third parties would be able to recover the encoding. If imgCIF is going to cause us to make such an open-ended specification, it is probably a sign that imgCIF needs to be addressed separately. For example, should we think about redefining it as a container format, with a CIF header and UTF16 body (but still part of the "Crystallographic Information Framework")?
> (John) As for "well-specified Unicode mappings", I think maybe I'm missing something. CIF text is already limited to Unicode characters, and any encoding that can serve for a particular piece of CIF text must map at least the characters actually present in the text. What encodings or scenarios would be excluded, then, by that aspect of this suggestion?
> (James) My intention was to make sure that not only the particular user who created the file knew this mapping, but that the mapping was publicly available. Certainly only Unicode-encodable code points will appear, but the recipient needs to be able to recover the mapping from the file bytes to Unicode without relying on e.g.
files that will be supplied on request by someone whose email address no longer works.
> (John) This issue is relevant only to the parties among whom a particular CIF is exchanged. The standard would not particularly assist those parties by restricting the permitted encodings, because they can safely ignore such restrictions if they mutually agree to do so (whether statically or dynamically), and they (specifically, the CIF originator) must anyway comply with them if no such agreement is implicit or can be reached.
> (James) Again, any two parties in current contact can send each other files in whatever format and encoding they wish. My concern is that CIF software writers are not drawn into supporting obscure or ad hoc encodings.
> (John) B) Scheme B does not use quite the same language as scheme A with respect to detectable encodings. As a result, it supports (without tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32 variants. This is intentional.
> (James) I am concerned that the vast majority of users based in English-speaking countries (and many non-English-speaking countries) will be quite annoyed if they have to deal with UTF-16/32 CIF2 files that are no longer accessible to the simple ASCII-based tools and software that they are used to. Because of this, allowing undecorated UTF16/32 would be far more disruptive than forcing people to use UTF8 only. Thus my stipulation on maintaining compatibility with ASCII for undecorated files.
> (John) Supporting UTF-16/32 without tagging or hashing is not a key provision of scheme B, and I could live without it, but I don't think that would significantly change the likelihood of a user unexpectedly encountering undecorated UTF-16/32 CIFs. It would change only whether such files were technically CIF-conformant, which doesn't much matter to the user on the spot. In any case, it is not the lack of decoration that is the basic problem here.
> (James now) Yes, that is true. A decorated UTF16 file is just as unreadable as an undecorated one in ASCII tools. However, per my comments at the start of this email, I think an extra bit of hoop jumping for non UTF8 encoded files has the desirable property of encouraging UTF8 use.
> (John) C) Scheme B is not aimed at ensuring that every conceivable receiver be able to interpret every scheme-B-compliant CIF. Instead, it provides receivers the ability to *judge* whether they can interpret particular CIFs, and afterwards to *verify* that they have done so correctly. Ensuring that receivers can interpret CIFs is thus a responsibility of the sender / archive maintainer, possibly in cooperation with the receiver / retriever.
> (James) As I've said before, I don't see the paradigm of live negotiation between senders and receivers as very useful, as it fails to account for CIFs being passed between different software (via reading/writing to a file system), or CIFs where the creator is no longer around, or technically unsophisticated senders where, for example, the software has produced an undecorated CIF in some native encoding and the sender has absolutely no idea why the receiver (if they even have contact with the receiver!) can't read the file properly. I prefer to see the standard that we set as a substitute for live negotiation, so leaving things up to the users is in that sense an abrogation of our responsibility.
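[Editorial aside: the two hazards being weighed in this exchange, undecorated UTF-16/32 defeating ASCII-based tools, and a wrong encoding guess silently corrupting a non-ASCII tag, are easy to demonstrate. A small sketch, assuming Python; the data name used is invented purely for illustration.]

    text = "_chemical_name_común  'example'"      # invented non-ASCII data name

    as_utf16 = text.encode("utf-16-le")
    # ASCII-oriented tools see a NUL byte after every character, e.g.
    # b'_\x00c\x00h\x00e\x00m\x00...', so grep, sed or a plain editor treats
    # the file as binary even though the content is almost entirely ASCII.

    as_utf8 = text.encode("utf-8")
    misread = as_utf8.decode("latin-1")
    # misread == "_chemical_name_comÃºn  'example'": the file still opens,
    # but the data name has silently changed - the tag-confusion problem
    # raised earlier in the thread.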
> (John) That scenario will undoubtedly occur occasionally regardless of the outcome of this discussion. If it is our responsibility to avoid it at all costs then we are doomed to fail in that regard. Software *will* under some circumstances produce undecorated, non-UTF-8 "CIFs" because that is sometimes convenient, efficient, and appropriate for the program's purpose.
> I think, though, those comments reflect a bit of a misconception. The overall purpose of CIF supporting multiple encodings would be to allow specific CIFs to be better adapted for specific purposes. Such purposes include, but are not limited to:
> () exchanging data with general-purpose program(s) on the same system
> () exchanging data with crystallography program(s) on the same system
> () supporting performance or storage objectives of specific programs or systems
> () efficiently supporting problem or data domains in which Latin text is a minority of the content (e.g. imgCIF)
> () storing data in a personal archive
> () exchanging data with known third parties
> () publishing data to a general audience
> *Few, if any, of those uses would be likely to involve live negotiation.* That's why I assigned primary responsibility for selecting encodings to the entity providing the CIF. I probably should not even have mentioned cooperation of the receiver; I did so more because it is conceivable than because it is likely.
> (James now) OK, fair enough. My issue then with the paradigm of provider-based encoding selection is that it only works where the provider is capable of making this choice, and it puts that responsibility on all providers, large and small. Of course, I am keen to construct a CIF ecology where providers always automatically choose UTF8 as the "safe" choice.
> (John) Under any scheme I can imagine, some CIFs will not be well suited to some purposes. I want to avoid the situation that *no* conformant CIF can be well suited to some reasonable purposes. I am willing to forgo the result that *every* conformant CIF is suited to certain other, also reasonable purposes.
> (James now) Fair enough. However, so far the only reasonable purpose that I can see for which a UTF8 file would not be suitable is exchanging data with general-purpose programs that do not cope with UTF8, and it may well be that with a bit of research the list of such programs would turn out to be rather short.
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding

From yaya at bernstein-plus-sons.com Thu Aug 26 02:02:26 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Wed, 25 Aug 2010 21:02:26 -0400 (EDT)
Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To: <902931.65953.qm@web87003.mail.ird.yahoo.com>
References: <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com> <902931.65953.qm@web87003.mail.ird.yahoo.com>

With software, we do "release candidates". I would suggest that the proponents of the UTF8-only approach prepare their CIF2 release candidate, that those of us who favor a more general encoding approach prepare our release candidate, and that we put both forward to the communities involved and see what reaction we get.

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya at dowling.edu
=====================================================

On Wed, 25 Aug 2010, SIMON WESTRIP wrote:
> "to present the various ideas to the community in the form of a completed standard with supporting software and see if they accept it"
> I tend to agree - the stumbling block is the "completed standard" (at least w.r.t. encoding?)
> :-)
> ____________________________________________________________________________
> From: Herbert J. Bernstein
> To: Group for discussing encoding and content validation schemes for CIF2
> Sent: Thursday, 26 August, 2010 0:57:44
> Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
> While I disagree with these estimates of how various communities will react, the best way to find out is not for us to debate among ourselves, but to present the various ideas to the community in the form of a completed standard with supporting software and see if they accept it. In the case of core CIF, that community has accepted what they were offered. In the case of mmCIF, that community has essentially rejected what they were offered. So, after all these years of effort on CIF2, isn't it past time to finish something, put it out there and see if it flies?
> As for my own views: I remind you that XML is the end result of the essentially failed SGML effort followed by the highly successful HTML effort. XML saved the SGML effort by adopting a large part of the simplicity and flexibility of HTML. Please bear that in mind.
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
> +1-631-244-3035
> yaya at dowling.edu
> =====================================================
> On Wed, 25 Aug 2010, SIMON WESTRIP wrote:
> > Dear all
> > Recent contributions have stimulated me to revisit some of the fundamental issues of the possible changes in CIF2 with respect to CIF1, in particular, the impact on current practice (as I perceive it, based on my experience).
> > The following is a summary of my thoughts, trying to look at this from two perspectives (forgive me if I repeat earlier opinions):
> > 1) User perspective
> > To date, in the 'core' CIF world (i.e. single-crystal and its extensions), users treat CIFs as text files, and expect to be able to read them as such using plain-text editors, and indeed edit them if necessary for e.g. publication purposes. Furthermore, they expect them to be readable by applications that claim that ability (e.g. graphics software).
> > The situation is slightly different with mmCIF (and the pdb variants), where users tend to treat these CIFs as data sources that can be read by applications without any need to examine the raw CIF themselves, let alone edit them.
> > Although the above statements only encompass two user groups and are based on my personal experience, I believe these groups are the largest when talking about CIF users?
> > So what is the impact on such users of introducing the use of non-ASCII text and thus raising the text encoding issue?
> > In the latter case, probably minimal, inasmuch as the users don't interact directly with the raw CIF and rely on CIF processing software to manage the data.
> > In the former case, it is quite possible that a user will no longer be able to edit the raw CIF using the same plain-text editor they have always used for such purposes. For example, if a user receives a CIF that has been encoded in UTF16 by some remote CIF processing system, and opens it in a non-UTF16-aware plain-text editor, they will not be presented with what they would expect, even if the character set in that particular CIF doesn't extend beyond ASCII; furthermore, even 'advanced' text editors would struggle if the encoding were e.g. UTF16BE (i.e. has no BOM). Granted, this example is equally applicable to CIF1, but by 'opening up' multiple encodings, the probability of their usage increases?
> > So as soon as we move beyond ASCII, we have to accept that a large group of CIF users will, at the very least, have to be aware that CIF is no longer the 'text' format that they once understood it to be?
> > 2) Developer perspective
> > I believe that developers presented with a documented standard will follow that standard and prefer to work with no uncertainties, especially if they are unfamiliar with the format (perhaps they just need to be able to read a CIF to extract data relevant to their application/database...?)
> > Taking the example of XML, in my experience developers seem to follow the standard quite strictly. Most everyday applications that process XML are intolerant of violations of the standard. Fortunately, it is largely only developers that work with raw XML, so the standard works well.
> > In contrast to XML, with HTML/javascript the approach to the 'standard' is far more tolerant. Though these languages are standardized, in order to compete, the leading application developers have had to adopt flexibility (e.g. browsers accept 'dirty' HTML, are remarkably forgiving of syntax violations in javascript, and alter the standard to achieve their own ends or facilitate user requirements).
> > I suspect this results largely from the evolution of the languages: just as in the early days of CIF, encouragement of use and the end results were more important than adherence to the documented standard?
> > Note that these same applications that are so tolerant of HTML/javascript violations are far less forgiving of malformed XML. So is the lesson here that developers expect new standards to be unambiguous and will code accordingly (especially if the new standard was partly designed to address the shortcomings of its ancestors)?
> > Again, forgive me if this all sounds familiar - however, before arguing one way or the other with regard to specifics, perhaps the wider group would like to confirm or otherwise the main points I'm trying to assert, in particular, with respect to *user* practice:
> > 1) CIF2 will require users to change the way they view CIF - i.e. they may be forced to use CIF2-compliant text editors/application software, and abandon their current practice.
> > With respect to developers, recent coverage has been very insightful, but just out of interest, would I be wrong in stating that:
> > 2) Developers, especially those that don't specialize in CIF, are likely to want a clear-cut universal standard that does not require any heuristic interpretation.
> > Cheers
> > Simon
> > ____________________________________________________________________________
> > From: James Hester
> > To: Group for discussing encoding and content validation schemes for CIF2
> > Sent: Tuesday, 24 August, 2010 4:38:27
> > Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
> > Thanks John for a detailed response.
> > At the top of this email I will address this whole issue of optional behaviour. I was clearly too telegraphic in previous posts, as Herbert thinks that optional whitespace counts as an optional feature, so I go into some detail below.
> > By "optional features" I mean those aspects of the standard that are not mandatory for both readers and writers, and in addition I am not concerned with features that do not relate directly to the information transferred, e.g. optional warnings.
These are the > > people who play a big role in deciding which optional parts of the > > standard are used, as they are the ones that write the software that > > attempts to read and write the files.? Developers will typically > > choose to support optional features based on how likely they are to be > > used, which depends in part on how likely they are perceived to be > > implemented in other software.? This is a recursive, potentially > > unstable situation, which will eventually resolve itself in one of > > three ways: > > > > (1) A "standard" subset of optional features develops and is > > approximately always implemented in readers.? Special cases: > > ? (a) No optional features are implemented > > ? (b) All optional features are implemented > > (2) A variety of "standard" subsets develop, dividing users into > > different communities. These communities can't always read each > > other's files without additional conversion software, but there is > > little impetus to write this software, because if there were, the > > developers would have included support for the missing options in the > > first place.? The most obvious example of such communities would be > > thosed based on options relating to natural languages, if those > > communities do not care about accessibility of their files to > > non-users of their language and encoding. > > (3) A truly chaotic situation develops, with no discernable resolution > > and a plethora of incompatible files and software. > > > > Outcome 1 is the most desirable, as all files are now readable by all > > readers, meaning no additional negotiation is necessary, just as if we > > had mandated that set of optional features.? Outcome 2 is less > > desirable, as more software needs to be written and the standard by > > itself is not necessarily enough information to read a given file. > > Outcome 3 is obviously pretty unwelcome, but unlikely as it would > > require a lot of competing influences, which would eventually change > > and allow resolution into (1) or (2).? Think HTML and Microsoft. > > > > Now let us apply the above analysis to CIF: some are advocating not > > exhaustively listing or mandating the possible CIF2 encodings (CIF1 > > did not list or mandate encoding either), leading to a range of > > "optional features" as I have defined it above (where support for any > > given encoding is a single "optional feature").? For CIF1, we had a > > type 1 outcome (only ASCII encoding was supported and produced). > > > > So: my understanding of the previous discussion is that, while we > > agree that it would be ideal if everyone used only UTF8, some perceive > > that the desire to use a different encoding will be sufficiently > > strong that mandating UTF8 will be ineffective and/or inconvenient. > > So, while I personally would advocate mandating UTF8, the other point > > of view would have us allowing non UTF8 encoding but hoping that > > everyone will eventually move to UTF8. > > > > In which case I would like to suggest that we use network effects to > > influence the recursive feedback loop experienced by programmers > > described above, so that the community settles on UTF8 in the same way > > as it has settled on ASCII for CIF1.? That is, we "load the dice" so > > that other encodings are disfavoured.? Here are some ways to "load the > > dice": > > > > (1) Mandate UTF8 only. 
> > (2) Make support for UTF8 mandatory in CIF processors > > (3) Force non UTF8 files to jump through extra hoops (which I think is > > necessary anyway) > > (4) Educate programmers on the drawbacks of non UTF8 encodings and > > strongly urge them not to support reading non UTF8 CIF files > > (5) Strongly recommend that the IUCr, wwPDB, and other centralised > > repositories reject non-UTF8-encoded CIF files > > (6) Make available hyperlinked information on system tools for dealing > > with UTF8 files on popular platforms, which could be used in error > > messages produced by programs (see (4)) > > > > I would be interested in hearing comments on the acceptability of > > these options from the rest of the group (I think we know how we all > > feel about (1)!). > > > > Now, returning to John's email: I will answer each of the points > > inline, at the same time attempting to get all the attributions > > correct. > > > > (James) I had not fully appreciated that Scheme B is intended to be > > applied only at the moment of transfer or archiving, and envisions > > users normally saving files in their preferred encoding with no hash > > codes or encoding hints required (I will call the inclusion of such > > hints and hashes as 'decoration'). > > > > (John) "Envisions users normally [...]" is a bit stronger than my > > position or the intended orientation of Scheme B. ?"Accommodates" > > would be my choice of wording. > > > > (James now) No problem with that wording, my point is that such > > undecorated files will be called CIF2 files and so are a target for > > CIF2 software developers, thus "unloading" the dice away from UTF8 and > > closer to encoding chaos. > > > > (James) ?A direct result of allowing undecorated files to reside on > > disk is that CIF software producers will need to write software that > > will function with arbitrary encodings with no decoration to help > > them, as that is the form that users' files will be most often be in. > > > > (John) The standard can do no more to prevent users from storing > > undecorated CIFs than it can to prevent users from storing CIF text > > encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding. > > More generally, all the standard can do is define the characteristics > > of a conformant CIF -- it can never prevent CIF-like but > > non-conformant files from being created, used, exchanged, or archived > > as if they were conformant CIFs. ?Regardless of the standard's > > ultimate position on this issue, software authors will have to be > > guided by practical considerations and by the real-world requirements > > placed on their programs. ?In particular, they will have to decide > > whether to accept "CIF" input that in fact violates the standard in > > various ways, and / or they will have to decide which optional CIF > > behaviors they will support. ?As such, I don't see a significant > > distinction between the alternatives before us as regards the > > difficulty, complexity, or requirements of CIF2 software. > > > > (James now) I have described the way the standard works to restrict > > encodings in the discussion at the top of this email.? Briefly, CIF > > software developers develop programs that conform with the CIF2 > > standard.? If that standard says 'UTF8', they program for UTF8.? If > > you want to work in ISO-8859-15 etc, you have to do extra work. 
> > > > Working in favour of such extra work would be a compelling use case, > > which I have yet to see (I note that the 'UTF8 only' standard posted > > to ccp4-bb and pdb-l produced no comments).? My strong perception is > > that any need for other encodings is overwhelmed by the utility of > > settling on a single encoding, but that perception would need > > confirmation from a proper survey of non-ASCII users. > > > > So, no we can't stop people saving CIF-like files in other encodings, > > but we can discourage it by creating significant barriers in terms of > > software availability.? Just like we can't stop CIF1 users saving > > files in JIS X 0208, but that doesn't happen at any level that causes > > problems (if it happens at all, which I doubt). > > > > (John) Furthermore, no formulation of CIF is inherently reliable or > > unreliable, because reliability (in this sense) is a characteristic of > > data transfer, not of data themselves. ?Scheme B targets the > > activities that require reliability assurance, and disregards those > > that don't. ?In a practical sense, this isn't any different from > > scheme A, because it is only when the encoding is potentially > > uncertain -- to wit, in the context of data transfer -- that either > > scheme need be applied (see also below). ?I suppose I would be willing > > to make scheme B a general requirement of the CIF format, but I don't > > see any advantage there over the current formulation. ?The actual > > behavior of people and the practical requirements on CIF software > > would not appreciably change. > > > > (James now) I would suggest that Scheme B does not target all > > activites requiring reliability assurance, as it does not address the > > situation where people use a mix of CIF-aware software and text tools > > in a single encoding environment. > > > > The real, significant change that occurs when you accept Scheme B is > > that CIF files can now be in any encoding and undecorated. > > Programmers are then likely to provide programs that might or might > > not work with various encodings, and users feel justifiably that their > > undecorated files should be supported.? The software barrier that was > > encouraging UTF8-only has been removed, and the problem of mismatched > > encodings that we have been trying to avoid becomes that much more > > likely to occur.? Scheme B has very few teeth to enforce decoration at > > the point of transfer, as the software at either end is now probably > > happy with an undecorated file.? Requiring decoration as a condition > > of being a CIF2 file means that software will tend to reject > > undecorated files, thereby limiting the damage that would be caused by > > open slather encoding. > > > > (James) ?Furthermore, given the ease with which files can be > > transferred between users (email attachment, saved in shared, > > network-mounted directory, drag and drop onto USB stick etc.) it is > > unlikely that Scheme B or anything involving extra effort would be > > applied unless the recipient demanded it. > > > > (John) For hand-created or hand-edited CIFs, I agree. ?CIFs > > manipulated via a CIF2-compliant editor could be relied upon to > > conform to scheme B, however, provided that is standardized. ?But the > > same applies to scheme A, given that few operating environments > > default to UTF-8 for text. > > > > (James now) That is my goal: that any CIF that passes through a > > CIF-compliant program must be decorated before input and output (if > > not UTF8).? 
What hand-edited, hand-created CIFs actually have in the > > way of decoration doesn't bother me much, as these are very rare and > > of no use unless they can be read into a CIF program, at which point > > they should be rejected until properly decorated.? And I reiterate, > > the process of applying decoration can be done interactively to > > minimise the chances of incorrect assignment of encoding. > > > > (James) ?And given how many times that file might have changed hands > > across borders and operating systems within a single group > > collaboration, there would only be a qualified guarantee that the > > character to binary mapping has not been mangled en route, making any > > scheme applied subsequently rather pointless. > > > > (John) That also does not distinguish among the alternatives before > > us. ?I appreciate the desire for an absolute guarantee of reliability, > > but none is available. ?Qualified guarantees are the best we can > > achieve (and that's a technical assessment, not an aphorism). > > > > (James now) Oh, but I believe it does distinguish, because if CIF > > software reads only UTF8 (because that is what the standard says), > > then the file will tend to be in UTF8 at all points in time, with > > reduced possibilities for encoding errors.? I think it highly likely > > that each group that handles a CIF will at some stage run it through > > CIF-aware software, which means encoding mistakes are likely to be > > caught much earlier. > > > > (James) We would thus go from a situation where we had a single, > > reliable and sometimes slightly inconvenient encoding (UTF8), to one > > where a CIF processor should be prepared for any given CIF file to be > > one of a wide range of encodings which need to be guessed. > > > > (John) Under scheme A or the present draft text, we have "a single, > > reliable [...] encoding" only in the sense that the standard > > *specifies* that that encoding be used. ?So far, however, I see little > > will to produce or use processors that are restricted to UTF-8, and I > > have every expectation that authors will continue to produce CIFs in > > various encodings regardless of the standard's ultimate stance. ?Yes, > > it might be nice if everyone and every system converged on UTF-8 for > > text encoding, but CIF2 cannot force that to happen, not even among > > crystallographers. > > > > (James now) You see little will to do this: but as far as I can tell, > > there is even less will not to do it.? Authors will not "continue" to > > produce CIFs in various encodings, as they haven't started doing so > > yet.? As I've said above, CIF2 can certainly, if not force, encourage > > UTF8 adoption.? What's more, non-ASCII characters are only gradually > > going to find their way into CIF2 files, as the dictionaries and large > > scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters > > in names, and the users gradually adapt to this new way of doing > > things.? I have no sense that CIF users will feel a strong desire to > > use non UTF8 schemes, when they have been happy in an ASCII-only > > regime up until now.? But I'm curious: on what basis are you saying > > that there is little will to use processors that are restricted to > > UTF8? 
> > > > (John) In practice, then, we really have a situation where the > > practical / useful CIF2 processor must be prepared to handle a variety > > of encodings (details dependent on system requirements), which may > > need to be guessed, with no standard mechanism for helping the > > processor make that determination or for allowing it to check its > > guess. ?Scheme B improves that situation by standardizing a general > > reliability assurance mechanism, which otherwise would be missing. ?In > > view of the practical situation, I see no down side at all. ?A CIF > > processor working with scheme B is *more* able, not less. > > > > (James) I would much prefer a scheme which did not compromise > > reliability in such a significant way. > > > > (John) There is no such compromise, because in practice, we're not > > starting from a reliable position. > > > > (James now) I think your statement that our current position is not > > reliable arises out of a perception that users are likely to use a > > variety of encodings regardless of what the standard says.? I think > > this danger is way overstated, but I'd like to see you expand on why > > you think there is such a likelihood of multiple encodings being used > > > > (James) My previous (somewhat clunky) attempts to adjust Scheme B were > > directed at trying to force any file with the CIF2.0 magic number to > > be either decorated or UTF-8, meaning that software has a reasonably > > high confidence in file integrity. > > > > An alternative way of thinking about this is that CIF files also act > > as the mechanism of information transfer between software programs. > > [... W]hen a separate program is asked to input that CIF, the > > information has been transferred, even if that software is running on > > the same computer. > > > > (John) So in that sense, one could argue that Scheme B already applies > > to all CIFs, its assertion to the contrary notwithstanding. ?Honestly, > > though, I don't think debating semantic details of terms such as "data > > transfer" is useful because in practice, and independent of scheme A, > > B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to > > choose what form of reliability assurance to accept or demand, if any. > > > > (James now) I was only debating semantic details in order to expose > > the fact that data transfer occurs between programs, not just between > > systems, and that therefore Scheme B should apply within a single > > system, so therefore, all CIF2 files should be decorated.? As for who > > should be demanding reliability assurance, the receiver may not be in > > a position to demand some level of reliability if the file creator is > > not in direct contact.? Again, we can build this reliability into the > > standard and save the extra negotiation or loss of information that is > > otherwise involved. > > > > (James) Now, moving on to the detailed contours of Scheme B and > > addressing the particular points that John and I have been discussing. > > ?My original criticisms are the ones preceded by numerals. > > > > [(James now) I've deleted those points where we have reached > > agreement.? Those points are: > > (1) Restrict encodings to those for which the first line of a CIF file > > provides unambiguous encoding for ASCII codepoints > > (2) Put the hash value on the first line] > > > > (James a long time ago) (4) Assumption that all recipients will be > > able to handle all encodings > > > > (John) There is no such assumption. 
?Rather, there is an > > acknowledgement that some systems may be unable to handle some CIFs. > > That is already the case with CIF1, and it is not completely resolved > > by standardizing on UTF-8 (i.e. scheme A). > > > > (James) There is no such thing as 'optional' for an information > > interchange standard. ?A file that conforms to the standard must be > > readable by parsers written according to the standard. If reading a > > standard-conformant file might fail or (worse) the file might be > > misinterpreted, information cannot always reliably be exchanged using > > this standard, so that optional behaviour needs to be either > > discarded, or made mandatory. There is thus no point in including > > optional behaviour in the standard. So: if the standard allows files > > to be written in encoding XYZ, then all readers should be able to read > > files written in encoding XYZ. ?I view the CIF1 stance of allowing any > > encoding as a mistake, but a benign one, as in the case of CIF1 ASCII > > was so entrenched that it was the defacto standard for the characters > > appearing in CIF1 files. ?In short, we have to specify a limited set > > of acceptable encodings. > > > > (John) As Herb astutely observed, those assertions reflect a > > fundamental source of our disagreement. ?I think we can all agree that > > a standard that permits conforming software to misinterpret conforming > > data is undesirable. > > > > Surely we can also agree that an information interchange standard does > > not serve its purpose if it does not support information being > > successfully interchanged. ?It does not follow, however, that the > > artifacts by which any two parties realize an information interchange > > must be interpretable by all other conceivable parties, nor does it > > follow that that would be a supremely advantageous characteristic if > > it were achievable. ?It also does not follow that recognizable failure > > of any particular attempt at interchange must at all costs be avoided, > > or that a data interchange standard must take no account of its usage > > context. > > > > (James now) This is where we must make a policy decision: is a CIF2 > > file to be a universally understandable file?? I agree that excluding > > optional behaviour is not an absolute requirement, but I also consider > > that optional behaviour should not be introduced without solid > > justification, given the real cost in interoperability and portability > > of the standard.? You refer to two parties who wish to exchange > > information: those parties are always free to agree on private > > enhancements to the CIF2 standard (or to create their very own > > protocol), if they are in contact.? I do not see why this use case > > need concern us here.? Herbert can say to John 'I'm emailing you a > > CIF2 file but encoded in UTF16'.? John has his extremely excellent > > software which handles UTF16 and these two parties are happy. > > > > John mentions a 'usage context'.? If the standard is to include some > > account of usage context, then that context has to be specified > > sufficiently for a CIF2 programmer to understand what aspects of that > > context to consider, and not left open to misinterpretation.? Perhaps > > you could enlarge on what particular context should be included? > > > > (John) Optional and alternative behaviors are not fundamentally > > incompatible with a data interchange standard, as XML and HTML > > demonstrate. 
?Or consider the extreme variability of CIF text content: > > whether a particular CIF is suitable for a particular purpose depends > > intimately on exactly which data are present in it, and even to some > > extent on which data names are used to present them, even though ALL > > are optional as far as the format is concerned. ?If I say 'This CIF is > > unsuitable for my present purpose because it does not contain > > _symmetry_space_group_name_H-M', that does not mean the CIF standard > > is broken. ?Yet, it is not qualitatively different for me to say 'This > > CIF is unsuitable because it is encoded in CCSID 500' despite CIF2 > > (hypothetically) permitting arbitrary encodings. > > > > (James now)? The difference is quantitative and qualitative. > > Quantitative, because the number of CIF2 files that are unsuitable > > because of missing tags will always be less than or equal to the > > number of CIF2 files that are unsuitable because of a missing tag and > > unknown encoding.? Thus, by reducing ambiguity at the lower levels of > > the standard, we improve the utility at the higher levels.? The > > difference is also qualitative, in that (a) if we have tags with > > non-ASCII characters, they could conceivably be confused with other > > tags if the encoding is not correct and so you will have a situation > > where a file that is not suitable actually appears suitable, because > > the desired tag appears. Likewise, the value taken by a tag may be > > wrong. > > > > (James a long time ago) (iii) restrict possible encodings to > > internationally recognised ones with well-specified Unicode mappings. > > This addresses point (4) > > > > (John) I don't see the need for this, and to some extent I think it > > could be harmful. ?For example, if Herb sees a use for a scheme of > > this sort in conjunction with imgCIF (unknown at this point whether he > > does), then he might want to be able to specify an encoding specific > > to imgCIF, such as one that provides for multiple text segments, each > > with its own character encoding. ?To the extent that imgCIF is an > > international standard, perhaps that could still satisfy the > > restriction, but I don't think that was the intended meaning of > > "internationally recognised". > > > > (James now)? Indeed.? My intent with this specification was to ensure > > that third parties would be able to recover the encoding. If imgCIF is > > going to cause us to make such an open-ended specification, it is > > probably a sign that imgCIF needs to be addressed separately.? For > > example, should we think about redefining it as a container format, > > with a CIF header and UTF16 body (but still part of the > > "Crystallographic Information Framework")? > > > > (John) As for "well-specified Unicode mappings", I think maybe I'm > > missing something. ?CIF text is already limited to Unicode characters, > > and any encoding that can serve for a particular piece of CIF text > > must map at least the characters actually present in the text. ?What > > encodings or scenarios would be excluded, then, by that aspect of this > > suggestion? > > > > (James) My intention was to make sure that not only the particular > > user who created the file knew this mapping, but that the mapping was > > publically available. ?Certainly only Unicode encodable code points > > will appear, but the recipient needs to be able to recover the mapping > > from the file bytes to Unicode without relying on e.g. 
> > files that will be supplied on request by someone whose email address no longer works.
> >
> > (John) This issue is relevant only to the parties among whom a particular CIF is exchanged. The standard would not particularly assist those parties by restricting the permitted encodings, because they can safely ignore such restrictions if they mutually agree to do so (whether statically or dynamically), and they (specifically, the CIF originator) must anyway comply with them if no such agreement is implicit or can be reached.
> >
> > (James) Again, any two parties in current contact can send each other files in whatever format and encoding they wish. My concern is that CIF software writers are not drawn into supporting obscure or ad hoc encodings.
> >
> > (John) B) Scheme B does not use quite the same language as scheme A with respect to detectable encodings. As a result, it supports (without tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32 variants. This is intentional.
> >
> > (James) I am concerned that the vast majority of users based in English-speaking countries (and many non-English-speaking countries) will be quite annoyed if they have to deal with UTF-16/32 CIF2 files that are no longer accessible to the simple ASCII-based tools and software that they are used to. Because of this, allowing undecorated UTF16/32 would be far more disruptive than forcing people to use UTF8 only. Thus my stipulation on maintaining compatibility with ASCII for undecorated files.
> >
> > (John) Supporting UTF-16/32 without tagging or hashing is not a key provision of scheme B, and I could live without it, but I don't think that would significantly change the likelihood of a user unexpectedly encountering undecorated UTF-16/32 CIFs. It would change only whether such files were technically CIF-conformant, which doesn't much matter to the user on the spot. In any case, it is not the lack of decoration that is the basic problem here.
> >
> > (James now) Yes, that is true. A decorated UTF16 file is just as unreadable as an undecorated one in ASCII tools. However, per my comments at the start of this email, I think an extra bit of hoop jumping for non-UTF8-encoded files has the desirable property of encouraging UTF8 use.
> >
> > (John) C) Scheme B is not aimed at ensuring that every conceivable receiver be able to interpret every scheme-B-compliant CIF. Instead, it provides receivers the ability to *judge* whether they can interpret particular CIFs, and afterwards to *verify* that they have done so correctly. Ensuring that receivers can interpret CIFs is thus a responsibility of the sender / archive maintainer, possibly in cooperation with the receiver / retriever.
> >
> > (James) As I've said before, I don't see the paradigm of live negotiation between senders and receivers as very useful, as it fails to account for CIFs being passed between different software (via reading/writing to a file system), or CIFs where the creator is no longer around, or technically unsophisticated senders where, for example, the software has produced an undecorated CIF in some native encoding and the sender has absolutely no idea why the receiver (if they even have contact with the receiver!) can't read the file properly.
> > I prefer to see the standard that we set as a substitute for live negotiation, so leaving things up to the users is in that sense an abrogation of our responsibility.
> >
> > (John) That scenario will undoubtedly occur occasionally regardless of the outcome of this discussion. If it is our responsibility to avoid it at all costs, then we are doomed to fail in that regard. Software *will* under some circumstances produce undecorated, non-UTF-8 "CIFs" because that is sometimes convenient, efficient, and appropriate for the program's purpose.
> >
> > I think, though, those comments reflect a bit of a misconception. The overall purpose of CIF supporting multiple encodings would be to allow specific CIFs to be better adapted for specific purposes. Such purposes include, but are not limited to:
> >
> > () exchanging data with general-purpose program(s) on the same system
> > () exchanging data with crystallography program(s) on the same system
> > () supporting performance or storage objectives of specific programs or systems
> > () efficiently supporting problem or data domains in which Latin text is a minority of the content (e.g. imgCIF)
> > () storing data in a personal archive
> > () exchanging data with known third parties
> > () publishing data to a general audience
> >
> > *Few, if any, of those uses would be likely to involve live negotiation.* That's why I assigned primary responsibility for selecting encodings to the entity providing the CIF. I probably should not even have mentioned cooperation of the receiver; I did so more because it is conceivable than because it is likely.
> >
> > (James now) OK, fair enough. My issue then with the paradigm of provider-based encoding selection is that it only works where the provider is capable of making this choice, and it puts that responsibility on all providers, large and small. Of course, I am keen to construct a CIF ecology where providers always automatically choose UTF8 as the "safe" choice.
> >
> > (John) Under any scheme I can imagine, some CIFs will not be well suited to some purposes. I want to avoid the situation that *no* conformant CIF can be well suited to some reasonable purposes. I am willing to forgo the result that *every* conformant CIF is suited to certain other, also reasonable purposes.
> >
> > (James now) Fair enough. However, so far the only reasonable purpose that I can see for which a UTF8 file would not be suitable is exchanging data with general-purpose programs that do not cope with UTF8, and it may well be that with a bit of research the list of such programs would turn out to be rather short.
> >
> > --
> > T +61 (02) 9717 9907
> > F +61 (02) 9717 3145
> > M +61 (04) 0249 4148
> > _______________________________________________
> > cif2-encoding mailing list
> > cif2-encoding at iucr.org
> > http://scripts.iucr.org/mailman/listinfo/cif2-encoding

From jamesrhester at gmail.com Thu Aug 26 09:22:09 2010
From: jamesrhester at gmail.com (James Hester)
Date: Thu, 26 Aug 2010 18:22:09 +1000
Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To: <639601.73559.qm@web87008.mail.ird.yahoo.com>
References: <20100623103310.GD15883@emerald.iucr.org> <381469.52475.qm@web87004.mail.ird.yahoo.com> <984921.99613.qm@web87011.mail.ird.yahoo.com> <826180.50656.qm@web87010.mail.ird.yahoo.com> <563298.52532.qm@web87005.mail.ird.yahoo.com> <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com>
Message-ID:

Hi Simon and others,

What Simon describes accords closely with my perception of the situation, except that your final point regarding CIF2 requiring users to abandon text editors will depend on how we resolve the encoding issue. For me the logical conclusion from the points you make is to stick to UTF8-only encoding, which will keep the large majority of users and developers happy. Unfortunately others have the perception that UTF8-only will be overly restrictive, and lacking hard data we are having trouble deciding which of these two perceptions is correct. Clearly UTF8-only is not overly restrictive *now*, because it is *less* restrictive than the (de facto) CIF1 situation of ASCII-only, which has served us well. UTF8 may be restrictive in the future, when users of non-Latin-1 code points find that they don't know how to, or can't, use their favourite text editors for putting those code points into a CIF, but I'm not sure even the users themselves could answer the question now as to how likely that is going to be.

What I would suggest as a cautious compromise is to leave the door open for adding non-UTF8 encodings in the future, but not describing any scheme for doing this at present. One way to leave the door open like this would be to declare that the first line of a CIF2 file is 'special', and is reserved for future expansion. Our discussions on Scheme B are sufficiently far advanced to indicate that conventions relating to encoding schemes could be managed in the first line. The question of how strictly something like Scheme B should be applied remains open, and could be addressed once more in-field experience has been gained.

On Thu, Aug 26, 2010 at 9:08 AM, SIMON WESTRIP wrote:
> Dear all
>
> Recent contributions have stimulated me to revisit some of the fundamental issues of the possible changes in CIF2 with respect to CIF1, in particular, the impact on current practice (as I perceive it, based on my experience). The following is a summary of my thoughts, trying to look at this from two perspectives (forgive me if I repeat earlier opinions):
>
> 1) User perspective
>
> To date, in the 'core' CIF world (i.e. single-crystal and its extensions), users treat CIFs as text files, and expect to be able to read them as such using plain-text editors, and indeed edit them if necessary for e.g. publication purposes. Furthermore, they expect them to be readable by applications that claim that ability (e.g. graphics software).
>
> The situation is slightly different with mmCIF (and the PDB variants), where users tend to treat these CIFs as data sources that can be read by applications without any need to examine the raw CIF themselves, let alone edit them.
> Although the above statements only encompass two user groups and are based on my personal experience, I believe these groups are the largest when talking about CIF users?
>
> So what is the impact on such users of introducing the use of non-ASCII text and thus raising the text encoding issue?
>
> In the latter case, probably minimal, inasmuch as the users don't interact directly with the raw CIF and rely on CIF processing software to manage the data.
>
> In the former case, it is quite possible that a user will no longer be able to edit the raw CIF using the same plain-text editor they have always used for such purposes. For example, if a user receives a CIF that has been encoded in UTF16 by some remote CIF processing system, and opens it in a non-UTF16-aware plain-text editor, they will not be presented with what they would expect, even if the character set in that particular CIF doesn't extend beyond ASCII; furthermore, even 'advanced' text editors would struggle if the encoding were e.g. UTF16BE (i.e. no BOM). Granted, this example is equally applicable to CIF1, but by 'opening up' multiple encodings, the probability of their usage increases?
>
> So as soon as we move beyond ASCII, we have to accept that a large group of CIF users will, at the very least, have to be aware that CIF is no longer the 'text' format that they once understood it to be?
>
> 2) Developer perspective
>
> I believe that developers presented with a documented standard will follow that standard and prefer to work with no uncertainties, especially if they are unfamiliar with the format (perhaps they just need to be able to read a CIF to extract data relevant to their application/database...?)
>
> Taking the example of XML, in my experience developers seem to follow the standard quite strictly. Most everyday applications that process XML are intolerant of violations of the standard. Fortunately, it is largely only developers that work with raw XML, so the standard works well.
>
> In contrast to XML, with HTML/JavaScript the approach to the 'standard' is far more tolerant. Though these languages are standardized, in order to compete, the leading application developers have had to adopt flexibility (e.g. browsers accept 'dirty' HTML, are remarkably forgiving of syntax violations in JavaScript, and alter the standard to achieve their own ends or facilitate user requirements). I suspect this results largely from the evolution of the languages: just as in the early days of CIF, encouragement of use and the end results were more important than adherence to the documented standard?
>
> Note that these same applications that are so tolerant of HTML/JavaScript violations are far less forgiving of malformed XML. So is the lesson here that developers expect new standards to be unambiguous and will code accordingly (especially if the new standard was partly designed to address the shortcomings of its ancestors)?
>
> Again, forgive me if this all sounds familiar - however, before arguing one way or the other with regard to specifics, perhaps the wider group would like to confirm or otherwise the main points I'm trying to assert, in particular, with respect to *user* practice:
>
> 1) CIF2 will require users to change the way they view CIF - i.e. they may be forced to use CIF2-compliant text editors/application software, and abandon their current practice.
> With respect to developers, recent coverage has been very insightful, but just out of interest, would I be wrong in stating that:
>
> 2) Developers, especially those that don't specialize in CIF, are likely to want a clear-cut universal standard that does not require any heuristic interpretation.
>
> Cheers
>
> Simon

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From yaya at bernstein-plus-sons.com Thu Aug 26 12:24:49 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Thu, 26 Aug 2010 07:24:49 -0400 (EDT)
Subject: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
In-Reply-To:
References: <520427.68014.qm@web87001.mail.ird.yahoo.com> <614241.93385.qm@web87016.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local> <33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <639601.73559.qm@web87008.mail.ird.yahoo.com>
Message-ID:

Um, but CIF1 is _not_ ASCII-only. It is text in any acceptable local encoding.

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya at dowling.edu
=====================================================
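
Both the Scheme B notion of "detectable" encodings and James's 'reserved first line' compromise above assume that a reader can cheaply probe the start of a file before committing to a decoder. The sketch below shows roughly what such a probe amounts to; it is purely illustrative Python, and the BOM-then-try-UTF-8 policy and the function name are assumptions of this summary, not anything specified by CIF2 or by either scheme:

    import codecs

    # Check the longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    _BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8,     "utf-8-sig"),   # strips the UTF-8 BOM on decode
        (codecs.BOM_UTF16_LE, "utf-16"),      # 'utf-16' uses the BOM to pick endianness
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]

    def guess_cif_encoding(path):
        """Return a codec name for a putative CIF2 file, or None if unidentifiable."""
        with open(path, "rb") as fh:
            raw = fh.read()          # a real reader would not slurp very large files
        for bom, codec in _BOMS:
            if raw.startswith(bom):
                return codec
        try:
            raw.decode("utf-8")      # no BOM: assume UTF-8
            return "utf-8"
        except UnicodeDecodeError:
            return None              # undecorated and not UTF-8: not reliably identifiable

The last branch is the case the thread keeps returning to: an undecorated file in some other encoding cannot be identified from its bytes alone, which is why the Scheme B discussion leans on tagging and hashing rather than on guesswork.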
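In the same spirit, the two concrete failure modes discussed above (a wrongly decoded file that still presents a plausible-looking data name, per James, and an undecorated UTF-16 file that is opaque to byte-oriented ASCII tools, per Simon) can be reproduced in a few lines of Python; the accented data name is invented purely for this illustration:

    # A hypothetical non-ASCII data name, written out as UTF-8.
    name = "_propriété.exemple"
    utf8_bytes = name.encode("utf-8")

    # A reader that wrongly assumes Latin-1 still "succeeds", but silently sees a
    # different (yet still tag-shaped) name: '_propriÃ©tÃ©.exemple'.
    misread = utf8_bytes.decode("latin-1")
    assert misread != name and misread.startswith("_")

    # The same ASCII-only fragment in UTF-16BE interleaves NUL bytes, so a
    # byte-oriented search for '_cell_length_a' finds nothing.
    fragment = "_cell_length_a 10.123"
    assert b"_cell_length_a" not in fragment.encode("utf-16-be")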