From jamesrhester at gmail.com Fri Oct 1 02:21:08 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 1 Oct 2010 11:21:08 +1000 Subject: [Cif2-encoding] Revised Motion In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local> References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Hi everybody: I think it is fair to say that we are all agreed on the broad principle of the compromise position I proposed recently. The current lack of consensus I interpret as a desire for a bit of technical polish. One reason for the disparity is that my proposal was implicitly expressed in terms of a new paragraph to be added to our current 'Changes' document that is posted at http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf. Yesterday, Herbert and I (for no particular reason) discussed the changes in the context of Herbert's motion, which I had interpreted as largely repeating the content of that 'Changes' document, with the exception of the encoding paragraphs. I was not aware that there were any other controversial sections of Herbert's motion. My expectation is that we would accept (or decline) Herbert's motion as a joint statement of our position, and then rework the 'Changes' document accordingly. Herbert: I notice (now) that the paragraph immediately preceding the paragraph that we changed could be interpreted as conflicting with the new paragraph that you and I wrote, because it appears to cover the whole code point range. Could I suggest that it be replaced by the following: It is understood that CIF2 documents may be constructed and maintained on computers that implement other character encodings. 
For maximum portability only the clearly identified equivalents to the Unicode characters identified above and below should be used and use of UTF-8 for a concrete representation is highly recommended. However, for compatibility with CIF1 behaviour, there is no formal restriction on the encoding of CIF2 files providing they contain only code points from the ASCII range. Regarding the meaning of 'text': in the 'Changes' document, there is a section for definitions where I think we can define 'text' if we so wish; personally I think that writing 'plain text' instead of 'text' would be sufficient. On Fri, Oct 1, 2010 at 12:01 AM, Bollinger, John C < John.Bollinger at stjude.org> wrote: > > On Thursday, September 30, 2010 8:40 AM, Herbert J. Bernstein wrote: > > James and I had a good e-meeting and came up with the following > >revised wording. If anybody objects to this motion, please speak > >up now. > > With apologies, I object. This proposal has exactly the same problem that > options (1) and (2) did: it does not define "text file". It is worse in > this case, however, because the problem cannot be fixed merely by adding > Herbert's definition (or mine). In most environments that definition does > not encompass UTF-8 encoded text containing non-ASCII characters, so the > recommendation to use UTF-8 implies some other, ill-defined definition. > > I am quite surprised that the result presented is so different from James's > recent compromise proposal, which seemed poised to serve as the basis for a > consensus result. Perhaps a viable solution would be to include a > definition of "text file" derived from that proposal. > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. 
Jude Children's Research Hospital > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20101001/ac140949/attachment-0001.html From jamesrhester at gmail.com Fri Oct 1 02:35:15 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 1 Oct 2010 11:35:15 +1000 Subject: [Cif2-encoding] Revised Motion In-Reply-To: References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Hi Herbert - I think you are misinterpreting John. He is not trying to impose ASCII encoding; he is simply using the term ASCII to refer to Unicode code points less than 127. The only disagreement that I think you and I could have with his wording is that it does not leave open the possibility of COMCIFS approving encodings other than UTF8 and UTF16 (John is not against adding UTF16). On Fri, Oct 1, 2010 at 4:04 AM, Herbert J. Bernstein < yaya at bernstein-plus-sons.com> wrote: > Dear John, > > It appears you are proposing to add the words > > "Reference to text files means binary representations of sequences of > characters, either in a system-dependent form, provided that the > characters are all drawn from the ASCII set, or alternatively as the > sequence of bytes resulting from encoding the character sequence according > to UTF-8." > > Is, unfortunately, inaccurate and confusing and gets us back into the > looping discussion of binary versus text.
It opens up exactly the > issues we just tried to get away from of making it appear that > CIF2 is going to invalidate encodings that happen to be neither > ASCII nor UTF8. I realize that is not what you intend, but that > is what your paragraph seems to imply. > > This is not an easy concept to define. I just went through a large > number of text file definitions on the web, and it is amazing how > flawed they are in one way or another. For example, wordiq > says, "Text files (plain text files) are files with generally a one-to-one > correspondence between the bytes and ordinary readable characters such as > letters and digits," but that definition fails to consider UTF8 a text > file definition because it maps multiple bytes to readable characters > and multiple, very different byte sequences, all map to the same > readable character. The W3C definition is even more vague than the > CIF non-definition: > > "The text Content-Type is intended for sending material which is > principally textual in form. It is the default Content-Type. A "charset" > parameter may be used to indicate the character set of the body text. The > primary subtype of text is "plain". This indicates plain (unformatted) > text. The default Content-Type for Internet mail is "text/plain; > charset=us-ascii". > > Beyond plain text, there are many formats for representing what might be > known as "extended text" -- text with embedded formatting and presentation > information. An interesting characteristic of many such representations is > that they are to some extent readable even without the software that > interprets them. It is useful, then, to distinguish them, at the highest > level, from such unreadable data as images, audio, or text represented in > an unreadable form. In the absence of appropriate interpretation software, > it is reasonable to show subtypes of text to the user, while it is not > reasonable to do so with most nontextual data.
> > Such formatted textual data should be represented using subtypes of text. > Plausible subtypes of text are typically given by the common name of the > representation format, e.g., "text/richtext". " > > Coming to an acceptable formal resolution on the meaning of "text" would > seem likely to take a very, very long time. We need to move on. > > Please recall that what we are discussing is a revision to the existing, > larger CIF 1.1 syntax definition to create the CIF2 syntax definition, > and are just trying to get a clear enough definition of what users and > software developers need to do to cope with the extension of the > number of code points past 126. > > I would suggest that we go forward with the motion as it stands now > and that we all carefully read CIF 1.1 syntax definition to see if > and where it might make sense to insert some clear, agreed definition of > a text file at some future time, but I really don't think most users or > software developers will have a serious problem in getting started with > CIF2 leaving any ambiguity about the concept of a text file at the same > level it has been under CIF1 with this motion added. > > Once we have a clear, agreed understanding of the more metaphysical > aspects of what text is, we can then share that with the > community. Meanwhile, they hopefully will already be using CIF2. > > Regards, > Herbert > > > > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Thu, 30 Sep 2010, Bollinger, John C wrote: > > > > > On Thursday, September 30, 2010 8:40 AM, Herbert J. Bernstein wrote: > >> James and I had a good e-meeting and came up with the following > >> revised wording. If anybody objects to this motion, please speak > >> up now.
> > > > With apologies, I object. This proposal has exactly the same problem > > that options (1) and (2) did: it does not define "text file". It is > > worse in this case, however, because the problem cannot be fixed merely > > by adding Herbert's definition (or mine). In most environments that > > definition does not encompass UTF-8 encoded text containing non-ASCII > > characters, so the recommendation to use UTF-8 implies some other, > > ill-defined definition. > > > > I am quite surprised that the result presented is so different from > > James's recent compromise proposal, which seemed poised to serve as the > > basis for a consensus result. Perhaps a viable solution would be to > > include a definition of "text file" derived from that proposal. > > > > > > Regards, > > > > John > > -- > > John C. Bollinger, Ph.D. > > Department of Structural Biology > > St. Jude Children's Research Hospital > > > > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > > > _______________________________________________ > > cif2-encoding mailing list > > cif2-encoding at iucr.org > > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20101001/5b995132/attachment.html From yaya at bernstein-plus-sons.com Fri Oct 1 03:41:12 2010 From: yaya at bernstein-plus-sons.com (Herbert J. 
Bernstein) Date: Thu, 30 Sep 2010 22:41:12 -0400 (EDT) Subject: [Cif2-encoding] Revised Motion In-Reply-To: References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear James, I know what John means, but it is not what his amendment says. That is not because he is bad at wording this. I have checked and a lot of people have tried for many years to come up with something with similar intent and have failed. That does not mean we should stop trying, but it does mean that it _has_ to be decoupled from getting CIF2 moving, or we will be at this forever, looping trying to find words that express an idea that does _not_ need to be settled to allow code to be written and files to be created. Please, the current revised motion as it stands expresses what is probably the only compromise we can reach in finite time. A significant majority of this group favor it. You were mistaken in thinking John would favor it. So be it. There is _nothing_ bad in the current motion. Let us not screw it up with John's poorly phrased paragraph. If he can come up with decent wording some time for what he is trying to say before this summer and there is broad support, we can always include it at Madrid, but holding up progress on CIF2 any further now is very unwise. Please let us have an end to this and go with what you and I negotiated. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Fri, 1 Oct 2010, James Hester wrote: > Hi Herbert - I think you are misinterpreting John.
He is not trying to > impose ASCII encoding, he is simply using the term ASCII to refer to Unicode > code points less than 127. The only disagreement that I think you and I > could have with his wording is that it does not leave open the possibility > of COMCIFS approving encodings other than UTF8 and UTF16 (John is not > against adding UTF16). > > On Fri, Oct 1, 2010 at 4:04 AM, Herbert J. Bernstein > wrote: > Dear John, > > It appears you are proposing to add the words > > "Reference to text files means binary representations of > sequences of > characters, either in a system-dependent form, provided that the > characters are all drawn from the ASCII set, or alternatively as > the > sequence of bytes resulting from encoding the character sequence > according > to UTF-8." > > Is, unfortunately, inaccurate and confusing and gets us back into the > looping discussion of binary versus text. It opens up exactly the > issues we just tried to get away from of making it appear that > CIF2 is going to invalidate encodings that happen to be neither > ASCII nor UTF8. I realize that is not what you intend, but that > is what your paragraph seems to imply. > > This is not an easy concept to define. I just went through a large > number of text file definitions on the web, and it is amazing how > flawed they are in one way or another. For example, wordiq > says, "Text files (plain text files) are files with generally a > one-to-one > correspondence between the bytes and ordinary readable characters such > as > letters and digits," but that definition fails to consider UTF8 a text > file definition because it maps multiple bytes to readable characters > and multiple, very different byte sequences, all map to the same > readable character. The W3C definition is even more vague than the > CIF non-definition: > > "The text Content-Type is intended for sending material which is > principally textual in form. It is the default Content-Type.
A > "charset" > parameter may be used to indicate the character set of the body text. > The > primary subtype of text is "plain". This indicates plain (unformatted) > text. The default Content-Type for Internet mail is "text/plain; > charset=us-ascii". > > Beyond plain text, there are many formats for representing what might > be > known as "extended text" -- text with embedded formatting and > presentation > information. An interesting characteristic of many such > representations is > that they are to some extent readable even without the software that > interprets them. It is useful, then, to distinguish them, at the > highest > level, from such unreadable data as images, audio, or text represented > in > an unreadable form. In the absence of appropriate interpretation > software, > it is reasonable to show subtypes of text to the user, while it is not > reasonable to do so with most nontextual data. > > Such formatted textual data should be represented using subtypes of > text. > Plausible subtypes of text are typically given by the common name of > the > representation format, e.g., "text/richtext". " > > Coming to an acceptable formal resolution on the meaning of "text" > would > seem likely to take a very, very long time. We need to move on. > > Please recall that what we are discussing is a revision to the > existing, > larger CIF 1.1 syntax definition to create the CIF2 syntax definition, > and are just trying to get a clear enough definition of what users and > software developers need to do to cope with the extension of the > number of code points past 126.
> > I would suggest that we go forward with the motion as it stands now > and that we all carefully read CIF 1.1 syntax definition to see if > and where it might make sense to insert some clear, agreed definition > of > a text file at some future time, but I really don't think most users > or > software developers will have a serious problem in getting started > with > CIF2 leaving any ambiguity about the concept of a text file at the > same > level it has been under CIF1 with this motion added. > > Once we have a clear, agreed understanding of the more metaphysical > aspects of what text is, we can then share that with the > community. Meanwhile, they hopefully will already be using CIF2. > > Regards, > Herbert > > > > > ===================================================== > Herbert J. Bernstein, Professor of Computer Science > Dowling College, Kramer Science Center, KSC 121 > Idle Hour Blvd, Oakdale, NY, 11769 > > +1-631-244-3035 > yaya at dowling.edu > ===================================================== > > On Thu, 30 Sep 2010, Bollinger, John C wrote: > > > > > On Thursday, September 30, 2010 8:40 AM, Herbert J. Bernstein wrote: > >> James and I had a good e-meeting and came up with the following > >> revised wording. If anybody objects to this motion, please speak > >> up now. > > > > With apologies, I object. This proposal has exactly the same > problem > > that options (1) and (2) did: it does not define "text file". It is > > worse in this case, however, because the problem cannot be fixed > merely > > by adding Herbert's definition (or mine). In most environments that > > definition does not encompass UTF-8 encoded text containing > non-ASCII > > characters, so the recommendation to use UTF-8 implies some other, > > ill-defined definition.
> > > > I am quite surprised that the result presented is so different from > > James's recent compromise proposal, which seemed poised to serve as > the > > basis for a consensus result. Perhaps a viable solution would be to > > include a definition of "text file" derived from that proposal. > > > > > > Regards, > > > > John > > -- > > John C. Bollinger, Ph.D. > > Department of Structural Biology > > St. Jude Children's Research Hospital > > > > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > > > _______________________________________________ > > cif2-encoding mailing list > > cif2-encoding at iucr.org > > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > > > > > -- > T +61 (02) 9717 9907 > F +61 (02) 9717 3145 > M +61 (04) 0249 4148 > > From yaya at bernstein-plus-sons.com Fri Oct 1 04:00:01 2010 From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein) Date: Thu, 30 Sep 2010 23:00:01 -0400 (EDT) Subject: [Cif2-encoding] Revised Motion In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFA@SJMEMXMBS11.stjude.sjcrh.local> References: <646265.82162.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDE7@SJMEMXMBS11.stjude.sjcrh.local > <8F77913624F7524AACD2A92EAF3BFA5416659DEDE9@SJMEMXMBS11.stjude.sjcrh.local > <20100929102536.GB24670@emerald.iucr.org> <20100930084028.GC9485@emerald.iucr.org> <629785.55688.qm@web87004.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDF0@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDFA@SJMEMXMBS11.stjude.sjcrh.local> Message-ID: Dear John, You miss the point -- many people have tried and failed to clearly specify what you are specifying.
Other than your wanting this wording to be written, precisely what is the harm to CIF2 in leaving the ambiguity that has existed for many years in CIF and many other text format descriptions? I have tried to come up with wording that will not cause confusion or provoke a fight, and I have failed. If you can do it, more power to you, but why should CIF2 be further delayed while you work on it? Enough is enough. It is time to move on. Regards, Herbert ===================================================== Herbert J. Bernstein, Professor of Computer Science Dowling College, Kramer Science Center, KSC 121 Idle Hour Blvd, Oakdale, NY, 11769 +1-631-244-3035 yaya at dowling.edu ===================================================== On Thu, 30 Sep 2010, Bollinger, John C wrote: > Dear Herbert, > > On Thursday, September 30, 2010 1:05 PM, Herbert J. Bernstein wrote: > >> It appears you are proposing to add the words >> >> "Reference to text files means binary representations of sequences of >> characters, either in a system-dependent form, provided that the >> characters are all drawn from the ASCII set, or alternatively as the >> sequence of bytes resulting from encoding the character sequence according >> to UTF-8." > > Yes. I am open to variations on the wording, but I'm looking for something along those lines to be added to the spec. Am I wrong that yesterday we were close to doing just that via James's proposal? > >> Is, unfortunately, inaccurate and confusing and gets us back into the >> looping discussion of binary versus text. It opens up exactly the >> issues we just tried to get away from of making it appear that >> CIF2 is going to invalidate encodings that happen to be neither >> ASCII nor UTF8. I realize that is not what you intend, but that >> is what your paragraph seems to imply. > > I can accept that the wording may be confusing, and I would welcome constructive criticism on that topic.
You cannot sustain a claim that my text is inaccurate, however, without providing at least a partial alternative definition that conflicts. In other words, what's inaccurate about it? This is a highly relevant question, because I find my text to be entirely reasonable, and I might well program according to that interpretation without some guidance otherwise. If I don't use that, then what *do* I use? > > As for binary vs. text, I have realized that's a false dichotomy in our context. Every computer file is binary, in the sense that it is a sequence of bytes. Some are a _particular type_ of binary that we call "text" (but can't seem to define). The two are not mutually exclusive. This is quite different from the traditional binary vs. text issue, which relates to questions such as whether to represent the number 12345 in IEEE 32-bit floating-point format or as five decimal digits. > >> This is no an easy concept to define. I just went through a large >> number of text file definitions on the web, and it is amazing how >> flawed they are are in one way or another. > > That is precisely why I am so persistent about putting a definition in the spec. If I choose the definition I think best, and you the one you think best, and James and Simon likewise, then will any of our programs be fully compatible with each other? Simon likes identifiable encodings, so maybe he'll feel free to write UTF-32LE CIFs. Will your programs accept those? Should they? To be prepared to process all conformant CIFs, does my program need to be able to handle KOI8-R and Shift-JIS CIFs? If I use MS Word to create CIFs, and I save them in Rich *Text* Format, then should I be upset when James's software rejects them? > > I don't think it's correct to say that the concept is difficult to define. I could write half a dozen definitions in as many minutes, each appropriate for some particular purpose. 
It's more accurate to say that there are many alternative definitions in use, none of them completely compatible with the others. There is no reason why we can't choose the one we find most suitable, or write one of our own. > > [...] > >> Coming to an acceptable formal resolution on the meaning of "text" would > seem likely to take a very, very long time. > > You already provided a definition that was good enough for me. My proposed text summarizes and abridges it, perhaps too much, as "a system-dependent form". I would be content to replace that phrase with your full text, or with the functionally equivalent text I labeled "local". > >> We need to move on. > > We need to answer the question. Or COMCIFS does, if we're not up to it. The spec is incomplete and inadequate without an answer. > >> Please recall that what we are discussing is a revision to the existing, >> larger CIF 1.1 syntax definition to create the CIF2 syntax definition, >> and are just trying to get a clear enough definition of what users and >> software developers need to do to cope with the extension of the >> number of code points past 126. > > And what definition are we then providing? The only clear thing I see is that users and developers are *probably* safe if they write UTF-8. If UTF-8 is the only safe option for CIFs with non-ASCII characters, then how does that differ from my proposal? > >> I would suggest that we go forward with the motion as it stands now >> and that we all carefully read CIF 1.1 syntax definition to see if >> and where it might make sense to insert some clear, agreed definition of >> a text file at some future time, but I really don't think most users or >> software developers will have a serious problem in getting started with >> CIF2 leaving any ambiguity about the concept of a text file at the same >> level it has been under CIF1 with this motion added. > > This area presents a much greater problem for CIF2 and its expanded character set than it did for CIF1.
I quite agree that most users and software developers would get started with CIF2 despite the ambiguity. I cannot see how we would then avoid a slew of problems of the form "X software doesn't handle my CIF" and "Y software produces broken CIFs" and "Z software is incompatible with W software". I do not see how that could be construed as a win for CIF2. > >> Once we have a clear, agreed understanding of the more metaphysical >> aspects of what text is, we can then share that with the >> community. Meanwhile, they hopefully will already be using CIF2. > > This is not an arcane subject that we need to "understand", it is a question that we have the opportunity to *answer* for the purposes of CIF. We do not need a definition that everyone, everywhere will acclaim as the full and perfect meaning of "text". We just need to be clear what the specification means by the term. If we don't know what the specification means by the term, then we should be embarrassed to advance it. > > > Regards, > > John > -- > John C. Bollinger, Ph.D. > Department of Structural Biology > St. Jude Children's Research Hospital > > > Email Disclaimer: www.stjude.org/emaildisclaimer > > _______________________________________________ > cif2-encoding mailing list > cif2-encoding at iucr.org > http://scripts.iucr.org/mailman/listinfo/cif2-encoding > From jamesrhester at gmail.com Fri Oct 1 05:37:01 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 1 Oct 2010 14:37:01 +1000 Subject: [Cif2-encoding] Drafting issues Message-ID: Dear Group, As I think we have reached a consensus in principle, and are now moving into discussion of precise definitions, let us have wording arguments only once (that is, for a single document). 
I think that our base document must be the one that the DDLm group agreed on - the link once again is http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf - simply because it will be unnecessarily confusing for the DDLm group to deal with two documents at once, and the 'Changes' document is admirably precise. I reiterate once again that I am happy with the motion that Herbert presented, with the proviso that one paragraph is rewritten as I have recently proposed. Herbert - if you would like to negotiate that paragraph with me by Skype, I'm happy to do that too. I have appended a text version of what I consider to be the relevant sections of the 'changes' document to this message. I am happy to provide the complete document in OpenOffice format to anybody who would like it. Herbert - if you think any of the non-encoding discussion in your motion is not already covered in the 'Changes' document, please advise. I will be posting my own suggestion, largely based on parts of the motion that Herbert and I drafted yesterday, in a reply to this email.

CIF - Changes to the specification 05 July 2010

This document specifies changes to the *syntax* of CIF. We refer to the current syntax specification of CIF as CIF1, and the new specification as CIF2. To date all archival CIFs are CIF1.

The changes to syntax are necessitated by the adoption of new dictionary functionalities that introduce several extensions, including new data types, and method definitions using dREL.

It is assumed the reader has a thorough understanding of the CIF1 specification.

TERMINOLOGY

Reference to *character(s)* means abstract characters assigned code points by *Unicode*. Specific characters are referenced according to Unicode convention, U+*xxxx*[*x*[*x*]], where *xxxx*[*x*[*x*]] is the four- to six-digit hexadecimal representation of the assigned code point. The designated character encoding for CIF2 is UTF-8.
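
As an illustration of the notation just described (this is not part of the 'Changes' document), the following is a minimal Python sketch that prints a character's code point in U+xxxx form together with its UTF-8 byte sequence. The function names `code_point_label` and `utf8_bytes` are hypothetical and chosen only for this example.

```python
def code_point_label(ch: str) -> str:
    """Return the Unicode-convention label (U+ plus 4 to 6 hex digits) for one character."""
    return f"U+{ord(ch):04X}"


def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


if __name__ == "__main__":
    # An ASCII letter, a Latin-1 letter, and a character outside the BMP.
    for ch in ["A", "\u00c5", "\U0001D54A"]:
        print(code_point_label(ch), utf8_bytes(ch).hex(" "))
```

For characters in the ASCII range the UTF-8 bytes are the same as the ASCII bytes, which is why files restricted to that range are unaffected by the choice between ASCII and UTF-8.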

Reference to *ASCII* characters means characters U+0000 through U+007F, or, equivalently, the first 128 characters of the *ISO 8859-1* (*LATIN 1*) character set.

Reference to *newline* or *\n* means the sequence that conventionally terminates a line record (which is environment dependent). *See Change 3.*

Reference to *whitespace* means the characters ASCII space (U+0020), ASCII horizontal tab (U+0009) and the *newline* characters. Without regard to local convention, the various other characters that Unicode classifies as whitespace (character categories Zs and Zp) do not constitute *whitespace* for the purposes of CIF2.

PREAMBLE

CIF2 significantly extends CIF1 functionality, primarily through new dictionary features. CIF2 is not fully backwards-compatible with CIF1: many files compliant with CIF1 are also compliant with CIF2, but some are not (see especially change 5, below). The CIF1 standard will continue to operate for the foreseeable future in parallel with CIF2.

CHANGE 1 - NEW (MAGIC CODE)

A CIF2 file is uniquely identified by a required magic code at the beginning of its first line. The code is,

#\#CIF_2.0

followed immediately by *whitespace*.

CHANGE 2 - NEW (CHARACTER SET)

CIF2 files are standard variable length text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.

In keeping with XML restrictions we allow the characters

U+0009 U+000A U+000D
U+0020 - U+007E
U+00A0 - U+D7FF
U+E000 - U+FDCF
U+FDF0 - U+FFFD
U+10000 - U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is any hexadecimal digit are disallowed. Unicode reserves the code points E000 - F8FF for private use. The IUCr and only the IUCr may specify what characters are assigned to these code points in the context of CIF2.
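
To make the magic-code and character-set rules above concrete, here is a minimal Python sketch of how a reader might check them. It is an illustration only, not part of the 'Changes' document: the names `ALLOWED_RANGES`, `is_allowed`, `first_disallowed` and `has_magic_code` are hypothetical, and the sketch assumes the text has already been decoded to Unicode and that a newline appears as "\n" or "\r" after decoding.

```python
# Inclusive code point ranges copied from the list in Change 2.
ALLOWED_RANGES = [
    (0x0009, 0x0009), (0x000A, 0x000A), (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]

# The magic code exactly as written in Change 1 (note the literal backslash).
MAGIC = "#\\#CIF_2.0"

# Assumption: after decoding, a newline shows up as "\n" and/or "\r".
WHITESPACE = " \t\r\n"


def is_allowed(ch: str) -> bool:
    """True if the character falls in an allowed range and is not excluded."""
    cp = ord(ch)
    # U+FEFF and every U+xFFFE / U+xFFFF are disallowed outright.
    if cp == 0xFEFF or (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def first_disallowed(text: str):
    """Return the index of the first disallowed character, or None."""
    for i, ch in enumerate(text):
        if not is_allowed(ch):
            return i
    return None


def has_magic_code(first_line: str) -> bool:
    """True if the line starts with the magic code followed by whitespace."""
    if not first_line.startswith(MAGIC):
        return False
    rest = first_line[len(MAGIC):]
    return len(rest) > 0 and rest[0] in WHITESPACE


if __name__ == "__main__":
    sample = "#\\#CIF_2.0\ndata_example\n_note 'caf\u00e9'\n"
    print(has_magic_code(sample.splitlines(keepends=True)[0]))
    print(first_disallowed(sample))
```

A check of this kind operates on code points only; it says nothing about which byte encoding the file used on disk, which is the question under discussion in this thread.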
*Reasoning: There is growing demand for the wider character set afforded by Unicode to be made available in applications, especially those where internationalisation is an issue.* -- T +61 (02) 9717 9907 F +61 (02) 9717 3145 M +61 (04) 0249 4148 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://scripts.iucr.org/pipermail/cif2-encoding/attachments/20101001/99611fc7/attachment.html From jamesrhester at gmail.com Fri Oct 1 05:44:29 2010 From: jamesrhester at gmail.com (James Hester) Date: Fri, 1 Oct 2010 14:44:29 +1000 Subject: [Cif2-encoding] Drafting issues In-Reply-To: References: Message-ID: Before I post my revised text, I have only just realised (upon close perusal of the two texts) that Herbert's motion is substantially the same as the 'Changes' document, just without the headings etc, so we are discussing almost the same document. My apologies for the confusion. James. On Fri, Oct 1, 2010 at 2:37 PM, James Hester wrote: > Dear Group, > > As I think we have reached a consensus in principle, and are now moving > into discussion of precise definitions, let us have wording arguments only > once (that is, for a single document). I think that our base document must > be the one that the DDLm group agreed on - the link once again is > http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf- simply because it will be unnecessarily confusing for the DDLm group to > deal with two documents at once, and the 'Changes' document is admirably > precise. I reiterate once again that I am happy with the motion that > Herbert presented, with the proviso that one paragraph is rewritten as I > have recently proposed. Herbert - if you would like to negotiate that > paragraph with me by Skype, I'm happy to do that too. > > I have appended a text version of what I consider to be the relevant > sections of the 'changes' document to this message. 
> I am happy to provide the complete document in OpenOffice format to
> anybody who would like it. Herbert - if you think any of the non-encoding
> discussion in your motion is not already covered in the 'Changes'
> document, please advise.
>
> I will be posting my own suggestion, largely based on parts of the motion
> that Herbert and I drafted yesterday, in a reply to this email.
>
> CIF - Changes to the specification
> 05 July 2010
>
> This document specifies changes to the *syntax* of CIF. We refer to the
> current syntax specification of CIF as CIF1, and the new specification as
> CIF2. To date all archival CIFs are CIF1.
>
> The changes to syntax are necessitated by the adoption of new dictionary
> functionalities that introduce several extensions, including new data types,
> and method definitions using dREL.
>
> It is assumed the reader has a thorough understanding of the CIF1
> specification.
>
> TERMINOLOGY
>
> Reference to *character(s)* means abstract characters assigned code points
> by *Unicode*. Specific characters are referenced according to Unicode
> convention, U+*xxxx*[*x*[*x*]], where *xxxx*[*x*[*x*]] is the four- to
> six-digit hexadecimal representation of the assigned code point. The
> designated character encoding for CIF2 is UTF-8.
>
> Reference to *ASCII* characters means characters U+0000 through
> U+007F, or, equivalently the first 128 characters of the *ISO 8859-1*
> (*LATIN-1*) character set.
>
> Reference to *newline* or *\n* means the sequence that conventionally
> terminates a line record (which is environment dependent). *See Change 3.*
>
> Reference to *whitespace* means the characters ASCII space (U+0020),
> ASCII horizontal tab (U+0009) and the *newline* characters. Without
> regard to local convention, the various other characters that Unicode
> classifies as whitespace (character categories Zs and Zp) do not constitute
> *whitespace* for the purposes of CIF2.
>
> PREAMBLE
>
> CIF2 significantly extends CIF1 functionality, primarily through new
> dictionary features. CIF2 is not fully backwards-compatible with CIF1: many
> files compliant with CIF1 are also compliant with CIF2, but some are not
> (see especially change 5, below). The CIF1 standard will continue to operate
> for the foreseeable future in parallel with CIF2.
>
> CHANGE 1 - NEW (MAGIC CODE)
>
> A CIF2 file is uniquely identified by a required magic code at the
> beginning of its first line. The code is,
>
> #\#CIF_2.0
>
> followed immediately by *whitespace*.
>
> CHANGE 2 - NEW (CHARACTER SET)
>
> CIF2 files are standard variable length text files, which for compatibility
> with older processing systems will have a maximum line length of 2048
> characters. As discussed above and below, however, there are some
> restrictions on the character set for token delimiters, separators and data
> names.
>
> In keeping with XML restrictions we allow the characters
>
> U+0009 U+000A U+000D
> U+0020 - U+007E
> U+00A0 - U+D7FF
> U+E000 - U+FDCF
> U+FDF0 - U+FFFD
> U+10000 - U+10FFFD
>
> In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is
> any hexadecimal digit are disallowed. Unicode reserves the code points
> E000 - F8FF for private use. The IUCr and only the IUCr may specify what
> characters are assigned to these code points in the context of CIF2.
>
> *Reasoning: There is growing demand for the wider character set afforded
> by Unicode to be made available in applications, especially those where
> internationalisation is an issue.*
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From jamesrhester at gmail.com Fri Oct 1 06:17:26 2010
From: jamesrhester at gmail.com (James Hester)
Date: Fri, 1 Oct 2010 15:17:26 +1000
Subject: [Cif2-encoding] Drafting issues
In-Reply-To:
References:
Message-ID:

I have included below revised text, this time in plain text format in case the HTML format of the previous email was a problem.

I have made the following changes:

(1) In first paragraph of the TERMINOLOGY section I have written that UTF8 is the 'preferred' encoding of CIF2 rather than the 'designated' encoding. This is in keeping with the newfound status of UTF16 as an acceptable encoding

(2) In Change 1, I've slightly altered the text on encoding disambiguation that Herbert and I added and changed the commentary on BOM for clarity

(3) From the 3rd sentence to the end of the first paragraph of CHANGE 2, I have incorporated the paragraph that Herbert and I worked on, and largely removed the preceding paragraph of Herbert's motion. The removal of the preceding paragraph is an attempt to avoid confusion and because it was largely discussion, rather than specification. You will also note some small changes to Herbert's and my text, which I hope is acceptable under the rubric of cleanup. I've also added a clarificatory comment about UTF8 and UTF16.

In general this paragraph is quite scrappy and I believe could be better focussed. In particular, does anybody have an objection to UTF8 and UTF16 being the only acceptable CIF2 encodings where non-ASCII codepoints are present and in the absence of a disambiguation signature? If we accept this, then we can chop out the algorithmic determination part (which is really just a statement of principle, and by itself not something that you can write a program based on).

(4) I have changed 'text' in the first line of that same paragraph to read 'plain text'.
I hope this is acceptable as a proxy for a more long-winded definition. If it is not, then we need to add a full definition of 'text' to the definitions section, and I believe that there are several floating around that are acceptable to everybody.

If anybody has an objection to these changes, please identify the change, state your objection *precisely*, and give an alternative that would be satisfactory to you.

James.

==========================================================================
CIF - Changes to the specification
01 October 2010

This document specifies changes to the syntax of CIF. We refer to the current syntax specification of CIF as CIF1, and the new specification as CIF2. To date all archival CIFs are CIF1.

The changes to syntax are necessitated by the adoption of new dictionary functionalities that introduce several extensions, including new data types, and method definitions using dREL.

It is assumed the reader has a thorough understanding of the CIF1 specification.

TERMINOLOGY

Reference to character(s) means abstract characters assigned code points by Unicode. Specific characters are referenced according to Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit hexadecimal representation of the assigned code point. The preferred character encoding for CIF2 is UTF-8.

Reference to ASCII characters means characters U+0000 through U+007F, or, equivalently the first 128 characters of the ISO 8859-1 (LATIN-1) character set.

Reference to newline or \n means the sequence that conventionally terminates a line record (which is environment dependent). See Change 3.

Reference to whitespace means the characters ASCII space (U+0020), ASCII horizontal tab (U+0009) and the newline characters. Without regard to local convention, the various other characters that Unicode classifies as whitespace (character categories Zs and Zp) do not constitute whitespace for the purposes of CIF2.
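To make the whitespace definition above concrete, here is a small Python sketch (illustrative only, not draft text). It assumes that newline sequences are built from LF (U+000A) and CR (U+000D); the names CIF2_WHITESPACE, is_cif2_whitespace and split_on_cif2_whitespace are hypothetical. Python's own str.isspace() follows the Unicode classification, so it would accept characters such as U+00A0 that the definition above excludes.

```python
# Whitespace for CIF2 purposes: space, horizontal tab, and the characters
# assumed here to make up newline sequences (LF and CR).
CIF2_WHITESPACE = frozenset(" \t\n\r")


def is_cif2_whitespace(ch: str) -> bool:
    """True only for the characters the draft treats as whitespace."""
    return ch in CIF2_WHITESPACE


def split_on_cif2_whitespace(text: str) -> list:
    """Naive split on CIF2 whitespace; ignores CIF quoting rules entirely."""
    tokens = []
    current = []
    for ch in text:
        if ch in CIF2_WHITESPACE:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

The point of the explicit set is that a no-break space or an em space stays inside a token instead of separating two tokens.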

PREAMBLE

CIF2 significantly extends CIF1 functionality, primarily through new dictionary features. CIF2 is not fully backwards-compatible with CIF1: many files compliant with CIF1 are also compliant with CIF2, but some are not (see especially change 5, below). The CIF1 standard will continue to operate for the foreseeable future in parallel with CIF2.

CHANGE 1 - NEW (MAGIC CODE)

A CIF2 file is uniquely identified by a required magic code at the beginning of its first line. The code is,

#\#CIF_2.0

followed immediately by whitespace. The immediately following space on this line is reserved for encoding disambiguation signatures. Note that where a Unicode BOM is used, it would appear prior to the magic code in the byte stream and does not form part of the CIF text.

CHANGE 2 - NEW (CHARACTER SET)

CIF2 files are standard variable length plain text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names. For compatibility with CIF1 behaviour, there is no formal restriction on the encoding of CIF2 files providing they contain only code points from the ASCII range. If a CIF2 file contains characters equivalent to Unicode code points greater than U+007F (127 decimal), then the particular encoding used must either be UTF8 or algorithmically identifiable from the CIF2 file itself. Note that UTF16 with a BOM conforms to this requirement. The use of a BOM for Unicode encodings, including UTF8, is recommended. Acceptable identification algorithms will be published as necessary as annexes to this standard (see description of magic code and encoding disambiguation in Change 1). In the absence of an encoding disambiguation signature, it is safe to assume that the encoding of a CIF2 file containing characters outside the ASCII range is either UTF8 or UTF16.
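As a rough illustration of how a reader might apply the BOM and magic-code rules above (a sketch only, not draft text): the Python below looks for a UTF-8 or UTF-16 BOM and otherwise assumes UTF-8. That fallback is a simplifying assumption of this example, not something the paragraph above mandates. Encoding disambiguation signatures and other Unicode encodings (for example UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one) are deliberately ignored, and the names sniff_encoding and has_cif2_magic are hypothetical.

```python
import codecs

MAGIC_CODE = "#\\#CIF_2.0"      # the characters  #\#CIF_2.0
CIF2_WHITESPACE = " \t\r\n"     # space, tab, and assumed newline characters


def sniff_encoding(data: bytes) -> str:
    """Guess a Python codec name from a leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # decodes UTF-8 and drops the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"         # uses the BOM for byte order, then drops it
    return "utf-8"              # assumption of this sketch: no BOM means UTF-8


def has_cif2_magic(data: bytes) -> bool:
    """True if the decoded text starts with the magic code plus whitespace.

    The BOM, when present, is removed by the codec, so it is not treated
    as part of the CIF text. Undecodable input raises UnicodeDecodeError.
    """
    text = data.decode(sniff_encoding(data))
    if not text.startswith(MAGIC_CODE):
        return False
    rest = text[len(MAGIC_CODE):]
    return len(rest) > 0 and rest[0] in CIF2_WHITESPACE
```

For example, has_cif2_magic would be called on the first few hundred bytes of a file opened in binary mode, before any decision is made about how to parse the rest.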

In keeping with XML restrictions we allow the characters

U+0009 U+000A U+000D
U+0020 - U+007E
U+00A0 - U+D7FF
U+E000 - U+FDCF
U+FDF0 - U+FFFD
U+10000 - U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is any hexadecimal digit are disallowed. Unicode reserves the code points E000 - F8FF for private use. The IUCr and only the IUCr may specify what characters are assigned to these code points in the context of CIF2.

Reasoning: There is growing demand for the wider character set afforded by Unicode to be made available in applications, especially those where internationalisation is an issue.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From yaya at bernstein-plus-sons.com Fri Oct 1 12:53:22 2010
From: yaya at bernstein-plus-sons.com (Herbert J. Bernstein)
Date: Fri, 1 Oct 2010 07:53:22 -0400 (EDT)
Subject: [Cif2-encoding] Drafting issues
In-Reply-To:
References:
Message-ID:

Sigh, here we go again into the loop.

Yes, I object to saying

> UTF8 and UTF16 [are] the only acceptable CIF2 encodings where
> non-ASCII codepoints are present and in the absence of a
> disambiguation signature?"

Because there are BOMs for _all_ the various Unicode encodings
(and there are a lot), and we have not resolved any of the
disambiguation signatures, so I suggest changing

"In the absence of an encoding
> disambiguation signature, it is safe to assume that the encoding of a
> CIF2 file containing characters outside the ASCII range is either UTF8
> or UTF16."

to

"A CIF2 file containing characters outside the ASCII range with no BOM
and no disambiguation signature will be a UTF8 file. A CIF2 file
containing characters outside the ASCII range with a valid UTF8 or
UTF16 BOM and no disambiguation signature, will be a Unicode file
written in the indicated encoding."

Frankly, even though this revised sentence is reasonable, after
the fuss we have had over equally reasonable sentences, I think
it is a mistake to include either version at all. I can just hear
somebody, e.g. raising the issue that "...but, but, but, you
did not resolve the open question of the various canonicalized
encodings..." or "...but, but, but you did not deal with UCS2 versus
UTF16..."

As we have discovered, the encoding issue is a quagmire, the more
specific we try to get, the more we struggle, the more we get stuck
and CIF2 gets delayed.

Could we please, please, please, put an end to this!!!!!

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya at dowling.edu
=====================================================

From jamesrhester at gmail.com Fri Oct 1 13:59:42 2010
From: jamesrhester at gmail.com (James Hester)
Date: Fri, 1 Oct 2010 22:59:42 +1000
Subject: [Cif2-encoding] Drafting issues
In-Reply-To:
References:
Message-ID:

Herbert, you have proposed an entirely reasonable rewriting of what I proposed with an entirely reasonable justification. I'm happy to accept your new wording.

The worst is behind us, and we are currently mopping up. After making it through the mountain pass, surely you didn't expect to just fall off a cliff to the meadows below? Perhaps that should be a haiku:

Crunching through a snowlit pass
A distant eagle floats above the sunny meadows
Ah! The roads of the air.

On Fri, Oct 1, 2010 at 9:53 PM, Herbert J. Bernstein wrote:
> Sigh, here we go again into the loop.
>
> Yes, I object to saying
>>
>> UTF8 and UTF16 [are] the only acceptable CIF2 encodings where
>> non-ASCII codepoints are present and in the absence of a
>> disambiguation signature?"
>
> Because there are BOMs for _all_ the various Unicode encodings
> (and there are a lot), and we have not resolved any of the
> disambiguation signatures, so I suggest changing
>
> "In the absence of an encoding
>>
>> disambiguation signature, it is safe to assume that the encoding of a
>> CIF2 file containing characters outside the ASCII range is either UTF8
>> or UTF16."
>
> to
>
> "A CIF2 file containing characters outside the ASCII range with no BOM
> and no disambiguation signature will be a UTF8 file.
?A CIF2 file > containing charcaters outside the ASCII range with a valid UTF8 or > UTF16 BOM and no disambiguation signature, will be a Unicode file > written in the indicated encoding." > > Frankly, even thought this revised sentence is reaonable, after > the fuss we have had over equally reasoable sentences, I think > it is a mistake to include either version at all. ?I can just hear > somebody, e.g. raising the issue that "...but, but, but, you > did not resolve the open question of the various canoncalized > encodings..." or "...but, but, but you did not deal with UCS2 versus > UTF16..." > > As we have discovered, the encoding issue is a quagmire, the more > specific we try to get, the more we struggle, the more we get stuck > and CIF2 gets delayed. > > > Could we please, please, please, put an end to this!!!!! > > Regards, > ?Herbert > > > > ===================================================== > ?Herbert J. Bernstein, Professor of Computer Science > ? Dowling College, Kramer Science Center, KSC 121 > ? ? ? ?Idle Hour Blvd, Oakdale, NY, 11769 > > ? ? ? ? ? ? ? ? +1-631-244-3035 > ? ? ? ? ? ? ? ? yaya at dowling.edu > ===================================================== > > On Fri, 1 Oct 2010, James Hester wrote: > >> I have included below revised text, this time in plain text format in >> case the HTML format of the previous email was a problem. >> >> I have made the following changes: >> >> (1) In first paragraph of the TERMINOLOGY section I have written that >> UTF8 is the 'preferred' encoding of CIF2 rather than the 'designated' >> encoding. 
>> This is in keeping with the newfound status of UTF16 as an
>> acceptable encoding.
>> (2) In Change 1, I've slightly altered the text on encoding
>> disambiguation that Herbert and I added and changed the commentary on
>> BOM for clarity.
>> (3) From the 3rd sentence to the end of the first paragraph of CHANGE
>> 2, I have incorporated the paragraph that Herbert and I worked on, and
>> largely removed the preceding paragraph of Herbert's motion. The
>> removal of the preceding paragraph is an attempt to avoid confusion
>> and because it was largely discussion, rather than specification. You
>> will also note some small changes to Herbert's and my text, which I
>> hope are acceptable under the rubric of cleanup. I've also added a
>> clarificatory comment about UTF8 and UTF16.
>> In general this paragraph is quite scrappy and I believe could be
>> better focussed. In particular, does anybody have an objection to
>> UTF8 and UTF16 being the only acceptable CIF2 encodings where
>> non-ASCII codepoints are present and in the absence of a
>> disambiguation signature? If we accept this, then we can chop out the
>> algorithmic determination part (which is really just a statement of
>> principle, and by itself not something that you can write a program
>> based on).
>> (4) I have changed 'text' in the first line of that same paragraph to
>> read 'plain text'. I hope this is acceptable as a proxy for a more
>> long-winded definition. If it is not, then we need to add a full
>> definition of 'text' to the definitions section, and I believe that
>> there are several floating around that are acceptable to everybody.
>>
>> If anybody has an objection to these changes, please identify the
>> change, state your objection *precisely*, and give an alternative that
>> would be satisfactory to you.
>>
>> James.
>>
>> ==========================================================================
>> CIF - Changes to the specification
>> 01 October 2010
>> This document specifies changes to the syntax of CIF. We refer to the
>> current syntax specification of CIF as CIF1, and the new specification
>> as CIF2. To date all archival CIFs are CIF1.
>> The changes to syntax are necessitated by the adoption of new
>> dictionary functionalities that introduce several extensions,
>> including new data types, and method definitions using dREL.
>> It is assumed the reader has a thorough understanding of the CIF1
>> specification.
>> TERMINOLOGY
>> Reference to character(s) means abstract characters assigned code
>> points by Unicode. Specific characters are referenced according to
>> Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to
>> six-digit hexadecimal representation of the assigned code point. The
>> preferred character encoding for CIF2 is UTF-8.
>> Reference to ASCII characters means characters U+0000 through
>> U+007F, or, equivalently the first 128 characters of the ISO 8859-1
>> (LATIN-1) character set.
>> Reference to newline or \n means the sequence that conventionally
>> terminates a line record (which is environment dependent). See Change
>> 3.
>> Reference to whitespace means the characters ASCII space (U+0020),
>> ASCII horizontal tab (U+0009) and the newline characters. Without
>> regard to local convention, the various other characters that Unicode
>> classifies as whitespace (character categories Zs and Zp) do not
>> constitute whitespace for the purposes of CIF2.
>> PREAMBLE
>> CIF2 significantly extends CIF1 functionality, primarily through new
>> dictionary features. CIF2 is not fully backwards-compatible with CIF1:
>> many files compliant with CIF1 are also compliant with CIF2, but some
>> are not (see especially change 5, below). The CIF1 standard will
>> continue to operate for the foreseeable future in parallel with CIF2.
>> CHANGE 1 - NEW (MAGIC CODE)
>> A CIF2 file is uniquely identified by a required magic code at the
>> beginning of its first line. The code is,
>> #\#CIF_2.0
>> followed immediately by whitespace. The immediately following space
>> on this line is reserved for encoding disambiguation signatures. Note
>> that where a Unicode BOM is used, it would appear prior to the magic
>> code in the byte stream and does not form part of the CIF text.
>> CHANGE 2 - NEW (CHARACTER SET)
>> CIF2 files are standard variable length plain text files, which for
>> compatibility with older processing systems will have a maximum line
>> length of 2048 characters. As discussed above and below, however,
>> there are some restrictions on the character set for token delimiters,
>> separators and data names. For compatibility with CIF1 behaviour,
>> there is no formal restriction on the encoding of CIF2 files providing
>> they contain only code points from the ASCII range. If a CIF2 file
>> contains characters equivalent to Unicode code points greater than
>> U+007F (127 decimal), then the particular encoding used must either be
>> UTF8 or algorithmically identifiable from the CIF2 file itself. Note
>> that UTF16 with a BOM conforms to this requirement. The use of a BOM
>> for Unicode encodings, including UTF8, is recommended. Acceptable
>> identification algorithms will be published as necessary as annexes to
>> this standard (see description of magic code and encoding
>> disambiguation in Change 1). In the absence of an encoding
>> disambiguation signature, it is safe to assume that the encoding of a
>> CIF2 file containing characters outside the ASCII range is either UTF8
>> or UTF16.
>> In keeping with XML restrictions we allow the characters
>> U+0009 U+000A U+000D
>> U+0020 - U+007E
>> U+00A0 - U+D7FF
>> U+E000 - U+FDCF
>> U+FDF0 - U+FFFD
>> U+10000 - U+10FFFD
>> In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where
>> x is any hexadecimal digit are disallowed. Unicode reserves the code
>> points E000 - F8FF for private use. The IUCr and only the IUCr may
>> specify what characters are assigned to these code points in the
>> context of CIF2.
>> Reasoning: There is growing demand for the wider character set
>> afforded by Unicode to be made available in applications, especially
>> those where internationalisation is an issue.
>>
>> On Fri, Oct 1, 2010 at 2:44 PM, James Hester
>> wrote:
>>>
>>> Before I post my revised text, I have only just realised (upon close
>>> perusal of the two texts) that Herbert's motion is substantially the same as
>>> the 'Changes' document, just without the headings etc, so we are discussing
>>> almost the same document. My apologies for the confusion.
>>>
>>> James.
>>>
>>> On Fri, Oct 1, 2010 at 2:37 PM, James Hester
>>> wrote:
>>>>
>>>> Dear Group,
>>>>
>>>> As I think we have reached a consensus in principle, and are now moving
>>>> into discussion of precise definitions, let us have wording arguments only
>>>> once (that is, for a single document). I think that our base document must
>>>> be the one that the DDLm group agreed on - the link once again is
>>>> http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf
>>>> - simply because it will be unnecessarily confusing for the DDLm group to
>>>> deal with two documents at once, and the 'Changes' document is admirably
>>>> precise. I reiterate once again that I am happy with the motion that
>>>> Herbert presented, with the proviso that one paragraph is rewritten as I
>>>> have recently proposed. Herbert - if you would like to negotiate that
>>>> paragraph with me by Skype, I'm happy to do that too.
>>>>
>>>> I have appended a text version of what I consider to be the relevant
>>>> sections of the 'changes' document to this message.
>>>> I am happy to provide
>>>> the complete document in OpenOffice format to anybody who would like it.
>>>> Herbert - if you think any of the non-encoding discussion in your motion is
>>>> not already covered in the 'Changes' document, please advise.
>>>>
>>>> I will be posting my own suggestion, largely based on parts of the
>>>> motion that Herbert and I drafted yesterday, in a reply to this email.
>>>>
>>>> CIF - Changes to the specification
>>>>
>>>> 05 July 2010
>>>>
>>>> This document specifies changes to the syntax of CIF. We refer to the
>>>> current syntax specification of CIF as CIF1, and the new specification as
>>>> CIF2. To date all archival CIFs are CIF1.
>>>>
>>>> The changes to syntax are necessitated by the adoption of new dictionary
>>>> functionalities that introduce several extensions, including new data types,
>>>> and method definitions using dREL.
>>>>
>>>> It is assumed the reader has a thorough understanding of the CIF1
>>>> specification.
>>>>
>>>> TERMINOLOGY
>>>>
>>>> Reference to character(s) means abstract characters assigned code points
>>>> by Unicode. Specific characters are referenced according to Unicode
>>>> convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit
>>>> hexadecimal representation of the assigned code point. The designated
>>>> character encoding for CIF2 is UTF-8.
>>>>
>>>> Reference to ASCII characters means characters U+0000 through
>>>> U+007F, or, equivalently the first 128 characters of the ISO 8859-1
>>>> (LATIN-1) character set.
>>>>
>>>> Reference to newline or \n means the sequence that conventionally
>>>> terminates a line record (which is environment dependent). See Change 3.
>>>>
>>>> Reference to whitespace means the characters ASCII space (U+0020),
>>>> ASCII horizontal tab (U+0009) and the newline characters.
>>>> Without regard
>>>> to local convention, the various other characters that Unicode classifies as
>>>> whitespace (character categories Zs and Zp) do not constitute whitespace for
>>>> the purposes of CIF2.
>>>>
>>>> PREAMBLE
>>>>
>>>> CIF2 significantly extends CIF1 functionality, primarily through new
>>>> dictionary features. CIF2 is not fully backwards-compatible with CIF1: many
>>>> files compliant with CIF1 are also compliant with CIF2, but some are not
>>>> (see especially change 5, below). The CIF1 standard will continue to operate
>>>> for the foreseeable future in parallel with CIF2.
>>>>
>>>> CHANGE 1 - NEW (MAGIC CODE)
>>>>
>>>> A CIF2 file is uniquely identified by a required magic code at the
>>>> beginning of its first line. The code is,
>>>>
>>>> #\#CIF_2.0
>>>>
>>>> followed immediately by whitespace.
>>>>
>>>> CHANGE 2 - NEW (CHARACTER SET)
>>>>
>>>> CIF2 files are standard variable length text files, which for
>>>> compatibility with older processing systems will have a maximum line length
>>>> of 2048 characters. As discussed above and below, however, there are some
>>>> restrictions on the character set for token delimiters, separators and data
>>>> names.
>>>>
>>>> In keeping with XML restrictions we allow the characters
>>>>
>>>> U+0009 U+000A U+000D
>>>> U+0020 - U+007E
>>>> U+00A0 - U+D7FF
>>>> U+E000 - U+FDCF
>>>> U+FDF0 - U+FFFD
>>>> U+10000 - U+10FFFD
>>>>
>>>> In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x
>>>> is any hexadecimal digit are disallowed. Unicode reserves the code points
>>>> E000 - F8FF for private use. The IUCr and only the IUCr may specify what
>>>> characters are assigned to these code points in the context of CIF2.
>>>>
>>>> Reasoning: There is growing demand for the wider character set afforded
>>>> by Unicode to be made available in applications, especially those where
>>>> internationalisation is an issue.
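
[Illustrative note, not part of the original messages.] The character-set rule in CHANGE 2 of the quoted draft can be read as a simple membership test. The following is only a minimal sketch under that reading: the names ALLOWED_RANGES, is_allowed_cif2_char and first_disallowed are hypothetical, and the ranges are copied from the enumeration quoted above rather than from any final specification.

```python
# Allowed code point ranges, copied from the CHANGE 2 enumeration in the
# quoted draft (an assumption: the final CIF2 text may differ).
ALLOWED_RANGES = [
    (0x0009, 0x000A),    # horizontal tab, line feed
    (0x000D, 0x000D),    # carriage return
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]


def is_allowed_cif2_char(ch: str) -> bool:
    """Return True if a single character falls inside the drafted set."""
    cp = ord(ch)
    # U+FEFF is explicitly disallowed even though it lies inside a range.
    if cp == 0xFEFF:
        return False
    # U+xFFFE and U+xFFFF are disallowed in every plane.
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def first_disallowed(text: str) -> int:
    """Return the index of the first disallowed character, or -1 if none."""
    for i, ch in enumerate(text):
        if not is_allowed_cif2_char(ch):
            return i
    return -1
```

Note that this operates on decoded characters, so it is independent of whichever encoding the file was stored in.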
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

From John.Bollinger at STJUDE.ORG Fri Oct 1 15:10:08 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Fri, 1 Oct 2010 09:10:08 -0500
Subject: [Cif2-encoding] Drafting issues
In-Reply-To:
References:
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFB@SJMEMXMBS11.stjude.sjcrh.local>

On Friday, October 01, 2010 8:00 AM, James Hester wrote:

>Herbert, you have proposed an entirely reasonable rewriting of what I
>proposed with an entirely reasonable justification. I'm happy to
>accept your new wording.

I, too, think it an improvement. I believe that gives us the following text in Change 2, preceding the character set enumeration (my formatting):

====
CIF2 files are standard variable length plain text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.
For compatibility with CIF1 behaviour, there is no formal restriction on the encoding of CIF2 files providing they contain only code points from the ASCII range.
If a CIF2 file contains characters equivalent to Unicode code points greater than U+007F (127 decimal), then the particular encoding used must either be UTF8 or algorithmically identifiable from the CIF2 file itself. Note that UTF16 with a BOM conforms to this requirement. The use of a BOM for Unicode encodings, including UTF8, is recommended. Acceptable identification algorithms will be published as necessary as annexes to this standard (see description of magic code and encoding disambiguation in Change 1). A CIF2 file containing characters outside the ASCII range with no BOM and no disambiguation signature will be a UTF8 file. A CIF2 file containing characters outside the ASCII range with a valid UTF8 or UTF16 BOM and no disambiguation signature, will be a Unicode file written in the indicated encoding.
====

I think with that we have reached an acceptable position. I do propose three editorial changes, however, that I intend to clarify the wording without changing its meaning in any way:

1) I suggest that Herb's new text (the last two sentences above) be made the first annex, as it in fact constitutes the first acceptable identification algorithm that is defined. Alternatively, let us slightly reword the preceding text to clarify that the last sentences describe one acceptable algorithm among potentially several.

2) I furthermore suggest that the sentence "Note that UTF16 with a BOM conforms to this requirement" be deleted, for that is redundant as a consequence of Herb's wording.

3) Finally, I recommend moving the sentence "The use of a BOM for Unicode encodings, including UTF8, is recommended" to the end of that passage, so as to place the comments about acceptable identification algorithms immediately after the requirement that some encodings be "algorithmically identifiable". This will form a clearer logical progression.

I hope these changes will be adopted, but my acceptance of the proposal is not conditioned on that.
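
[Illustrative note, not part of the original messages.] The two closing sentences of the drafted paragraph amount to a small decision procedure, which might be sketched in Python as below. This is a sketch under stated assumptions, not an agreed algorithm: the function name identify_cif2_encoding is hypothetical, the return values are Python codec names chosen for convenience, and encoding disambiguation signatures are ignored because the thread has not yet defined them.

```python
import codecs


def identify_cif2_encoding(data: bytes) -> str:
    """Guess a Python codec name for raw CIF2 bytes.

    Follows only the rules quoted above; disambiguation signatures are
    not handled because none have been specified.
    """
    # A valid UTF8 or UTF16 BOM means a Unicode file in that encoding.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # this codec strips the BOM when decoding
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        # Caveat: a UTF-32 LE BOM also begins with FF FE. The quoted text
        # only addresses UTF8 and UTF16, so this sketch does not try to
        # tell the two apart.
        return "utf-16"      # this codec reads the BOM to pick byte order
    # Only ASCII-range bytes: the draft places no formal restriction on
    # the encoding, so decoding as ASCII is sufficient here.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Non-ASCII content with no BOM and no signature: a UTF8 file.
    return "utf-8"
```

A caller could then decode with `data.decode(identify_cif2_encoding(data))`; a file that claims to be UTF8 under these rules but is not valid UTF8 would raise `UnicodeDecodeError` at that point.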

>The worst is behind us, and we are currently mopping up. After making
>it through the mountain pass, surely you didn't expect to just fall
>off a cliff to the meadows below? Perhaps that should be a haiku:
>
>Crunching through a snowlit pass
>A distant eagle floats above the sunny meadows
>Ah! The roads of the air.

The goal before us,
our travail yields a bounty.
My spirit aloft.

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From John.Bollinger at STJUDE.ORG Fri Oct 1 15:44:26 2010
From: John.Bollinger at STJUDE.ORG (Bollinger, John C)
Date: Fri, 1 Oct 2010 09:44:26 -0500
Subject: [Cif2-encoding] Drafting issues
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFB@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFB@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFD@SJMEMXMBS11.stjude.sjcrh.local>

On Friday, October 01, 2010 9:10 AM, I wrote:

>I think with that we have reached an acceptable position. I do
>propose three editorial changes, however, that I intend to clarify
>the wording without changing its meaning in any way:

Here is specific proposed wording that realizes my suggestions, keeping everything in the same section rather than moving anything to an annex:

====
CIF2 files are standard variable length plain text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.

For compatibility with CIF1 behaviour, there is no formal restriction on the encoding of CIF2 files, providing they contain only code points from the ASCII range.
If a CIF2 file contains characters equivalent to Unicode code points greater than U+007F (127 decimal), then the particular encoding used must either be UTF8 or algorithmically identifiable from the CIF2 file itself. Acceptable identification algorithms will be published as necessary as annexes to this standard (see description of magic code and encoding disambiguation in Change 1). Annexes notwithstanding,
(i) a CIF2 file containing characters outside the ASCII range with no BOM and no disambiguation signature will be a UTF8 file, and
(ii) a CIF2 file containing characters outside the ASCII range with a valid UTF8 or UTF16 BOM and no disambiguation signature, will be a Unicode file written in the indicated encoding.

The use of a BOM for Unicode encodings, including UTF8, is recommended.
====

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

From jamesrhester at gmail.com Tue Oct 5 12:44:16 2010
From: jamesrhester at gmail.com (James Hester)
Date: Tue, 5 Oct 2010 22:44:16 +1100
Subject: [Cif2-encoding] Drafting issues
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFD@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA5416659DEDFB@SJMEMXMBS11.stjude.sjcrh.local> <8F77913624F7524AACD2A92EAF3BFA5416659DEDFD@SJMEMXMBS11.stjude.sjcrh.local>
Message-ID:

There having been no objections to this rewrite, I will now incorporate it into the main document and submit the whole document to the DDLm group for their approval.

James.

On Sat, Oct 2, 2010 at 1:44 AM, Bollinger, John C wrote:
>
> On Friday, October 01, 2010 9:10 AM, I wrote:
>
>>I think with that we have reached an acceptable position.
>>I do
>>propose three editorial changes, however, that I intend to clarify
>>the wording without changing its meaning in any way:
>
> Here is specific proposed wording that realizes my suggestions, keeping everything in the same section rather than moving anything to an annex:
>
> ====
> CIF2 files are standard variable length plain text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.
>
> For compatibility with CIF1 behaviour, there is no formal restriction on the encoding of CIF2 files, providing they contain only code points from the ASCII range. If a CIF2 file contains characters equivalent to Unicode code points greater than U+007F (127 decimal), then the particular encoding used must either be UTF8 or algorithmically identifiable from the CIF2 file itself. Acceptable identification algorithms will be published as necessary as annexes to this standard (see description of magic code and encoding disambiguation in Change 1). Annexes notwithstanding,
> (i) a CIF2 file containing characters outside the ASCII range with no BOM and no disambiguation signature will be a UTF8 file, and
> (ii) a CIF2 file containing characters outside the ASCII range with a valid UTF8 or UTF16 BOM and no disambiguation signature, will be a Unicode file written in the indicated encoding.
>
> The use of a BOM for Unicode encodings, including UTF8, is recommended.
> ====
>
> Regards,
>
> John

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
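
[Illustrative note, not part of the original messages.] On the writing side, the agreed wording recommends a BOM even for UTF8, and the draft quoted earlier in the thread places the BOM before the magic code in the byte stream. A minimal sketch of that layout follows. The names MAGIC, to_cif2_bytes and starts_with_magic are hypothetical, and treating a carriage return as acceptable whitespace after the magic code is an assumption based on the draft's environment-dependent definition of newline.

```python
import codecs

# The magic code from CHANGE 1 of the quoted draft: #\#CIF_2.0
MAGIC = b"#\\#CIF_2.0"


def to_cif2_bytes(body: str) -> bytes:
    """Encode CIF2 text as UTF8 with a leading BOM and magic code line."""
    text = MAGIC.decode("ascii") + "\n" + body
    # The "utf-8-sig" codec prepends the UTF8 BOM on encoding.
    return text.encode("utf-8-sig")


def starts_with_magic(data: bytes) -> bool:
    """Check for the magic code, skipping an optional UTF8 BOM."""
    # The BOM precedes the magic code and is not part of the CIF text.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    if not data.startswith(MAGIC):
        return False
    # The code must be followed immediately by whitespace (space, tab or
    # a newline character; accepting "\r" here is an assumption).
    following = data[len(MAGIC):len(MAGIC) + 1]
    return following in (b" ", b"\t", b"\n", b"\r")
```

For example, `to_cif2_bytes("data_example\n")` yields bytes that begin with EF BB BF followed by the magic code line, which `starts_with_magic` accepts.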