[Cif2-encoding] A new(?) compromise position

Herbert J. Bernstein yaya at bernstein-plus-sons.com
Thu Sep 30 03:20:06 BST 2010


Dear James,

   You are mistaken; John said the opposite about determining UTF8 from
context.  The place where he and I differ is that he thinks you cannot do
it even with a BOM, while I am willing to accept UTF8 with a BOM as
sufficiently disambiguated.

   The Wikipedia article says no such thing about being able to reliably
detect non-UTF8 files.

   Let us use Wikipedia's very own UTF8 example as a test case:

The code point U+00A2 (binary 00000000 10100010) encodes in UTF8 as
11000010 10100010, which is 0xC2 0xA2 as a UTF8 byte string.

   Now let us look at the Latin-1 code page at 
http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm

which tells us that 0xC2 is Â (capital A with circumflex) in Latin-1 and
0xA2 is ¢ (the cent sign).
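
   To make the ambiguity concrete, here is a minimal sketch (in Python,
purely illustrative) showing that the same two bytes decode without error
under both interpretations:

    # The two-byte sequence from the Wikipedia example.
    data = bytes([0xC2, 0xA2])

    # Read as UTF8 the two bytes form a single character: U+00A2, the cent sign.
    print(data.decode('utf-8'))

    # Read as Latin-1 the same bytes are two perfectly legal characters:
    # 0xC2 = capital A with circumflex, 0xA2 = cent sign.
    print(data.decode('latin-1'))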

   There is no evidence that I have seen anywhere to support your position
that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8
files can be reliably detected."  All the evidence I have seen points the
other way.
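
   To be clear about what such "reliable detection" amounts to in practice,
here is a minimal sketch (in Python, purely illustrative) of the usual
validity check, together with why passing it proves very little:

    def looks_like_utf8(raw):
        """True if the byte stream decodes as UTF8 without error."""
        try:
            raw.decode('utf-8', errors='strict')
            return True
        except UnicodeDecodeError:
            return False

    # The Latin-1 text consisting of A-circumflex followed by the cent
    # sign is the byte string 0xC2 0xA2, which passes this test.  Passing
    # the test therefore does not prove that a file really is UTF8.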

   James, please look at the facts -- UTF8 without a BOM does not fit your 
characterization of being self-identifying.
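
   To be concrete about the check I am willing to accept, here is a minimal
sketch (in Python, purely illustrative, with a hypothetical helper name) of
a BOM test along the lines I have in mind:

    import codecs

    def sniff_bom(raw):
        """Return a decodable encoding name if a recognised BOM is present."""
        if raw.startswith(codecs.BOM_UTF8):          # EF BB BF
            return 'utf-8-sig'                       # strips the BOM on decode
        if raw.startswith(codecs.BOM_UTF16_LE) or \
           raw.startswith(codecs.BOM_UTF16_BE):      # FF FE or FE FF
            return 'utf-16'                          # the codec consumes the BOM
        return None                                  # no BOM present

With no BOM the function returns None and the reader is back to guessing
from context, which is precisely the situation I am objecting to.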

   Regards,
     Herbert 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya at dowling.edu
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

> My simple objective (for files containing non-ASCII characters) is that an application is
> able to determine the encoding of an incoming file with a high degree of certainty with no
> information beyond the CIF standard, the encoding standard, and the file contents.  If the
> only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding. 
> Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably
> detected(*).  This appears to me to be an excellent state of affairs.
> 
> Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding
> paragraph, this is simply in an effort to get something workable on the table that we can
> move forward with.  I have no particular agenda to limit in future the possible encodings
> for CIF files, provided that those encodings can be reliably identified subject to the above
> restrictions.  Indeed, this particular group was formed in part to work out a system for
> including those other encodings. 
> 
> I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on
> the principle I hope we are able to polish it up to everybody's satisfaction.
> 
> James.
> 
> (*) John has addressed the UTF8 question adequately in his post, and the Wikipedia
> entry also contains useful discussion.
> 
> On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <yaya at bernstein-plus-sons.com> wrote:
>       Dear James,
>
>        I know from long and painful experience that files with just a few accented
>       characters are very, very difficult to clearly identify, and can look like valid
>       UTF8 files.  UTF8 is _not_ self-identifying without the BOM.
>
>        The case that really convinced me that there was a problem was a
>       French document with a lower case e with an acute accent.  I nearly
>       missed a misencoding of a Mac-native file that, because it was being misread
>       as UTF8, showed a capital E with a grave accent instead.
>
>        There are simply too many cases like that, in which a file written in a non-UTF8
>       encoding looks like something reasonable but wrong, to say that UTF8 without the
>       BOM is self-identifying.
>
>        As for the question of standards and applications, many programming
>       language standards specify the action of processors of the language.
>       In our case, to have a meaningful standard, we need to specify what
>       is a syntactically valid CIF2 file, to specify the semantics for
>       a compliant CIF2 reader and specify the required actions for
>       a compliant CIF2 writer.  We need to do so in a way that breaks
>       as few existing applications as possible.
>
>        I believe that applications are highly relevant to what we are trying to do.
>        In particular, I favor strict rules on writers and liberal rules
>       on readers, so that files get processed when possible, but tend to get
>       cleaned up when being processed.
>
>        That same frame of mind is why a lot of text editors invisibly add
>       a BOM at the start of all UTF8 files, but try to accept UTF8 files
>       with or without the BOM.
> 
> 
>  Regards,
>    Herbert
> 
> 
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
> 
>                 +1-631-244-3035
>                 yaya at dowling.edu
> =====================================================
> 
> On Thu, 30 Sep 2010, James Hester wrote:
>
>       Hi Herbert (I should be in bed, but whatever): I do not think it is
>       appropriate to require the *application* to unambiguously identify the
>       encoding, as no widely-recognised standard procedure exists to do this.
>       The means of identification should rather be based on the international
>       standard describing the encoding.  Only UTF16 and UTF8 currently meet
>       this requirement, I believe.  I will try to express this better after a
>       sleep...
>
>       Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
>       identifying file encoding.  A UTF8 file, naturally, can also look like a
>       file in a variety of single-byte encodings regardless of a BOM at the
>       front.  However, a file in a non-UTF8 encoding is highly unlikely to be
>       mistaken for a UTF8 file.  Therefore, providing an input file is first
>       checked for UTF8 encoding, I do not see any significant danger of a
>       mistaken encoding.  I'd be happy to include recommendations to use a
>       UTF8 BOM and to check for UTF8 encoding before any others that we may
>       eventually add to the list.
>
>       I'm curious to see what these files are that you have trouble
>       identifying as UTF8, as they may represent obscure corner cases.  Any
>       chance you could dig one or two up?
>
>       James.
>       On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein
>       <yaya at bernstein-plus-sons.com> wrote:
>            Dear James,
>
>             I respect the attempt to compromise, but the sentence "At
>            present only UTF8 and UTF16 are considered to satisfy this
>            constraint" is not quite right without some additional work on
>            the spec.  UTF16 with a BOM is self-identifying.  UTF8 with a
>            BOM is also self-identifying.  However, UTF8 without a BOM and
>            without some other disambiguator (e.g. the accented o's) is
>            _not_ self-identifying.  I know, because my students and I hit
>            this problem all the time in working with multi-language,
>            multi-code-page message catalogs for RasMol.  Sometimes the
>            only way we can figure out whether a UTF8 file is really a
>            UTF8 file is to start translating the actual strings and see
>            if they make sense.
>
>             Another problem is what the "ASCII range" means to various
>            people.  I suggest being much more restrictive and saying "the
>            printable ASCII characters, code points 32-126, plus CR, LF and HT".
>
>             Combining these, the statement I would suggest is:
>
>            If a CIF2 text stream contains only characters equivalent to the
>            printable ASCII characters plus HT, LF and CR, i.e. decimal code
>            points 32-126, 9, 10 and 13, then to ensure compatibility with
>            CIF1, the CIF2 specification does not require any explicit
>            specification of the particular encoding used, but recommends
>            the use of UTF8.  If a CIF2 text stream contains any characters
>            equivalent to Unicode code points not in that range, then for
>            any encoding other than UTF8 it is the responsibility of any
>            application writing such a CIF to unambiguously specify the
>            particular encoding used, preferably within the file itself.
>            UTF16 with a BOM conforms to this requirement.
>
>             Regards,
>               Herbert
>
>            =====================================================
>             Herbert J. Bernstein, Professor of Computer Science
>              Dowling College, Kramer Science Center, KSC 121
>                   Idle Hour Blvd, Oakdale, NY, 11769
>
>                            +1-631-244-3035
>                            yaya at dowling.edu
>            =====================================================
> 
>
>       On Thu, 30 Sep 2010, James Hester wrote:
>
>            Here is a newish compromise:
>
>            Encoding: The encoding of CIF2 text streams containing only
>            code points in the ASCII range is not specified.  CIF2 text
>            streams containing any code points outside the ASCII range
>            must be encoded such that the encoding can be reliably
>            identified from the file contents.  At present only UTF8 and
>            UTF16 are considered to satisfy this constraint.
>
>            Commentary: this is intended to mean that encoding works 'as
>            for CIF1' (Proposals 1, 2) for files containing only ASCII
>            text, and works as for Proposal 4 for any other files.  I
>            believe that this allows legacy workflows to operate smoothly
>            on CIF2 files (legacy workflows do not process non-ASCII
>            text) but also avoids the tower of Babel effect that will
>            ensue if non-ASCII code points are encoded using local
>            conventions.
>
>            To explain the thinking further, perhaps I could take another
>            stab at Herbert's point of view in my own words.  Herbert (I
>            think correctly) surmises that all currently used CIF
>            applications do not explicitly specify the encoding of their
>            input and output files, and are therefore conceptually
>            working with CIFs in a variety of local encodings.  Mandating
>            any encoding for CIF2 would therefore force at least some and
>            perhaps most of these applications to change the way they
>            read and write text, which is disruptive and obtuse when the
>            system works fine as it is.  Proposals 1 and 2 are aimed at
>            avoiding this disruption.
>
>            On the other hand, I look at the same situation and see that
>            all this software is in fact reading and writing ASCII,
>            because all of these local encodings are actually equivalent
>            to ASCII for the characters used in CIFs, and I further
>            assert that this happy coincidence between encodings is the
>            single reason CIF files are easily transferable between
>            different systems.
>
>            These two points of view create two different results if the
>            CIF character repertoire is extended beyond the ASCII range.
>            If we allow the current approach to encoding to continue, the
>            happy coincidence of encodings ceases to operate outside the
>            ASCII range and CIF files are no longer easily
>            interchangeable.  If we make explicit the commonality of CIF1
>            encodings by mandating a common set of identifiable
>            encodings, the use of default encodings has to be abandoned,
>            with accompanying effort from programmers.
>
>            I believe that this latest proposal respects Herbert's
>            concerns as well as mine, and is eminently workable as a
>            starting point for going forward.  I'm now off to do a sample
>            change and expect unanimous support from all parties when I
>            return in an hour's time :)
>
>            On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon
>            <bm at iucr.org> wrote:
>                 I think the crux of the issue is as follows:
>
>                 [But part of our difficulty is that we are all having
>                 separate epiphanies, and focusing on five different
>                 "cruxes". Clarifying the real divergence between our
>                 views would be a genuine benefit of a Skype conference,
>                 to which I have no personal objection.]
>
>                 In the real world, a need may arise to exchange CIFs
>                 constructed in non-canonical encodings. ("Canonical"
>                 probably means UTF-8 and/or UTF-16). Such a need would
>                 involve some transcoding strategy.
>
>                 What is the actual likelihood of that need arising?
>
>                 I would characterise James's position as "not very, and
>                 even less if the software written to generate CIFs is
>                 constrained to use canonical encodings within the
>                 standard".
>
>                 I would characterise the position of the rest of us as
>                 "reasonable to high, so that we wish to formulate the
>                 standard in a way that recognises non-canonical
>                 encodings and helps to establish or at least inform
>                 appropriate transcoding strategies".  There appear to be
>                 strong disagreements among us, but in fact there's a lot
>                 of common ground, and a drafting exercise would probably
>                 move us towards a consensus.
>
>                 Do you agree that that is a fair assessment?
>
>                 If so, we can analyse further: what are the implications
>                 of mandating a canonical encoding or not if judgement
>                 (a) is wrong and if judgement (b) is wrong?  My feeling
>                 is that the world will not end - or even change very
>                 much - in any case; but it could determine whether we
>                 need to formulate an optimal transcoding strategy now,
>                 or can defer it to a later date.
>
>                 However, if anyone thinks this is just another
>                 diversion, I'll drop this line of approach so as not to
>                 slow things down even more.
>
>                 Regards
>                 Brian
>
>            On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J.
>            Bernstein wrote:
>            > John,
>            >
>            > Now I am totally confused about what you are proposing and
>            > agree with Simon that what is needed is for you to state
>            > your proposal as the precise wording that you propose to
>            > insert and/or change in the current CIF2 change document
>            > "5 July 2010: draft of changes to the existing CIF 1.1
>            > specification for public discussion".
>            >
>            > If I understand your proposal correctly, the _only_ thing
>            > you are proposing that differs in any way from my proposed
>            > motion is a mandate that a CIF2 conformant reader must be
>            > able to read a UTF8 CIF2 file, but that _no_ CIF application
>            > would actually be required to provide such code, provided
>            > there was some mechanism available to transcode from UTF8 to
>            > the local encoding, which does not seem to be a mandate on
>            > the conformant CIF2 reader at all, but a requirement for the
>            > provision of a portable utility to do that external
>            > transcoding.
>            >
>            > If that is the case, wouldn't it make more sense to just
>            > provide that utility than to argue about whether my motion
>            > requires somebody to write their own?  Having the utility in
>            > hand would avoid having multiple, conflicting
>            > interpretations of this input transcoding requirement.
>            >
>            > If I have read your message correctly, please just write the
>            > utility you are proposing.  If I have read your message
>            > incorrectly, please write the specification changes you
>            > propose for the draft changes in place of the changes in my
>            > motion.
>            >
>            > _This_ is why it was, is, and will remain a good idea to
>            > simply have a meeting and talk these things out.
>            >
>            >
>            >
>
>            --
>            T +61 (02) 9717 9907
>            F +61 (02) 9717 3145
>            M +61 (04) 0249 4148
> 
>
>       _______________________________________________
>       cif2-encoding mailing list
>       cif2-encoding at iucr.org
>       http://scripts.iucr.org/mailman/listinfo/cif2-encoding
> 
> 
> 
>
>       --
>       T +61 (02) 9717 9907
>       F +61 (02) 9717 3145
>       M +61 (04) 0249 4148
> 
> 
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
> 
> 
> 
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>

