[Cif2-encoding] Splitting of imgCIF and other sub-topics ...

Herbert J. Bernstein yaya at bernstein-plus-sons.com
Wed Sep 15 14:42:19 BST 2010


Dear Colleagues,

   1.  For a Mac under OSX, I use cyclone for conversion of encodings.

   2.  No hash scheme will survive random trips through random editors
or random systems.

   3.  Embedded strings of characters (e.g. the five accented o's, or more)
will also undergo strange transformations, but they will be easier to
deal with without a lot of external software support.

   4.  There is no way to make a "pure ASCII version" of a general UTF-8
file without adopting some reserved character strings at the lexical
level -- \U... or &#...; or some such, as used in many other systems --
but with such an extension it is easy (see the sketch below).

   5.  We can keep going on this forever -- we need to make some decisions.
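
   As an illustration of point 4, a minimal sketch in Python -- the
\uXXXX reserved string here is only an example convention, not anything
CIF2 has adopted:

    import re

    def to_ascii(text):
        # escape each non-ASCII character as \uXXXX (BMP only, for
        # brevity; a literal \u already in the text would need its
        # own escape in a real scheme)
        return "".join(ch if ord(ch) < 128 else "\\u%04X" % ord(ch)
                       for ch in text)

    def from_ascii(text):
        # reverse the escaping
        return re.sub(r"\\u([0-9A-Fa-f]{4})",
                      lambda m: chr(int(m.group(1), 16)), text)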

   Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya at dowling.edu
=====================================================

On Wed, 15 Sep 2010, Brian McMahon wrote:

> I have said little or nothing on this list so far, because I'm
> not sure that I can add anything that's of concrete use. I've read
> the many contributions, all of them carefully thought through, and
> I still see both sides (actually, all sides) of the arguments. I
> am disinterested in the eventual outcome (but not "uninterested").
>
> But, whatever the outcome, the IUCr will undoubtedly receive files
> *intended* by the authors as CIF submissions, which come in a variety of
> character-set encodings. For the most part, we will want to accept
> these without asking the author what the encoding was, not least
> because the typical author will have no idea (and increasingly,
> our typical author will struggle to understand the questions we are
> posing since English is not his or her native language - or perhaps we
> will struggle to understand the reply).
>
> So my concerns are:
>
> (1) how easily can we determine the correct encoding with which the
> file was generated;
>
> (2) how easily can we convert it into our canonical encoding(s) for
> in-house production, archiving and delivery?
>
> First a few comments on that "canonical encoding(s)". Simon and I have
> both been happy enough to consider UTF-8 as a lingua franca, since we
> perceive it as a reasonably widespread vehicle for carrying a large
> (multilingual) character set, and one that is widely supported by many
> generic text processors and platforms. However, many of our existing
> CIF applications may choke on a UTF-8 file, and we may need to
> create working formats that are pure ASCII. I would also prefer to
> retain a single archival version of a CIF (well, ideally several
> identical copies for redundancy, but nonetheless a single *version*),
> from which alternative encodings that we choose to support for
> delivery from the archive can be generated on the fly.
>
> So, really, the desire would be to have standalone applications that
> can convert between character encodings on the fly. Does anyone know
> of the general availability of such tools? The more reliable
> conversions can be made, the more relaxed we can be about accepting
> multiple input encodings. I have to say that a very quick Google
> search hasn't yet thrown up much encouragement here.
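>
> (For what it's worth, once the source encoding *is* known, the
> mechanical conversion is trivial -- e.g. this hypothetical Python
> one-shot, with "cp1252" standing in for whatever encoding has been
> determined:
>
>    # transcode a file to UTF-8; the source encoding must be known
>    with open("in.cif", encoding="cp1252") as src, \
>         open("out.cif", "w", encoding="utf-8") as dst:
>        dst.write(src.read())
>
> so the hard part is really question (1), determining the encoding.)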
>
> Now, back to (1). In similar vein, do you know of any standalone
> utilities that help in determining a text-file character encoding?
>
> [I'm happy to be educated, ideally off-list, in whether
> Content-Encoding negotiation in web forms can help here, since many
> of our CIF submissions come by that route, but I'm more interested in
> the general question of how you determine the encoding of a text file
> that you just happen to find sitting on the filesystem.]
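>
> In the worst case I suppose one could hand-roll a rough first pass --
> BOM sniffing plus a trial UTF-8 decode, sketched here in Python; the
> caveat is that a file failing both tests is in some 8-bit encoding
> that the bytes alone cannot identify:
>
>    import codecs
>
>    def guess_encoding(path):
>        raw = open(path, "rb").read()
>        # test UTF-32 BOMs before UTF-16: FF FE 00 00 begins with FF FE
>        for bom, name in ((codecs.BOM_UTF32_LE, "utf-32-le"),
>                          (codecs.BOM_UTF32_BE, "utf-32-be"),
>                          (codecs.BOM_UTF8,     "utf-8-sig"),
>                          (codecs.BOM_UTF16_LE, "utf-16-le"),
>                          (codecs.BOM_UTF16_BE, "utf-16-be")):
>            if raw.startswith(bom):
>                return name
>        try:
>            raw.decode("utf-8")
>            return "utf-8"          # valid UTF-8 (or pure ASCII)
>        except UnicodeDecodeError:
>            return None             # unidentified 8-bit encoding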
>
> One utility we use heavily in the submission system is "file"
> (http://freshmeat.net/projects/file - we currently use version 4.26
> with an augmented and slightly modified magic file). This is rather
> quiet about different character encodings, though I notice the magic
> file distributed with the more recent version 5.04 does have a
> "Unicode" section, namely:
>
>    #------------------------------------------------------------------------------
>    # $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $
>    # Unicode:  BOM prefixed text files - Adrian Havill <havill at turbolinux.co.jp>
>    # GRR: These types should be recognised in file_ascmagic so these
>    # encodings can be treated by text patterns.
>    # Missing types are already dealt with internally.
>    #
>    0       string  +/v8                    Unicode text, UTF-7
>    0       string  +/v9                    Unicode text, UTF-7
>    0       string  +/v+                    Unicode text, UTF-7
>    0       string  +/v/                    Unicode text, UTF-7
>    0       string  \335\163\146\163        Unicode text, UTF-8-EBCDIC
>    0       string  \376\377\000\000        Unicode text, UTF-32, big-endian
>    0       string  \377\376\000\000        Unicode text, UTF-32, little-endian
>    0       string  \016\376\377            Unicode text, SCSU (Standard Compression Scheme for Unicode)
>
> Interestingly, the "animation" module of this new magic file
> conflicts with other possible UTF encodings:
>
>    # MPA, M1A
>    # updated by Joerg Jenderek
>    # GRR the original test are too common for many DOS files, so test 32 <= kbits <= 448
>    # GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE)
>    # FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries
>
>
> And, by the way, the "augmented" magic file we use (the one distributed as
> part of the KDE desktop distribution) already includes this section:
>
>    # chemical/x-cif 50
>    0	string	#\#CIF_1.1
>    >10	byte	9	chemical/x-cif
>    >10	byte	10	chemical/x-cif
>    >10	byte	13	chemical/x-cif
>
>
>
> It seems to me that without some reasonably reliable discriminator,
> John's endorsement of support for "local" encodings will allow files
> to leak out into the wider world where they cannot easily be
> handled or even properly identified. (Though, as many have argued
> persuasively, "forbidding" them is not going to prevent such files
> from being created, and possibly even used fruitfully within local
> environments.)
>
> Remember that many CIFs will come to us in the end after passage across
> many heterogeneous systems. I referred in a previous post to my own
> daily working environment - Solaris, Linux and Windows systems linked
> by a variety of X servers, X emulators, NFS and SMB cross-mounted
> filesystems, clipboards communicating with diverse applications
> and OSes running different default locales...
> [Incidentally, hasn't SMB now been superseded by "CIFS"!]
>
> Perhaps I'm just perverse, but I doubt that I'm unique. We'll
> also see files shuttled between co-authors with different languages,
> locales, OSes, and exchanged via email, ftp, USB stick etc.
> "Corruptions" will inevitably be introduced in these interchanges -
> sometimes subtle ones. For example, outside the CIF world altogether,
> we see Greek characters change their identity when we run some files
> through a PDF -> PostScript -> PDF cycle (all using software from the
> same software house, Adobe). The reason has to do with differences in
> Windows and Mac encodings, and the failure of the Acrobat software to
> track and maintain the character mappings through such a cycle.
>
> Well, I'll stop here, because in spite of my best intentions I don't
> think I'm moving the debate along very much, and I apologise if
> everything here has already been so obvious as not to need saying.
>
> I'll defer further comment until I've learned if there are already
> standard text-encoding identifiers and transcoders.
>
> Regards
> Brian
> _________________________________________________________________________
> Brian McMahon                                       tel: +44 1244 342878
> Research and Development Officer                    fax: +44 1244 314888
> International Union of Crystallography            e-mail:  bm at iucr.org
> 5 Abbey Square, Chester CH1 2HU, England
>
>
> On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote:
>> One, hopefully relevant, aside -- ASCII files are not as
>> unambiguous as one might think.  Depending on what localization
>> one has on one's computer, the code point 0x5C (one of the
>> first 128 characters) will be shown as a reverse solidus, a yen
>> currency symbol or a won currency symbol.  This is a holdover
>> from the days of national variants of the ISO 646 character set,
>> and shows no signs of going away any time soon.
>>
>> This is _not_ the only such case, but it is one that impacts
>> most programming languages, including dREL, and existing CIF
>> files, including the PDB's mmCIF files.
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya at dowling.edu
>> =====================================================
>>
>> On Tue, 14 Sep 2010, Herbert J. Bernstein wrote:
>>
>>> Dear Colleagues,
>>>
>>>  To avoid any misunderstandings, rather than worrying about how
>>> we got to where we are, let us each just state a clear position.
>>> Here is mine:
>>>
>>>  I favor CIF2 being stated in terms of UTF-8 for clarity, but
>>> not specifying any particular _mandatory_ encoding of a CIF2 file
>>> as long as there is a clearly agreed mechanism between the
>>> creator and consumer of a given CIF2 file as to how to faithfully
>>> transform the file between the creator's and the consumer's encodings.
>>>
>>>  I favor UTF-8 being the default encoding that any CIF2 creator
>>> should feel free to use without having to establish any prior
>>> agreement with consumers, and which all consumers should try
>>> to make arrangements to be able to read, either directly or
>>> via some conversion utility or service.  If the consumers don't
>>> make such arrangements then there may be CIF2 files that they
>>> will not be able to read.  If a producer creates a CIF2 in any
>>> encoding other than UTF8 then there may be consumers who have
>>> difficulty reading that CIF2.
>>>
>>>  I favor the IUCr taking responsibility for collecting and
>>> disseminating information on particularly useful ways to go
>>> to and from UTF8 and/or other popular encodings.
>>>
>>>  Regards,
>>>    Herbert
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>>   Dowling College, Kramer Science Center, KSC 121
>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                 +1-631-244-3035
>>>                 yaya at dowling.edu
>>> =====================================================
>>>
>>> On Tue, 14 Sep 2010, SIMON WESTRIP wrote:
>>>
>>>> I sense some common ground here with my previous post.
>>>>
>>>> The UTF8/16 pair could possibly be extended to any Unicode
>>>> encoding that is unambiguously/inherently identifiable?
>>>> The 'local' encodings then encompass everything else?
>>>>
>>>> However, I think we've yet to agree that anything but UTF8 is to
>>>> be allowed at all. We have a draft spec that stipulates UTF8, but
>>>> I infer from this thread that there is scope to relax that
>>>> restriction. The views seem to range from at least 'leaving the
>>>> door open' in recognition of the variety of encodings available,
>>>> to advocating that the encoding should not be part of the
>>>> specification at all, leaving it down to developers to
>>>> accommodate/influence user practice. I'm in favour of a default
>>>> encoding, or maybe any encoding that is inherently identifiable,
>>>> together with a means to declare other encodings (however
>>>> untrustworthy the declaration may be, it would at least be
>>>> available to conscientious users/developers), all documented in
>>>> the spec.
>>>>
>>>> Please forgive me if this summary is off the mark; my conclusion
>>>> is that there's a willingness in this (albeit very small) group to
>>>> accommodate multiple encodings. Given that we are starting from
>>>> the position of having a single encoding (agreed upon after much
>>>> earlier debate), I cannot see us performing a complete U-turn to
>>>> allow any (potentially unrecognizable) encoding as in CIF1, i.e.
>>>> without some specification of a canonical encoding or mechanisms
>>>> to identify/declare the encoding. On the other hand, I hope to see
>>>> a revised spec that isn't UTF8-only.
>>>>
>>>> To get to the point - is there any hope of reaching a compromise?
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>>
>>>> ____________________________________________________________________________
>>>> From: "Bollinger, John C" <John.Bollinger at STJUDE.ORG>
>>>> To: Group for discussing encoding and content validation schemes for CIF2
>>>> <cif2-encoding at iucr.org>
>>>> Sent: Monday, 13 September, 2010 19:52:26
>>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics ...
>>>>
>>>>
>>>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
>>>> [...]
>>>>> To my mind, the encoding of plain CIF files remains an open issue.  I
>>>>> do not view the mechanisms for managing file encoding that are
>>>>> provided by current OSs to be sufficiently robust, widespread or
>>>>> consistent that we can rely on developers or text editors respecting
>>>>> them [...].
>>>>
>>>> I agree that the encoding of plain CIF files remains an open issue.
>>>>
>>>> I confess I find your concerns there somewhat vague, especially to the
>>>> extent that they apply within the confines of a single machine.  Do your
>>>> concerns extend to that level?  If so, can you provide an example
>>>> or two of what you fear might go wrong in that context?
>>>>
>>>> As Herb recently wrote, "Multiple encodings are a fact of life
>>>> when working with text."  CIF2 looks like text, it feels like
>>>> text, and despite some
>>>> exotic spice, it tastes like text -- even in UTF-8 only form.  We cannot
>>>> pretend that we're dealing with anything other than text.  We need to
>>>> accept, therefore, that no matter what we do, authors and programmers will
>>>> need to account for multiple encodings, one way or another.  The format
>>>> specification cannot relieve either group of that responsibility.
>>>>
>>>> That doesn't necessarily mean, however, that CIF must follow the
>>>> XML model of being self-defining with regard to text encoding.
>>>> Given CIF's various uses, we gain little of practical value in
>>>> this area by defining CIF2 as UTF-8 only, and perhaps equally
>>>> little by defining required decorations for expressing random
>>>> encodings.  Moreover, the best reading of CIF1 is that it relies
>>>> on the *local* text conventions, whatever they may be, which is
>>>> quite a different thing than handling all text conventions that
>>>> might conceivably be employed.
>>>>
>>>> With that being the case, I don't think it needful for CIF2 in
>>>> any given environment to endorse foreign encoding conventions
>>>> other than UTF-8.  CIF2 reasonably could endorse UTF-16 as well,
>>>> though, as that cannot be confused with any ASCII-compatible
>>>> encoding.  Allowing UTF-16 would open up useful possibilities both
>>>> for imgCIF and for future uses not yet conceived.  Additionally,
>>>> since CIF is text, I still think it important for CIF2 to endorse
>>>> the default text conventions of its operating environment.
>>>>
>>>> Could we agree on those three as allowed encodings?  Consider, given that
>>>> combination of supported alternatives and no extra support from the spec,
>>>> how various parties might deal with the unavoidable encoding issue.  Here
>>>> are some of the more reasonable alternatives I see:
>>>>
>>>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>>>>
>>>> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The
>>>> responsibility to perform any needed transcoding is on the other party. 
>>>> This is just as it might be with UTF-8-only.
>>>>
>>>> Option b) in addition to supporting UTF-8 and/or UTF-16, support
>>>> other encodings by allowing users to explicitly specify them as
>>>> part of the submission/retrieval process.  The processor /
>>>> repository would either ensure the CIF is properly labeled, or,
>>>> better, transcode it to UTF-8[/16].  This also is just as it might
>>>> be with UTF-8 only.
>>>>
>>>> 2. Programs and Libraries:
>>>>
>>>> Option a) On input, detect encoding by checking first for
>>>> UTF-16, assuming UTF-8 if not UTF-16, and falling back to default
>>>> text conventions if a UTF-8 decoding error is encountered.  On
>>>> output, encode as directed by the user (among the two/three
>>>> options), defaulting to the input encoding when that is available
>>>> and feasible.  These would be desirable behaviors even in the
>>>> UTF-8 only case, especially in a mixed CIF1/CIF2 environment, but
>>>> they do exceed UTF-8-only requirements.  (A sketch of this input
>>>> path follows option b below.)
>>>>
>>>> Option b) Require input and produce output according to a fixed set
>>>> of conventions (whether local text conventions or UTF-8/16).  The program
>>>> user is responsible for any needed transcoding.  This would be sufficient
>>>> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those
>>>> differ, however, in which text conventions would be assumed.
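>>>>
>>>> A minimal sketch of the option a input path -- in Python, taking
>>>> the platform default as the fallback text conventions; the helper
>>>> name is illustrative only, not a prescription:
>>>>
>>>>    import codecs, locale
>>>>
>>>>    def read_cif2_text(path):
>>>>        raw = open(path, "rb").read()
>>>>        if raw.startswith(codecs.BOM_UTF16_LE) or \
>>>>           raw.startswith(codecs.BOM_UTF16_BE):
>>>>            return raw.decode("utf-16")   # the BOM fixes byte order
>>>>        try:
>>>>            return raw.decode("utf-8")
>>>>        except UnicodeDecodeError:
>>>>            # fall back to the local text conventions
>>>>            return raw.decode(locale.getpreferredencoding())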
>>>>
>>>> 3. Users/Authors:
>>>> 3.1. Creating / editing CIFs
>>>> No change from current practice is needed, but users might
>>>> choose to store CIFs in UTF-8[/16] form.  This is just as it would
>>>> likely be under
>>>> UTF-8 only.
>>>>
>>>> 3.2. Transferring CIFs
>>>> Unless an alternative agreement on encoding can be reached by some
>>>> means, the transferor must ensure the CIF is encoded in UTF-8[/16].  This
>>>> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe)
>>>> allowed.
>>>>
>>>> 3.3. Receiving CIFs
>>>> The receiver may reasonably demand that the CIF be provided in
>>>> UTF-8[/16] form.  He should *expect* that form unless some alternative
>>>> agreement is established.  Any desired transcoding from UTF-8[/16] to an
>>>> alternative encoding is the user's responsibility.  Again, this is not
>>>> significantly different from the UTF-8 only case.
>>>>
>>>>
>>>> A driving force in many of those cases is the well-understood (especially
>>>> here!) fact that different systems cannot be relied upon to share text
>>>> conventions, thus leaving UTF-8[/16] as the only available general-purpose
>>>> medium of exchange.  At the same time, local conventions are not forbidden
>>>> from use where they can be relied upon -- most notably, within the same
>>>> computer.  Even if end-users, as a group, do not appreciate those details,
>>>> we can ensure via the spec that CIF2 implementers do.  That's sufficient.
>>>>
>>>> So, if pretty much all my expected behavior under UTF-8[/16]+local is the
>>>> same as it would be under UTF-8-only, then why prefer the former?  Because
>>>> under UTF-8[/16]+local, all the behavior described is conformant to the
>>>> spec, whereas under UTF-8 only, a significant proportion is not.  If the
>>>> standard adequately covers these behaviors then we can expect more uniform
>>>> support.  Moreover, this bears directly on community acceptance of the
>>>> spec.  If flouting the spec with respect to encoding becomes common, then
>>>> the spec will have failed, at least in that area.  Having failed in one
>>>> area, it is more likely to fail in others.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> John
>>>> --
>>>> John C. Bollinger, Ph.D.
>>>> Department of Structural Biology
>>>> St. Jude Children's Research Hospital
>>>>
>>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding at iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>

