[Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Tue Sep 14 13:19:45 BST 2010

I sense some common ground here with my previous post.

Could the UTF8/16 pair be extended to any Unicode encoding that is 
unambiguously/inherently identifiable?
The 'local' encodings would then encompass everything else?

However, I think we've yet to agree that anything but UTF8 is to be allowed at 
all. We have a draft spec that stipulates UTF8, but I infer from this thread 
that there is scope to relax that restriction. The views seem to range from at 
least 'leaving the door open' in recognition of the variety of encodings 
available, to advocating that the encoding should not be part of the 
specification at all, and it will be down to developers to 
accommodate/influence user practice. I'm in favour of a default encoding or 
maybe any encoding that is inherently identifiable, and providing a means to 
declare other encodings (however untrustworthy the declaration may be, it would 
at least be available to conscientious users/developers), all documented in the 
spec.

Please forgive me if this summary is off the mark; my conclusion is that there's 
a willingness to accommodate multiple encodings in this (albeit very small) 
group. Given that we are starting from the position of having a single encoding 
(agreed upon after much earlier debate), I cannot see us performing a complete 
U-turn to allow any (potentially unrecognizable) encoding as in CIF1, i.e. 
without some specification of a canonical encoding or mechanisms to 
identify/declare the encoding. On the other hand, I hope to see a revised spec 
that isn't UTF8 only.

To get to the point - is there any hope of reaching a compromise?

Cheers

Simon

________________________________
From: "Bollinger, John C" <John.Bollinger at STJUDE.ORG>
To: Group for discussing encoding and content validation schemes for CIF2 
<cif2-encoding at iucr.org>
Sent: Monday, 13 September, 2010 19:52:26
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
[...]
>To my mind, the encoding of plain CIF files remains an open issue.  I
>do not view the mechanisms for managing file encoding that are
>provided by current OSs to be sufficiently robust, widespread or
>consistent that we can rely on developers or text editors respecting
>them [...].

I agree that the encoding of plain CIF files remains an open issue.

I confess I find your concerns there somewhat vague, especially to the extent 
that they apply within the confines of a single machine.  Do your concerns 
extend to that level?  If so, can you provide an example or two of what you fear 
might go wrong in that context?

As Herb recently wrote, "Multiple encodings are a fact of life when working with 
text."  CIF2 looks like text, it feels like text, and despite some exotic spice, 
it tastes like text -- even in UTF-8 only form.  We cannot pretend that we're 
dealing with anything other than text.  We need to accept, therefore, that no 
matter what we do, authors and programmers will need to account for multiple 
encodings, one way or another.  The format specification cannot relieve either 
group of that responsibility.

That doesn't necessarily mean, however, that CIF must follow the XML model of 
being self-defining with regard to text encoding.  Given CIF's various uses, we 
gain little of practical value in this area by defining CIF2 as UTF-8 only, and 
perhaps equally little by defining required decorations for expressing random 
encodings.  Moreover, the best reading of CIF1 is that it relies on the *local* 
text conventions, whatever they may be, which is quite a different thing than 
handling all text conventions that might conceivably be employed.

With that being the case, I don't think it needful for CIF2 in any given 
environment to endorse foreign encoding conventions other than UTF-8.  CIF2 
reasonably could endorse UTF-16 as well, though, as that cannot be confused with 
any ASCII-compatible encoding.  Allowing UTF-16 would open up useful 
possibilities both for imgCIF and for future uses not yet conceived.  
Additionally, since CIF is text I still think it important for CIF2 to endorse 
the default text conventions of its operating environment.

Could we agree on those three as allowed encodings?  Consider, given that 
combination of supported alternatives and no extra support from the spec, how 
might various parties deal with the unavoidable encoding issue.  Here are some 
of the more reasonable alternatives I see:

1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:

        Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The 
responsibility to perform any needed transcoding is on the other party.  This is 
just as it might be with UTF-8-only.

        Option b) in addition to supporting UTF-8 and/or UTF-16, support other 
encodings by allowing users to explicitly specify them as part of the 
submission/retrieval process.  The processor / repository would either ensure 
the CIF is properly labeled, or, better, transcode it to UTF-8[/16].  This also 
is just as it might be with UTF-8 only.

2. Programs and Libraries:

        Option a) On input, detect encoding by checking first for UTF-16, 
assuming UTF-8 if not UTF-16, and falling back to default text conventions if a 
UTF-8 decoding error is encountered.  On output, encode as directed by the user 
(among the two/three options), defaulting to the input encoding when that is 
available and feasible.  These would be desirable behaviors even in the UTF-8 
only case, especially in a mixed CIF1/CIF2 environment, but they do exceed 
UTF-8-only requirements.

        Option b) Require input and produce output according to a fixed set of 
conventions (whether local text conventions or UTF-8/16).  The program user is 
responsible for any needed transcoding.  This would be sufficient for the CIF2, 
UTF-8 only case, and is typical in the CIF1 case; those differ, however, in 
which text conventions would be assumed.

3. Users/Authors:
3.1. Creating / editing CIFs
        No change from current practice is needed, but users might choose to 
store CIFs in UTF-8[/16] form.  This is just as it would likely be under UTF-8 
only.

3.2. Transferring CIFs
        Unless an alternative agreement on encoding can be reached by some 
means, the transferor must ensure the CIF is encoded in UTF-8[/16].  This 
differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.

3.3. Receiving CIFs
        The receiver may reasonably demand that the CIF be provided in 
UTF-8[/16] form.  He should *expect* that form unless some alternative agreement 
is established.  Any desired transcoding from UTF-8[/16] to an alternative 
encoding is the user's responsibility.  Again, this is not significantly 
different from the UTF-8 only case.
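
The input side of option 2a above might be sketched as follows in Python.  This 
is only an illustration, not part of any CIF2 draft: the function name 
decode_cif and its local_encoding parameter are hypothetical, and it assumes 
that UTF-16 is recognised by a leading byte-order mark, which the text above 
does not specify.

```python
import codecs
import locale
from typing import Optional, Tuple


def decode_cif(data: bytes, local_encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode CIF bytes in the order given in option 2a (hypothetical helper).

    Returns a (text, encoding_name) pair so that a caller can default to the
    input encoding when writing the file back out.
    """
    # Step 1: check for UTF-16. ASSUMPTION: it is identified by its BOM.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and selects the byte order.
        return data.decode("utf-16"), "utf-16"

    # Step 2: otherwise assume UTF-8 ("utf-8-sig" also drops an optional BOM).
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3: fall back to the default text convention of the environment.
    if local_encoding is None:
        local_encoding = locale.getpreferredencoding(False)
    return data.decode(local_encoding), local_encoding
```

If the fallback decode fails as well, the UnicodeDecodeError propagates to the 
caller, which seems a reasonable outcome for a file that fits none of the three 
conventions.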

A driving force in many of those cases is the well-understood (especially here!) 
fact that different systems cannot be relied upon to share text conventions, 
thus leaving UTF-8[/16] as the only available general-purpose medium of 
exchange.  At the same time, local conventions are not forbidden from use where 
they can be relied upon -- most notably, within the same computer.  Even if 
end-users, as a group, do not appreciate those details, we can ensure via the 
spec that CIF2 implementers do.  That's sufficient.
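
For the transfer case (3.2), the transcoding step that falls to the sender 
could be as small as the sketch below.  The function name transcode_to_utf8 is 
hypothetical, and the sketch assumes the sender already knows the source 
encoding, since nothing in the file itself declares it.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode CIF bytes from a known local encoding to UTF-8.

    ASSUMPTION: source_encoding is supplied by the sender; it is not
    detected here. A wrong guess raises UnicodeDecodeError or, for
    single-byte encodings, silently yields the wrong characters.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# Example: a Latin-1 file containing an accented author name.
latin1_bytes = "_publ_author_name 'Rene\u0301e'".encode("utf-8")  # already UTF-8
legacy_bytes = "_publ_author_name 'Ren\u00e9e'".encode("latin-1")  # local convention
utf8_bytes = transcode_to_utf8(legacy_bytes, "latin-1")
```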

So, if pretty much all my expected behavior under UTF-8[/16]+local is the same 
as it would be under UTF-8-only, then why prefer the former?  Because under 
UTF-8[/16]+local, all the behavior described is conformant to the spec, whereas 
under UTF-8 only, a significant proportion is not.  If the standard adequately 
covers these behaviors then we can expect more uniform support.  Moreover, this 
bears directly on community acceptance of the spec.  If flouting the spec with 
respect to encoding becomes common, then the spec will have failed, at least in 
that area.  Having failed in one area, it is more likely to fail in others.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer

_______________________________________________
cif2-encoding mailing list
cif2-encoding at iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding