[Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
SIMON WESTRIP
simonwestrip at btinternet.com
Tue Sep 14 13:19:45 BST 2010
I sense some common ground here with my previous post.
The UTF8/16 pair could possibly be extended to any Unicode encoding that is
unambiguously/inherently identifiable?
The 'local' encodings then encompass everything else?
However, I think we've yet to agree that anything but UTF8 is to be allowed at
all. We have a draft spec that stipulates UTF8, but I infer from this thread
that there is scope to relax that restriction. The views seem to range from at
least 'leaving the door open' in recognition of the variety of encodings
available, to advocating that the encoding should not be part of the
specification at all, and it will be down to developers to accommodate/influence
user practice. I'm in favour of a default encoding or maybe any encoding that is
inherently identifiable, and providing a means to declare other encodings
(however untrustworthy the declaration may be, it would at least be available to
conscientious users/developers), all documented in the spec.
Please forgive me if this summary is off the mark; my conclusion is that there's
a willingness to accommodate multiple encodings in this (albeit very small)
group. Given that we are starting from the position of having a single encoding
(agreed upon after much earlier debate), I cannot see us performing a complete
U-turn to allow any (potentially unrecognizable) encoding as in CIF1, i.e.
without some specification of a canonical encoding or mechanisms to
identify/declare the encoding. On the other hand, I hope to see a revised spec
that isn't UTF8 only.
To get to the point - is there any hope of reaching a compromise?
Cheers
Simon
________________________________
From: "Bollinger, John C" <John.Bollinger at STJUDE.ORG>
To: Group for discussing encoding and content validation schemes for CIF2
<cif2-encoding at iucr.org>
Sent: Monday, 13 September, 2010 19:52:26
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
[...]
>To my mind, the encoding of plain CIF files remains an open issue. I
>do not view the mechanisms for managing file encoding that are
>provided by current OSs to be sufficiently robust, widespread or
>consistent that we can rely on developers or text editors respecting
>them [...].
I agree that the encoding of plain CIF files remains an open issue.
I confess I find your concerns there somewhat vague, especially to the extent
that they apply within the confines of a single machine. Do your concerns
extend to that level? If so, can you provide an example or two of what you fear
might go wrong in that context?
As Herb recently wrote, "Multiple encodings are a fact of life when working with
text." CIF2 looks like text, it feels like text, and despite some exotic spice,
it tastes like text -- even in UTF-8 only form. We cannot pretend that we're
dealing with anything other than text. We need to accept, therefore, that no
matter what we do, authors and programmers will need to account for multiple
encodings, one way or another. The format specification cannot relieve either
group of that responsibility.
That doesn't necessarily mean, however, that CIF must follow the XML model of
being self-defining with regard to text encoding. Given CIF's various uses, we
gain little of practical value in this area by defining CIF2 as UTF-8 only, and
perhaps equally little by defining required decorations for expressing arbitrary
encodings. Moreover, the best reading of CIF1 is that it relies on the *local*
text conventions, whatever they may be, which is quite a different thing than
handling all text conventions that might conceivably be employed.
With that being the case, I don't think it needful for CIF2 in any given
environment to endorse foreign encoding conventions other than UTF-8. CIF2
reasonably could endorse UTF-16 as well, though, as that cannot be confused with
any ASCII-compatible encoding. Allowing UTF-16 would open up useful
possibilities both for imgCIF and for future uses not yet conceived.
Additionally, since CIF is text I still think it important for CIF2 to endorse
the default text conventions of its operating environment.
Could we agree on those three as allowed encodings? Consider, given that
combination of supported alternatives and no extra support from the spec, how
might various parties deal with the unavoidable encoding issue. Here are some
of the more reasonable alternatives I see:
1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
   Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The
      responsibility to perform any needed transcoding is on the other party.
      This is just as it might be with UTF-8-only.
   Option b) in addition to supporting UTF-8 and/or UTF-16, support other
      encodings by allowing users to explicitly specify them as part of the
      submission/retrieval process. The processor / repository would either
      ensure the CIF is properly labeled, or, better, transcode it to
      UTF-8[/16]. This also is just as it might be with UTF-8 only.
2. Programs and Libraries:
   Option a) On input, detect encoding by checking first for UTF-16,
      assuming UTF-8 if not UTF-16, and falling back to default text
      conventions if a UTF-8 decoding error is encountered. On output, encode
      as directed by the user (among the two/three options), defaulting to the
      input encoding when that is available and feasible. These would be
      desirable behaviors even in the UTF-8 only case, especially in a mixed
      CIF1/CIF2 environment, but they do exceed UTF-8-only requirements.
   Option b) Require input and produce output according to a fixed set of
      conventions (whether local text conventions or UTF-8/16). The program
      user is responsible for any needed transcoding. This would be sufficient
      for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those
      differ, however, in which text conventions would be assumed.
3. Users/Authors:
   3.1. Creating / editing CIFs
      No change from current practice is needed, but users might choose to
      store CIFs in UTF-8[/16] form. This is just as it would likely be under
      UTF-8 only.
   3.2. Transferring CIFs
      Unless an alternative agreement on encoding can be reached by some
      means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This
      differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe)
      allowed.
   3.3. Receiving CIFs
      The receiver may reasonably demand that the CIF be provided in
      UTF-8[/16] form. He should *expect* that form unless some alternative
      agreement is established. Any desired transcoding from UTF-8[/16] to an
      alternative encoding is the user's responsibility. Again, this is not
      significantly different from the UTF-8 only case.
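For concreteness, the input-side detection described under 2, option a) might
look roughly like the Python sketch below. It is only an illustration under
stated assumptions, not a proposed reference implementation: the message does
not say how UTF-16 would be recognised, so the sketch assumes a leading
byte-order mark, and the names decode_cif_bytes and local_encoding are made up
for the example.

```python
import codecs
import locale


def decode_cif_bytes(data, local_encoding=None):
    """Return (text, encoding_name) for the raw bytes of a CIF.

    Checks are made in the order given in option 2a: UTF-16 first,
    then UTF-8, then the local default text conventions.
    """
    # Assumption: UTF-16 input announces itself with a byte-order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to choose the byte order
        # and removes it from the decoded text.
        return data.decode("utf-16"), "utf-16"
    try:
        # "utf-8-sig" decodes plain UTF-8 and drops an optional BOM.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the local text conventions.
        if local_encoding is None:
            local_encoding = locale.getpreferredencoding(False)
        # A failure here is left to propagate to the caller.
        return data.decode(local_encoding), local_encoding
```

A caller could keep the returned encoding name and reuse it as the default
when writing the file back out, unless the user asks for UTF-8 or UTF-16
instead. Note that text in a local encoding that happens to be valid UTF-8
will be read as UTF-8; the fallback only triggers on a decoding error.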
A driving force in many of those cases is the well-understood (especially here!)
fact that different systems cannot be relied upon to share text conventions,
thus leaving UTF-8[/16] as the only available general-purpose medium of
exchange. At the same time, local conventions are not forbidden from use where
they can be relied upon -- most notably, within the same computer. Even if
end-users, as a group, do not appreciate those details, we can ensure via the
spec that CIF2 implementers do. That's sufficient.
So, if pretty much all my expected behavior under UTF-8[/16]+local is the same
as it would be under UTF-8-only, then why prefer the former? Because under
UTF-8[/16]+local, all the behavior described is conformant to the spec, whereas
under UTF-8 only, a significant proportion is not. If the standard adequately
covers these behaviors then we can expect more uniform support. Moreover, this
bears directly on community acceptance of the spec. If flouting the spec with
respect to encoding becomes common, then the spec will have failed, at least in
that area. Having failed in one area, it is more likely to fail in others.
Regards,
John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
Email Disclaimer: www.stjude.org/emaildisclaimer