[Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .
SIMON WESTRIP
simonwestrip at btinternet.com
Tue Sep 14 17:02:24 BST 2010
Thank you John for your response.
I will state my position in due course (hopefully with more clarity than I
usually employ!)
but in the meantime, I'll briefly answer your question regarding extending the
UTF8/16 set:
Yes, I was thinking of the existing 'UTF family', while also allowing extension
in the future to any encodings that fall within the same class of 'inherently
identifiable' encodings. By 'inherently identifiable' I mean encodings that are
identifiable by e.g. BOM; but
as you explain, this is not appropriate for your proposal.
Cheers
Simon
________________________________
From: "Bollinger, John C" <John.Bollinger at STJUDE.ORG>
To: Group for discussing encoding and content validation schemes for CIF2
<cif2-encoding at iucr.org>
Sent: Tuesday, 14 September, 2010 15:46:10
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .
Simon,
On Tuesday, September 14, 2010 7:20 AM, SIMON WESTRIP wrote:
>I sense some common ground here with my previous post.
I hope so. My proposal is intended as a compromise position, and I hope it will
give all the participants in this discussion enough of what they want that we
can finally come to an agreement.
>The UTF8/16 pair could possibly be extended to any unicode encoding that is
>unambiguously/inherently identifiable?
Did you have any particular other encodings you would put in that category? The
only one(s) I think would qualify are UTF-32 variants, and, to the extent it is
distinct from UTF-16, perhaps UTF-16LE. If we don't tag CIFs with encoding
information (and that's not part of my proposal) then I don't think it safe to
deem encodings that we do not explicitly enumerate as "inherently
identifiable". My proposal intentionally minimizes the list of allowed
encodings (even inclusion of UTF-16 is left open to debate) because (i) having
more than one allowed encoding already requires the UTF-8 only side to yield
some ground, and (ii) having fewer alternatives makes for much simpler
autodetection.
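
To illustrate why a short list of allowed encodings keeps autodetection simple, here is a minimal Python sketch of byte-order-mark sniffing. The helper name sniff_bom and the candidate table are illustrative assumptions, not part of the proposal; UTF-32 is listed only because it is mentioned above as a possible "inherently identifiable" candidate.

```python
import codecs

# Candidate encodings and their byte-order marks. The UTF-32 entries come
# first because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE
# BOM (FF FE). This table is an assumption made for illustration only.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Every entry added to such a table is another case that can collide with or mask a different one, which is the practical argument for keeping the enumerated list short.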
>The 'local' encodings then encompass everything else?
Sort of. "local" is environment-specific. It is what the system's text editors
read and (especially) write by default, what the local Fortran I/O library
expects of a 'formatted' file, what a Java InputStreamReader in that environment
handles correctly when no encoding is explicitly specified to it, etc.
>However, I think we've yet to agree that anything but UTF8 is to be allowed at
>all. We have a draft spec that stipulates UTF8,
>but I infer from this thread that there is scope to relax that restriction.
Um, yes. I think perhaps we've snuck one past you: this entire list
(Cif2-encoding) was split off from the ddlm-group list for the purpose of
discussing that topic, as there are strong opinions on both sides. Brian
administratively subscribed several of the ddlm-group members to this list when
he created it, including you.
>The views seem to range from at least 'leaving the door open'
>in recognition of the variety of encodings available, to advocating that the
>encoding should not be part of the specification at all, and it will be down to
>developers to accommodate/influence user practice.
I think a better characterization of the views on the main CIF representation is
that they range from 'no encoding but UTF-8 should be permitted' to 'all text
conventions must be supported'. We have also discussed a side issue or two,
such as what to do about embedding CIF text in other files, but those seem not
to be very contentious. A central pillar of the multiple conventions camp's
arguments is CIF1's position that CIFs are text files complying with local text
conventions. Many CIF1 users and programmers have relied on that, and therefore
we would like to avoid throwing it out the window. The essential position of the
UTF-8-only camp is that CIF2 must be inherently resistant to misinterpretation,
especially character encoding mismatches.
> I'm in favour of a default encoding or maybe any encoding that is inherently
>identifiable, and providing a means to declare other encodings (however
>untrustworthy the declaration may be, it would at least be available to
>conscientious users/developers), all documented in the spec.
My proposal comes close to making UTF-8 a default encoding, though if UTF-16 is
allowed as well then it would be a viable candidate for that spot. Inasmuch as
these cannot be confused in a CIF context, I don't see the availability of both
as a problem.
My proposal intentionally avoids requiring any kind of tagging, because:
(i) Proponents of the UTF-8-only position have been relatively unreceptive to
tagging as a solution, mainly citing concerns about reliability of encoding tags;
(ii) Avoiding tagging avoids giving any impression that CIF processors are
expected to handle non-native encodings other than UTF-8[/16]; and
(iii) Leaving out tags keeps it simpler.
There is room for some kind of tagging scheme as a supplementary convention or
standard, and with input from James I have advanced 'Scheme B' for this
purpose. You will find discussion of Scheme B in the list archives, especially
among the earliest messages on this (cif2-encoding) list.
>Please forgive me if this summary is off the mark; my conclusion is that there's
>a willingness to accommodate multiple encodings
>in this (albeit very small) group. Given that we are starting from the position
>of having a single encoding (agreed upon after much earlier debate), I cannot
>see us performing a complete U-turn to allow any (potentially unrecognizable)
>encoding as in CIF1, i.e. without some specification of a canonical encoding or
>mechanisms to identify/declare the encoding. On the other hand, I hope to see
>a revised spec that isn't UTF8 only.
Part of my thesis behind the present compromise proposal is that in the context
of any particular computing environment, CIF1 in fact *does not* support every
possible encoding. It supports *only* the local default text conventions. CIF1
allows all encodings only in the sense that for any given encoding there may be
some computing environment, somewhere, for which that encoding is the default --
in that environment, CIF1 supports that encoding.
UTF-8-only would be a complete reversal of CIF1 in the sense that UTF-8 is
generally not the default convention in current environments. Thus, requiring
UTF-8 would demand that CIF2 files comply with NON-native conventions instead of
with native ones. Under ASCII-compatible default conventions, the distinction
appears only when non-ASCII characters appear in a CIF, but I have come to view
that as more of a detriment than an advantage: it would provide fertile ground
for bugs and mistakes.
Instead of such a complete reversal, then, my compromise proposal basically adds
UTF-8 and maybe UTF-16 as allowed encodings, and explicitly specifies that the
only other supported encoding is the local default, whatever that happens to
be. This acknowledges that CIF2 users will have more exposure to text encoding
concerns than CIF1 users do. Herb argues that that is inevitable, and I agree.
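
A rough Python sketch of how a reader might apply a policy of "UTF-8, maybe UTF-16, otherwise the local default" follows. The function name decode_cif_bytes, the order of the checks, and the use of locale.getpreferredencoding() to stand in for "the local default" are all assumptions made for illustration; the proposal itself does not prescribe a detection procedure.

```python
import codecs
import locale


def decode_cif_bytes(data: bytes, local_encoding=None):
    """Decode raw CIF bytes; return (text, encoding_used).

    Assumed order of checks (not taken from the proposal):
      1. a UTF-16 byte-order mark selects UTF-16;
      2. otherwise strict UTF-8 is tried;
      3. otherwise the local default encoding is used.
    """
    if local_encoding is None:
        # Stand-in for "the local default text convention".
        local_encoding = locale.getpreferredencoding(False)

    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return data.decode("utf-16"), "utf-16"

    try:
        # "utf-8-sig" accepts UTF-8 with or without a leading BOM.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(local_encoding), local_encoding
```

For pure-ASCII content all three branches agree under ASCII-compatible local conventions; the branches differ only once non-ASCII characters appear. Text in some local encodings can occasionally also be valid UTF-8, so a check of this kind is a heuristic rather than a guarantee.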
>To get to the point - is there any hope of reaching a compromise?
Scheme B was an attempt to build a compromise, but it doesn't look likely to
succeed in that capacity. I think the proposal to which you just responded is
the best hope for a compromise that so far has been presented. If that or
something like it is not accepted then I'm having trouble seeing where else to
turn.
Regards,
John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
Email Disclaimer: www.stjude.org/emaildisclaimer