[Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .

Tue Sep 14 17:02:24 BST 2010

Thank you John for your response.

I will state my position in due course (hopefully with more clarity than I 
usually employ!), but in the meantime I'll briefly answer your question 
regarding extending the UTF8/16 set:

Yes, I was thinking of the existing 'UTF family', while also allowing extension 
in the future to any encodings that fall within the same class of 'inherently 
identifiable' encodings. By 'inherently identifiable' I mean encodings that are 
identifiable by, e.g., a BOM; but as you explain, this is not appropriate for 
your proposal.

Cheers

Simon

________________________________
From: "Bollinger, John C" <John.Bollinger at STJUDE.ORG>
To: Group for discussing encoding and content validation schemes for CIF2 
<cif2-encoding at iucr.org>
Sent: Tuesday, 14 September, 2010 15:46:10
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. ..  .

Simon,

On Tuesday, September 14, 2010 7:20 AM, SIMON WESTRIP wrote:
>I sense some common ground here with my previous post.

I hope so.  My proposal is intended as a compromise position, and I hope it will 
give all the participants in this discussion enough of what they want that we 
can finally come to an agreement.

>The UTF8/16 pair could possibly be extended to any unicode encoding that is 
>unambiguously/inherently identifiable?

Did you have any particular other encodings you would put in that category?  The 
only one(s) I think would qualify are UTF-32 variants, and, to the extent it is 
distinct from UTF-16, perhaps UTF-16LE.  If we don't tag CIFs with encoding 
information (and that's not part of my proposal) then I don't think it safe to 
deem encodings that we do not explicitly enumerate as "inherently 
identifiable".  My proposal intentionally minimizes the list of allowed 
encodings (even inclusion of UTF-16 is left open to debate) because (i) having 
more than one allowed encoding already requires the UTF-8 only side to yield 
some ground, and (ii) having fewer alternatives makes for much simpler 
autodetection.
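
To illustrate why a short, explicitly enumerated list keeps autodetection 
simple, here is a minimal Python sketch of BOM sniffing over the UTF family 
mentioned above. The function name sniff_bom and the returned labels are 
illustrative assumptions, not part of any CIF2 draft; a file with no BOM is 
simply reported as unidentified.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM, else None.

    Hypothetical helper: the name and labels are assumptions for illustration.
    """
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

Every encoding added to the list is one more byte pattern to test and one more 
chance of a prefix clash like the UTF-32LE/UTF-16LE one handled above.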

>The 'local' encodings then encompass everything else?

Sort of.  "local" is environment-specific.  It is what the system's text editors 
read and (especially) write by default, what the local Fortran I/O library 
expects of a 'formatted' file, what a Java InputStreamReader in that environment 
handles correctly when no encoding is explicitly specified to it, etc.
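
As a rough Python analogue of the Java InputStreamReader example, the sketch 
below asks the runtime which encoding the current environment prefers for text. 
The helper names local_text_encoding and read_as_local_text are assumptions 
made for illustration; locale.getpreferredencoding is a standard-library call, 
and its result is typically what open() falls back to when no encoding is 
given.

```python
import locale


def local_text_encoding() -> str:
    """Name of the encoding this environment prefers for text files."""
    # False: report the current setting without changing the process locale.
    return locale.getpreferredencoding(False)


def read_as_local_text(path: str) -> str:
    """Read a file under the local default text convention (hypothetical helper)."""
    # Passing the encoding explicitly makes the "local" choice visible in code.
    with open(path, encoding=local_text_encoding()) as handle:
        return handle.read()
```

The same file may therefore decode differently on two machines, which is 
exactly what "environment-specific" means here.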

>However, I think we've yet to agree that anything but UTF8 is to be allowed at 
>all. We have a draft spec that stipulates UTF8,
>but I infer from this thread that there is scope to relax that restriction.

Um, yes.  I think perhaps we've snuck one past you: this entire list 
(Cif2-encoding) was split off from the ddlm-group list for the purpose of 
discussing that topic, as there are strong opinions on both sides.  Brian 
administratively subscribed several of the ddlm-group members to this list when 
he created it, including you.

>The views seem to range from at least 'leaving the door open'
>in recognition of the variety of encodings available, to advocating that the 
>encoding should not be part of the specification at all, and it will be down to 
>developers to accommodate/influence user practice.

I think a better characterization of the views on the main CIF representation is 
that they range from 'no encoding but UTF-8 should be permitted' to 'all text 
conventions must be supported'.  We have also discussed a side issue or two, 
such as what to do about embedding CIF text in other files, but those seem not 
to be very contentious.  A central pillar of the multiple conventions camp's 
arguments is CIF1's position that CIFs are text files complying with local text 
conventions.  Many CIF1 users and programmers have relied on that, and therefore 
we would like to avoid throwing it out the window.  The essential position of the 
UTF-8-only camp is that CIF2 must be inherently resistant to misinterpretation, 
especially character encoding mismatches.

> I'm in favour of a default encoding or maybe any encoding that is inherently 
>identifiable, and providing a means to declare other encodings (however 
>untrustworthy the declaration may be, it would at least be available to 
>conscientious users/developers), all documented in the spec.

My proposal comes close to making UTF-8 a default encoding, though if UTF-16 is 
allowed as well then it would be a viable candidate for that spot.  Inasmuch as 
these cannot be confused in a CIF context, I don't see the availability of both 
as a problem.

My proposal intentionally avoids requiring any kind of tagging, because:
(i) Proponents of the UTF-8-only position have been relatively unreceptive to 
tagging as a solution, mainly citing concerns about reliability of encoding tags.
(ii) Avoiding tagging avoids giving any impression that CIF processors are 
expected to handle non-native encodings other than UTF-8[/16].
(iii) Leaving out tags keeps it simpler.

There is room for some kind of tagging scheme as a supplementary convention or 
standard, and with input from James I have advanced 'Scheme B' for this 
purpose.  You will find discussion of Scheme B in the list archives, especially 
among the earliest messages on this (cif2-encoding) list.

>Please forgive me if this summary is off the mark; my conclusion is that there's 
>a willingness to accommodate multiple encodings
>in this (albeit very small) group. Given that we are starting from the position 
>of having a single encoding (agreed upon after much earlier debate), I cannot 
>see us performing a complete U-turn to allow any (potentially unrecognizable) 
>encoding as in CIF1, i.e. without some specification of a canonical encoding or 
>mechanisms to identify/declare the encoding. On the other hand, I hope to see
>a revised spec that isn't UTF8 only.

Part of my thesis behind the present compromise proposal is that in the context 
of any particular computing environment, CIF1 in fact *does not* support every 
possible encoding.  It supports *only* the local default text conventions.  CIF1 
allows all encodings only in the sense that for any given encoding there may be 
some computing environment, somewhere, for which that encoding is the default -- 
in that environment, CIF1 supports that encoding.

UTF-8-only would be a complete reversal of CIF1 in the sense that UTF-8 is 
generally not the default convention in current environments.  Thus, requiring 
UTF-8 would demand that CIF2 files comply with NON-native conventions instead of 
with native ones.  Under ASCII-compatible default conventions, the distinction 
appears only when non-ASCII characters appear in a CIF, but I have come to view 
that as more of a detriment than an advantage: it would provide fertile ground 
for bugs and mistakes.
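
A small Python sketch makes the point concrete. It assumes, purely for 
illustration, that the local default is ISO-8859-1 (latin-1); the sample CIF 
lines and the helper name same_bytes are likewise made up for the example.

```python
# Two sample lines: one pure ASCII, one containing non-ASCII characters.
ASCII_LINE = "_cell_length_a 5.43"
NON_ASCII_LINE = "_chemical_name_common 'Ångström'"


def same_bytes(text: str, enc_a: str = "utf-8", enc_b: str = "latin-1") -> bool:
    """True if text serializes to identical bytes under both encodings.

    latin-1 stands in for an ASCII-compatible local default (an assumption).
    """
    return text.encode(enc_a) == text.encode(enc_b)


def misread(text: str, written: str = "utf-8", read: str = "latin-1") -> str:
    """Decode text with a different encoding from the one used to write it."""
    return text.encode(written).decode(read)
```

Here same_bytes(ASCII_LINE) is true while same_bytes(NON_ASCII_LINE) is false, 
and misread(NON_ASCII_LINE) silently returns altered text rather than raising 
an error. A file can thus pass every test until the first non-ASCII character 
shows up.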

Instead of such a complete reversal, then, my compromise proposal basically adds 
UTF-8 and maybe UTF-16 as allowed encodings, and explicitly specifies that the 
only other supported encoding is the local default, whatever that happens to 
be.  This acknowledges that CIF2 users will have more exposure to text encoding 
concerns than CIF1 users do.  Herb argues that that is inevitable, and I agree.
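
One way a reader might act on such a rule is sketched below in Python. This is 
only an assumption about how a processor could order its checks, not something 
the proposal itself prescribes: the function name decode_cif, the check order, 
and the local_encoding parameter are all hypothetical.

```python
import codecs
import locale


def decode_cif(data: bytes, local_encoding=None):
    """Return (text, encoding_label) under a three-way rule (illustrative only).

    Assumed order: UTF-16 if a UTF-16 BOM is present, then strict UTF-8,
    then the local default text encoding.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and picks the byte order from it.
        return data.decode("utf-16"), "utf-16"
    try:
        # "utf-8-sig" accepts UTF-8 with or without a leading BOM.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Fall back to whatever this environment treats as its text default.
    fallback = local_encoding or locale.getpreferredencoding(False)
    return data.decode(fallback), fallback
```

Note the limits of the sketch: locally encoded bytes that happen to form valid 
UTF-8 would be taken as UTF-8, and a failure in the final step still raises an 
error for the caller to handle.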

>To get to the point - is there any hope of reaching a compromise?

Scheme B was an attempt to build a compromise, but it doesn't look likely to 
succeed in that capacity.  I think the proposal to which you just responded is 
the best hope for a compromise that so far has been presented.  If that or 
something like it is not accepted then I'm having trouble seeing where else to 
turn.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
