[Cif2-encoding] A new(?) compromise position

Wed Sep 29 15:42:20 BST 2010

You won't be surprised to hear my support for this - especially if you've read 
recent exchanges between Herbert and me regarding compromise.

Go for it :-)

________________________________
From: James Hester <jamesrhester at gmail.com>
To: Group for discussing encoding and content validation schemes for CIF2 
<cif2-encoding at iucr.org>
Sent: Wednesday, 29 September, 2010 15:24:45
Subject: [Cif2-encoding] A new(?) compromise position

Here is a newish compromise: 

Encoding: The encoding of CIF2 text streams containing only code points in the 
ASCII range is not specified. CIF2 text streams containing any code points 
outside the ASCII range must be encoded such that the encoding can be reliably 
identified from the file contents.  At present only UTF8 and UTF16 are 
considered to satisfy this constraint.
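
As a rough illustration of what "reliably identified from the file contents" 
could look like in practice, here is a minimal Python sketch. It is not part of 
the proposal: the function name classify_cif2_encoding and its return labels are 
hypothetical, and it assumes that UTF16 is recognised by a leading byte-order 
mark and that anything else must decode cleanly as UTF8.

```python
import codecs


def classify_cif2_encoding(data: bytes) -> str:
    """Classify a CIF2 byte stream under the proposed rule (illustrative only)."""
    # Only bytes below 0x80: the proposal leaves the encoding unspecified,
    # so the stream is handled "as for CIF1".
    if all(b < 0x80 for b in data):
        return "ascii-only"
    # Assumption: a UTF16 stream announces itself with a byte-order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        data.decode("utf-16")  # raises UnicodeDecodeError if malformed
        return "utf-16"
    # Otherwise the stream must be well-formed UTF8 to be acceptable.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("non-ASCII content in an unidentifiable encoding") from exc
    return "utf-8"
```

A stream written in a local 8-bit encoding such as Latin-1 and containing an 
accented character would typically be rejected by the last step, because its 
bytes do not form valid UTF8 sequences.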

Commentary: this is intended to mean that encoding works 'as for CIF1' 
(Proposals 1,2) for files containing only ASCII text, and works as for Proposal 
4 for any other files.  I believe that this allows legacy workflows to operate 
smoothly on CIF2 files (legacy workflows do not process non-ASCII text) but also 
avoids the tower of Babel effect that will ensue if non-ASCII code points are 
encoded using local conventions.  

To explain the thinking further, perhaps I could take another stab at Herbert's 
point of view in my own words.  Herbert (I think correctly) surmises that no 
currently used CIF application explicitly specifies the encoding of its 
input and output files, and so all of them are conceptually working with CIFs in a 
variety of local encodings.  Mandating any encoding for CIF2 would therefore 
force at least some and perhaps most of these applications to change the way 
they read and write text, which is disruptive and obtuse when the system works 
fine as it is.  Proposals 1 and 2 are aimed at avoiding this disruption.

On the other hand, I look at the same situation and see that all this software 
is in fact reading and writing ASCII, because all of these local encodings are 
actually equivalent to ASCII for characters used in CIFs, and I further assert 
that this happy coincidence between encodings is the single reason CIF files are 
easily transferable between different systems.

These two points of view create two different results if the CIF character 
repertoire is extended beyond the ASCII range.  If we allow the current approach 
to encoding to continue, the happy coincidence of encodings ceases to operate 
outside the ASCII range and CIF files are no longer easily interchangeable.  If 
we make explicit the commonality of CIF1 encodings by mandating a common set of 
identifiable encodings, the use of default encodings has to be abandoned with 
accompanying effort from programmers.
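
The "happy coincidence" and its breakdown can be seen in a short Python sketch. 
The helper name encodings_agree and the particular list of encodings are my own 
choices for illustration, standing in for the "local encodings" discussed above.

```python
# A few common encodings standing in for "local" defaults (illustrative choice).
LOCAL_ENCODINGS = ["latin-1", "cp1252", "mac_roman", "utf-8"]


def encodings_agree(text: str, encodings=LOCAL_ENCODINGS) -> bool:
    """Return True if every listed encoding produces identical bytes for text."""
    return len({text.encode(enc) for enc in encodings}) == 1


# ASCII-only CIF text: all of these encodings yield the same bytes.
print(encodings_agree("data_test\n_cell_length_a 5.43\n"))

# One non-ASCII character (A with ring above): the byte sequences now differ.
print(encodings_agree("_note '\u00c5'"))
```

For ASCII-only text the byte streams are identical, which is why such files 
move between systems without trouble; once a single non-ASCII character appears, 
the encodings diverge and a reader can no longer assume what the bytes mean.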

I believe that this latest proposal respects Herbert's concerns as well as mine, 
and is eminently workable as a starting point for going forward.  I'm now off to 
do a sample change and expect unanimous support from all parties when I return 
in an hour's time :)

On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon <bm at iucr.org> wrote:

>I think the crux of the issue is as follows:
>
>[But part of our difficulty is that we are all having separate
>epiphanies, and focusing on five different "cruxes". Clarifying
>the real divergence between our views would be a genuine benefit of
>a Skype conference, to which I have no personal objection.]
>
>In the real world, a need may arise to exchange CIFs constructed in
>non-canonical encodings. ("Canonical" probably means UTF-8 and/or
>UTF-16). Such a need would involve some transcoding strategy.
>
>What is the actual likelihood of that need arising?
>
>I would characterise James's position as "not very, and even less
>if the software written to generate CIFs is constrained to use
>canonical encodings within the standard".
>
>I would characterise the position of the rest of us as "reasonable to
>high, so that we wish to formulate the standard in a way that
>recognises non-canonical encodings and helps to establish or at
>least inform appropriate transcoding strategies". There appear to be
>strong disagreements among us, but in fact there's a lot of common
>ground, and a drafting exercise would probably move us towards a
>consensus.
>
>Do you agree that that is a fair assessment?
>
>If so, we can analyse further: what are the implications of mandating
>a canonical encoding or not if judgement (a) is wrong and if judgement
>(b) is wrong? My feeling is that the world will not end - or even
>change very much - in any case; but it could determine whether we
>need to formulate an optimal transcoding strategy now, or can defer
>it to a later date.
>
>However, if anyone thinks this is just another diversion, I'll drop
>this line of approach so as not to slow things down even more.
>
>Regards
>Brian
>
>
>On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote:
>> John,
>>
>> Now I am totally confused about what you are proposing and agree with Simon
>> that what is needed is for you to state your proposal as the precise wording
>> that you propose to insert and/or change in the current CIF2 change document
>> "5 July 2010: draft of changes to the existing CIF 1.1 specification
>> for public discussion"
>>
>> If I understand your proposal correctly, the _only_ thing you are proposing
>> that differs in any way from my proposed motion is a mandate that a
>> CIF2 conformant reader must be able to read a UTF8 CIF2 file, but
>> that _no_ CIF application would actually be required to provide such
>> code, provided there was some mechanism available to transcode from
>> UTF8 to the local encoding,
>> which does not seem to be a mandate on the conformant CIF2 reader at
>> all, but a requirement for the provision of a portable utility to
>> do that external transcoding.
>>
>> If that is the case, wouldn't it make more sense to just provide that
>> utility than to argue about whether my motion requires somebody to write
>> their own?  Having the utility in hand would avoid having multiple,
>> conflicting interpretations of this input transcoding requirement.
>>
>> If I have read your message correctly, please just write the utility you
>> are proposing.  If I have read your message incorrectly, please
>> write the specification changes you propose for the draft changes
>> in place of the changes in my motion.
>>
>> _This_ is why it was, is, and will remain a good idea to simply have
>> a meeting and talk these things out.
>>
>>
>>
>-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148