[Cif2-encoding] Addressing Brian's concerns

James Hester jamesrhester at gmail.com
Tue Sep 28 08:05:49 BST 2010


In this email I address Brian's comments; his email is reproduced in
full at the end for reference.  To save reading the whole email: you may
download 'Juffed', a Qt-based editor, and cut and paste between
foreign-language web pages in your browser and Juffed to see
multiple-encoding cut and paste in action.  There is thus no impediment
to publCIF operating in a multiple-encoding environment.

Brian writes:

I sympathise greatly with James's desire for a prescriptive, "binary"
approach, but its corollary is that a CIF application must take full
responsibility for expressing any supported extended character set (I
mean accented Latin letters, Greek characters, Cyrillic or Chinese
alphabets).

This is not correct.  A typical application at minimum will only need to
parse the CIF file such that each tag has an associated string value.  What
higher levels of the application do with those tags and strings is
application-dependent.  The only applications that need to worry about
actually displaying glyphs are those concerned with text display.  Most
Unicode-aware software (e.g. web browsers) simply displays a default
character when it cannot render a particular glyph.  The code point does
not disappear; it is simply not displayed.  So I do not see a requirement
to display all of Unicode as a valid criticism of the 'binary' approach,
because displaying all of Unicode is not in fact a requirement.

First off, I don't know how difficult that is technically. I would
guess that rather than trying to handle arbitrary keyboard mappings,
the natural approach would be to pick from a graphical character
grid. (What are the implications for this of glyph rendering - does
a CIF editor have to be compiled with its own large font library?)

If I understand correctly, this paragraph is not relevant if there is no
implicit requirement to display all of Unicode.  I will just add that there
is no need to choose a less comprehensive encoding just because you don't
want to display characters outside a certain range.  The various mappings
involved in text display are all decoupled.  Just to list them for clarity:

(1) Mapping from keyboard code to code point;
(2) Mapping from code point to on-disk binary representation (this is the
encoding we have been discussing);
(3) Mapping from code point to font coordinate (display glyph).

You can restrict the set of display glyphs without changing the underlying
file encoding.  For example, a CIF-aware application for handling CIF2 text
could bundle different language packs as Adobe does for PDF.
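The independence of mapping (2) from mapping (3) is easy to demonstrate in a few lines of Python (an illustration using present-day tooling, not part of any CIF software):

```python
# One code point, three on-disk byte representations; only mapping (2)
# changes, while the character itself is unaffected.
s = "\u00e9"  # 'é', LATIN SMALL LETTER E WITH ACUTE

assert s.encode("utf-8") == b"\xc3\xa9"      # two bytes
assert s.encode("utf-16-le") == b"\xe9\x00"  # two different bytes
assert s.encode("latin-1") == b"\xe9"        # one byte

# Decoding with the matching encoding always recovers the same code point,
# whatever the file encoding was; whether any glyph is then drawn for it
# is a separate decision belonging to mapping (3).
for enc in ("utf-8", "utf-16-le", "latin-1"):
    assert s.encode(enc).decode(enc) == s
```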

But that's a laborious method of authoring if relatively large amounts
of "non-standard" text are involved, and the way that authors would
prefer to work, surely, is by copying and pasting text from Word or
some other tool of choice. Permitting that necessarily pollutes the
"binary" approach with byte streams delivered by text-oriented
applications.

I agree that authors would probably prefer to cut and paste.  Does this
pollute the 'binary' approach more than the 'any encoding' approach?  How
much of the cut-and-pastability of CIF1 text is due to the commonality of
ASCII encoding for the CIF text codepoints, and how much to magic CIF1 pixie
dust that translates the encoding of the cut text to that of the text
document it is pasted into?  Anyway, more about cutting and pasting is
included below.

If I could be sure that publCIF, say, can be compiled with libraries
that reliably transcode byte streams imported from clipboards and
file import (across the mess of SMB/NFS mounts etc. that exist in
the real world) - and equally reliably transcode its UTF8 encoded text
to the author's locale-based clipboard, then I'd be more willing to
promote option 3 to the top as the starting point at least for CIF
2.0 (but its "enforcement" does depend on the availability of such a
robust CIF-editing tool).

Short form of my answer: publCIF will be able to work well under both
proposals as it is interactive and uses Qt.  Long form:

You don't even need any new libraries for this! Qt (upon which publCIF is
built) aims to do it all transparently for you. Let me be your software
architect.  Assume publCIF always handles UTF8 text as per my preferred
option.

Cut and paste (clipboard import): When text is imported from the QClipboard
(an abstraction of the system clipboard) into publCIF, publCIF should always
request QMimeData, which will return an object containing the text and the
encoding of the text.  Other standard Qt text transcoding functions can then
be applied to convert text from one encoding to the target encoding.  I
estimate about ten lines of new code.  The other direction is even more of
a no-brainer: publCIF need only set the encoding of its source text in the
QMimeData object it passes to the clipboard, and the clipboard will
transparently handle transcoding as needed.  I would note that this
description applies equally well to the 'as for CIF1' proposal, with the
potential simplification that, if source and target texts are known to be
in the same encoding, no transcoding is necessary.
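The transcoding step itself is a one-liner in any Unicode-aware language. The sketch below is Python rather than Qt's C++, and the function name `transcode` is my own invention, but it captures essentially all of the work the clipboard layer has to do:

```python
def transcode(data: bytes, src_encoding: str, dst_encoding: str) -> bytes:
    """Decode bytes from the source encoding into code points, then
    re-encode those code points in the target encoding."""
    return data.decode(src_encoding).encode(dst_encoding)

# Text cut from a win-1251 (Cyrillic) web page, pasted into a UTF-8 buffer:
pasted = "\u0442\u0435\u043a\u0441\u0442".encode("cp1251")  # 'текст'
as_utf8 = transcode(pasted, "cp1251", "utf-8")
assert as_utf8.decode("utf-8") == "\u0442\u0435\u043a\u0441\u0442"
```

The hard part, as always, is not the conversion but knowing the source encoding, which is exactly the information QMimeData supplies.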

I suggest you download the free Qt-based editor 'Juffed' and play with
cutting and pasting from international web pages.  I have just pasted
bits of text encoded in euc-jp, utf-8 and win-1251 into Juffed, and all
displayed correctly.  As Juffed is built on the same libraries and
technology as publCIF, I think your worries are unfounded.

Note also that cutting and pasting is a user-mediated operation, so the
user sees both the input text and the output text.  This means that
transcoding errors (which, as others have reported, may occasionally
occur for single characters) are more likely to be caught than in a
situation where transcoding is done silently in the background.

Import of a CIF file from some undefined location: under my UTF8/16
proposal there is no issue doing this, as the file is required to be
UTF8/16.  Under the 'as for CIF1' proposal (which Brian paradoxically
supports?), or even the 'local + UTF8/16' proposal, you are *on your own*
as far as determining the source file encoding goes, and I know of no
reliable automated solution.  As a practical matter, because publCIF is
interactive, you could prompt the user to specify the encoding when
UTF8/16 is not detected, in the same way that browsers allow the encoding
to be set.  But that latter behaviour is entirely your decision.
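The detection that *is* possible under the UTF8/16 proposal can be sketched in a few lines of Python; `detect_cif2_encoding` is a hypothetical helper, and the usual caveat applies that a short run of bytes in some other 8-bit encoding can occasionally happen to decode as valid UTF-8:

```python
import codecs

def detect_cif2_encoding(data: bytes):
    """Return the encoding when the file identifies itself (a byte-order
    mark, or a clean UTF-8 decode); otherwise return None, signalling that
    the application should fall back to asking the user, as a browser's
    encoding menu does."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None  # over to the user

assert detect_cif2_encoding(b"data_example\n") == "utf-8"
assert detect_cif2_encoding(codecs.BOM_UTF16_LE + "_tag".encode("utf-16-le")) == "utf-16"
assert detect_cif2_encoding("caf\u00e9".encode("latin-1")) is None  # not valid UTF-8
```

No such sniffing is possible for an arbitrary 'local' encoding, which is precisely the point.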

While I'm rewriting publCIF in my head, I will note in passing that in terms
of fonts, publCIF is already well set up.  The Linux version of publCIF
allows me to choose a Unicode font, for example, which displays Greek
symbols perfectly, and Windows will have its own Unicode fonts available.

So, I do not think that publCIF can be used as a way to distinguish between
the competing proposals on the table.

all the best,
James.
==========================
Brian's email in full for reference


My vote:

Preference  Option
 1        2. Herbert's 'as for CIF1 proposal with UTF8 in place of
             ASCII', together with Brian's *recommendations*
 2        1. Herbert's 'as for CIF1 proposal with UTF8 in place of
             ASCII' recently posted here and to COMCIFS.
 3        4. UTF8 + UTF16
 4        3. UTF8-only as in the original draft
 5        5. UTF8, UTF16 + "local"

Rationale: I still feel this argument is at heart a "binary/text"
dichotomy, where "binary" implies that one can prescribe specific
byte-level representations of every distinct character; "text"
implies that you're at the mercy of external libraries and mappings
between encoding conventions - and those mappings are not always
explicit or easy to identify.

I sympathise greatly with James's desire for a prescriptive, "binary"
approach, but its corollary is that a CIF application must take full
responsibility for expressing any supported extended character set (I
mean accented Latin letters, Greek characters, Cyrillic or Chinese
alphabets).

First off, I don't know how difficult that is technically. I would
guess that rather than trying to handle arbitrary keyboard mappings,
the natural approach would be to pick from a graphical character
grid. (What are the implications for this of glyph rendering - does
a CIF editor have to be compiled with its own large font library?)

But that's a laborious method of authoring if relatively large amounts
of "non-standard" text are involved, and the way that authors would
prefer to work, surely, is by copying and pasting text from Word or
some other tool of choice. Permitting that necessarily pollutes the
"binary" approach with byte streams delivered by text-oriented
applications.

If I could be sure that publCIF, say, can be compiled with libraries
that reliably transcode byte streams imported from clipboards and
file import (across the mess of SMB/NFS mounts etc. that exist in
the real world) - and equally reliably transcode its UTF8 encoded text
to the author's locale-based clipboard, then I'd be more willing to
promote option 3 to the top as the starting point at least for CIF
2.0 (but its "enforcement" does depend on the availability of such a
robust CIF-editing tool).

I prefer the UTF8 + UTF16 option over UTF8-only because of the
real-world use case that Herbert has described before; and in
existing imgCIF applications the UTF16 encoding is being done
rather carefully and for a specific purpose.

I put option 5 at the bottom because of the non-portability of a
"local" encoding.

Note, though, that whatever the outcome I would still favour the
discussion of character set encodings to be presented as a Part 3
to the complete CIF2 spec.

Best wishes
Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail:  bm at iucr.org
5 Abbey Square, Chester CH1 2HU, England

-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148