[Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .

Herbert J. Bernstein yaya at bernstein-plus-sons.com
Thu Aug 26 02:02:26 BST 2010


With software, we do "release candidates".  I would suggest that the
proponents of the UTF8-only approach prepare their CIF2 release candidate
and that those of us who favor a more general encoding approach prepare
our release candidate, that we put both forward to the communities
involved and see what reaction we get.

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya at dowling.edu
=====================================================

On Wed, 25 Aug 2010, SIMON WESTRIP wrote:

> "to present the various ideas to the community in the form of
> a completed standard with supporting software and see if they accept
> it"
> 
> I tend to agree - the stumbling block is the "completed standard"
> (at least w.r.t. encoding?)
> 
> :-)
> 
> 
> ____________________________________________________________________________
> From: Herbert J. Bernstein <yaya at bernstein-plus-sons.com>
> To: Group for discussing encoding and content validation schemes for CIF2
> <cif2-encoding at iucr.org>
> Sent: Thursday, 26 August, 2010 0:57:44
> Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line
> . .. .. .. .. .. .. .. .. .. .. .. .. .. .
> 
> While I disagree with these estimates of how various communities will
> react, the best way to find out is not for us to debate among ourselves,
> but to present the various ideas to the community in the form of
> a completed standard with supporting software and see if they accept
> it.  In the case of core CIF, that community has accepted what they
> were offered.  In the case of mmCIF, that community has essentially
> rejected what they were offered.  So, after all these years of
> effort on CIF2, isn't it past time to finish something, put it out
> there and see if it flies?
> 
> As for my own views:
> I remind you that XML is the end result of the essentially failed
> SGML effort followed by the highly successful HTML effort.  XML saved
> the SGML effort by adopting a large part of the simplicity and
> flexibility of HTML.  Please bear that in mind.
> 
> On Wed, 25 Aug 2010, SIMON WESTRIP wrote:
> 
> > Dear all
> >
> > Recent contributions have stimulated me to revisit some of the fundamental
> > issues of the possible changes in CIF2 with respect to CIF1,
> > in particular, the impact on current practice (as I perceive it, based on
> > my experience). The following is a summary of my thoughts, trying to
> > look at this from two perspectives (forgive me if I repeat earlier
> > opinions):
> >
> > 1) User perspective
> >
> > To date, in the 'core' CIF world (i.e. single-crystal and its extensions),
> > users treat CIFs as text files, and expect to be able to read them as such
> > using plain-text editors, and indeed edit them if necessary for e.g.
> > publication purposes. Furthermore, they expect them to be readable by
> > applications that claim that ability (e.g. graphics software).
> >
> > The situation is slightly different with mmCIF (and the pdb variants),
> > where users tend to treat these CIFs as data sources that can be read by
> > applications without any need to examine the raw CIF themselves, let alone
> > edit them.
> >
> > Although the above statements only encompass two user groups and are based
> > on my personal experience, I believe these groups are the largest when
> > talking about CIF users?
> >
> > So what is the impact on such users of introducing the use of non-ASCII
> > text and thus raising the text encoding issue?
> >
> > In the latter case, probably minimal, inasmuch as the users don't interact
> > directly with the raw CIF and rely on CIF processing software to manage
> > the data.
> >
> > In the former case, it is quite possible that a user will no longer be
> > able to edit the raw CIF using the same plain-text editor they have always
> > used for such purposes.
> > For example, if a user receives a CIF that has been encoded in UTF16 by
> > some remote CIF processing system, and opens it in a non-UTF16-aware
> > plain-text editor, they will not be presented with what they would expect,
> > even if the character set in that particular CIF doesn't extend beyond
> > ASCII; furthermore, even 'advanced' text editors would struggle if the
> > encoding were e.g. UTF16BE (i.e. has no BOM). Granted, this example is
> > equally applicable to CIF1, but by 'opening up' multiple encodings, the
> > probability of their usage increases?
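
The UTF16 example above can be made concrete with a minimal Python sketch.
The sample CIF fragment and the helper name `starts_with_utf16_bom` are
illustrative assumptions, not part of any CIF software. The sketch encodes
ASCII-only text as UTF-16 and then reads the bytes the way a tool that
assumes ASCII would.

```python
import codecs

# Illustrative ASCII-only CIF fragment (an assumption for this sketch).
text = "data_example\n_cell_length_a 5.43\n"

with_bom = text.encode("utf-16")    # Python's "utf-16" codec prepends a BOM
no_bom = text.encode("utf-16-be")   # big-endian, no BOM written


def starts_with_utf16_bom(raw: bytes) -> bool:
    """Return True if the bytes begin with either UTF-16 byte-order mark."""
    return raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


# A BOM gives an encoding-aware editor something to detect.
print(starts_with_utf16_bom(with_bom))
print(starts_with_utf16_bom(no_bom))

# Without a BOM every byte is below 0x80, so an ASCII-only tool decodes
# the file without any error, but a NUL precedes each character.
naive = no_bom.decode("ascii")
print(repr(naive[:8]))
```

The point of the sketch is that the BOM-less case fails silently: nothing
tells the ASCII-only reader that it has the wrong encoding.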
> >
> > So as soon as we move beyond ASCII, we have to accept that a large group
> > of CIF users will, at the very least, have to be aware that CIF is no
> > longer the 'text' format that they once understood it to be?
> >
> > 2) Developer perspective
> >
> > I believe that developers presented with a documented standard will follow
> > that standard and prefer to work with no uncertainties, especially if they
> > are unfamiliar with the format (perhaps just need to be able to read a CIF
> > to extract data relevant to their application/database...?)
> >
> > Taking the example of XML, in my experience developers seem to follow the
> > standard quite strictly. Most everyday applications that process XML are
> > intolerant of violations of the standard. Fortunately, it is largely only
> > developers that work with raw XML, so the standard works well.
> >
> > In contrast to XML, with HTML/javascript the approach to the 'standard' is
> > far more tolerant. Though these languages are standardized, in order to
> > compete, the leading application developers have had to adopt flexibility
> > (e.g. browsers accept 'dirty' HTML, are remarkably forgiving of syntax
> > violations in javascript, and alter the standard to achieve their own ends
> > or facilitate user requirements). I suspect this results largely from the
> > evolution of the languages: just as in the early days of CIF,
> > encouragement of use and the end results were more important than
> > adherence to the documented standard?
> >
> > Note that these same applications that are so tolerant of HTML/javascript
> > violations are far less forgiving of malformed XML. So is the lesson here
> > that developers expect new standards to be unambiguous and will code
> > accordingly (especially if the new standard was partly designed to address
> > the shortcomings of its ancestors)?
> >
> >
> > Again, forgive me if this all sounds familiar - however, before arguing
> > one way or the other with regard to specifics, perhaps the wider group
> > would like to confirm or otherwise the main points I'm trying to assert,
> > in particular, with respect to *user* practice:
> >
> > 1) CIF2 will require users to change the way they view CIF - i.e. they may
> > be forced to use CIF2-compliant text editors/application software, and
> > abandon their current practice.
> >
> > With respect to developers, recent coverage has been very insightful, but
> > just out of interest, would I be wrong in stating that:
> >
> > 2) Developers, especially those that don't specialize in CIF, are likely
> > to want a clear-cut universal standard that does not require any heuristic
> > interpretation.
> >
> > Cheers
> >
> > Simon
> >
> >
> >
> >____________________________________________________________________________
> > From: James Hester <jamesrhester at gmail.com>
> > To: Group for discussing encoding and content validation schemes for CIF2
> > <cif2-encoding at iucr.org>
> > Sent: Tuesday, 24 August, 2010 4:38:27
> > Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line
> > . .. .. .. .. .. .. .. .. .. .. .. .. .. .
> >
> > Thanks John for a detailed response.
> >
> > At the top of this email I will address this whole issue of optional
> > behaviour.  I was clearly too telegraphic in previous posts, as
> > Herbert thinks that optional whitespace counts as an optional feature,
> > so I go into some detail below.
> >
> > By "optional features" I mean those aspects of the standard that are
> > not mandatory for both readers and writers, and in addition I am not
> > concerned with features that do not relate directly to the information
> > transferred, e.g. optional warnings.  For example, unless "optional
> > whitespace" means that the *reader* may throw a syntax error when
> > whitespace is encountered at some particular point where whitespace is
> > optional, I do not view optional whitespace as an optional feature -
> > it is only optional for the writer.  With this definition of "optional
> > feature" it follows logically that, if a standard has such "optional
> > features", not all standard-conformant files will be readable by all
> > standard-conformant readers.  This is as true of HTML, XML and CIF1 as
> > it is of CIF2.  Whatever the relevance of HTML and XML to CIF, the
> > existence of successful standards with optional features proves only
> > that a standard can achieve widespread acceptance while having
> > optional features - whether these optional features are a help or a
> > hindrance would require some detailed analysis.
> >
> > So: any standard containing optional features requires the addition of
> > external information in order to resolve the choice of optional
> > features before successful information interchange can take place.
> >
> > Into this situation we place software developers.  These are the
> > people who play a big role in deciding which optional parts of the
> > standard are used, as they are the ones that write the software that
> > attempts to read and write the files.  Developers will typically
> > choose to support optional features based on how likely they are to be
> > used, which depends in part on how likely they are perceived to be
> > implemented in other software.  This is a recursive, potentially
> > unstable situation, which will eventually resolve itself in one of
> > three ways:
> >
> > (1) A "standard" subset of optional features develops and is
> > approximately always implemented in readers.  Special cases:
> >   (a) No optional features are implemented
> >   (b) All optional features are implemented
> > (2) A variety of "standard" subsets develop, dividing users into
> > different communities. These communities can't always read each
> > other's files without additional conversion software, but there is
> > little impetus to write this software, because if there were, the
> > developers would have included support for the missing options in the
> > first place.  The most obvious example of such communities would be
> > those based on options relating to natural languages, if those
> > communities do not care about accessibility of their files to
> > non-users of their language and encoding.
> > (3) A truly chaotic situation develops, with no discernable resolution
> > and a plethora of incompatible files and software.
> >
> > Outcome 1 is the most desirable, as all files are now readable by all
> > readers, meaning no additional negotiation is necessary, just as if we
> > had mandated that set of optional features.  Outcome 2 is less
> > desirable, as more software needs to be written and the standard by
> > itself is not necessarily enough information to read a given file.
> > Outcome 3 is obviously pretty unwelcome, but unlikely as it would
> > require a lot of competing influences, which would eventually change
> > and allow resolution into (1) or (2).  Think HTML and Microsoft.
> >
> > Now let us apply the above analysis to CIF: some are advocating not
> > exhaustively listing or mandating the possible CIF2 encodings (CIF1
> > did not list or mandate encoding either), leading to a range of
> > "optional features" as I have defined it above (where support for any
> > given encoding is a single "optional feature").  For CIF1, we had a
> > type 1 outcome (only ASCII encoding was supported and produced).
> >
> > So: my understanding of the previous discussion is that, while we
> > agree that it would be ideal if everyone used only UTF8, some perceive
> > that the desire to use a different encoding will be sufficiently
> > strong that mandating UTF8 will be ineffective and/or inconvenient.
> > So, while I personally would advocate mandating UTF8, the other point
> > of view would have us allowing non UTF8 encoding but hoping that
> > everyone will eventually move to UTF8.
> >
> > In which case I would like to suggest that we use network effects to
> > influence the recursive feedback loop experienced by programmers
> > described above, so that the community settles on UTF8 in the same way
> > as it has settled on ASCII for CIF1.  That is, we "load the dice" so
> > that other encodings are disfavoured.  Here are some ways to "load the
> > dice":
> >
> > (1) Mandate UTF8 only.
> > (2) Make support for UTF8 mandatory in CIF processors
> > (3) Force non UTF8 files to jump through extra hoops (which I think is
> > necessary anyway)
> > (4) Educate programmers on the drawbacks of non UTF8 encodings and
> > strongly urge them not to support reading non UTF8 CIF files
> > (5) Strongly recommend that the IUCr, wwPDB, and other centralised
> > repositories reject non-UTF8-encoded CIF files
> > (6) Make available hyperlinked information on system tools for dealing
> > with UTF8 files on popular platforms, which could be used in error
> > messages produced by programs (see (4))
> >
> > I would be interested in hearing comments on the acceptability of
> > these options from the rest of the group (I think we know how we all
> > feel about (1)!).
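
Options (2), (4) and (6) above can be sketched together as a reader that
accepts UTF8 only and, on failure, tells the user how to proceed. This is a
minimal Python sketch under stated assumptions: the function name
`read_cif2_utf8` and the wording of the error message are hypothetical, and
pointing the user at `iconv` is just one example of a system tool.

```python
def read_cif2_utf8(raw: bytes) -> str:
    """Decode CIF2 bytes, accepting UTF-8 only (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Reject rather than guess, and say where decoding failed.
        raise ValueError(
            "not valid UTF-8 at byte %d; convert the file to UTF-8 "
            "(for example with a tool such as iconv) and retry" % exc.start
        ) from exc


# Usage: the same text is accepted as UTF-8 and rejected as ISO-8859-1.
accepted = read_cif2_utf8("_name 'Å'\n".encode("utf-8"))

try:
    read_cif2_utf8("_name 'Å'\n".encode("iso-8859-1"))
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

Pure-ASCII files pass through unchanged because ASCII is a subset of UTF8,
so existing CIF1-style content is unaffected by such a reader. Strict
decoding is not a proof of encoding, only a check that catches most
non-UTF8 files containing non-ASCII bytes.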
> >
> > Now, returning to John's email: I will answer each of the points
> > inline, at the same time attempting to get all the attributions
> > correct.
> >
> > (James) I had not fully appreciated that Scheme B is intended to be
> > applied only at the moment of transfer or archiving, and envisions
> > users normally saving files in their preferred encoding with no hash
> > codes or encoding hints required (I will call the inclusion of such
> > hints and hashes 'decoration').
> >
> > (John) "Envisions users normally [...]" is a bit stronger than my
> > position or the intended orientation of Scheme B.  "Accommodates"
> > would be my choice of wording.
> >
> > (James now) No problem with that wording, my point is that such
> > undecorated files will be called CIF2 files and so are a target for
> > CIF2 software developers, thus "unloading" the dice away from UTF8 and
> > closer to encoding chaos.
> >
> > (James)  A direct result of allowing undecorated files to reside on
> > disk is that CIF software producers will need to write software that
> > will function with arbitrary encodings with no decoration to help
> > them, as that is the form that users' files will most often be in.
> >
> > (John) The standard can do no more to prevent users from storing
> > undecorated CIFs than it can to prevent users from storing CIF text
> > encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding.
> > More generally, all the standard can do is define the characteristics
> > of a conformant CIF -- it can never prevent CIF-like but
> > non-conformant files from being created, used, exchanged, or archived
> > as if they were conformant CIFs.  Regardless of the standard's
> > ultimate position on this issue, software authors will have to be
> > guided by practical considerations and by the real-world requirements
> > placed on their programs.  In particular, they will have to decide
> > whether to accept "CIF" input that in fact violates the standard in
> > various ways, and / or they will have to decide which optional CIF
> > behaviors they will support.  As such, I don't see a significant
> > distinction between the alternatives before us as regards the
> > difficulty, complexity, or requirements of CIF2 software.
> >
> > (James now) I have described the way the standard works to restrict
> > encodings in the discussion at the top of this email.  Briefly, CIF
> > software developers develop programs that conform with the CIF2
> > standard.  If that standard says 'UTF8', they program for UTF8.  If
> > you want to work in ISO-8859-15 etc, you have to do extra work.
> >
> > Working in favour of such extra work would be a compelling use case,
> > which I have yet to see (I note that the 'UTF8 only' standard posted
> > to ccp4-bb and pdb-l produced no comments).  My strong perception is
> > that any need for other encodings is overwhelmed by the utility of
> > settling on a single encoding, but that perception would need
> > confirmation from a proper survey of non-ASCII users.
> >
> > So, no, we can't stop people saving CIF-like files in other encodings,
> > but we can discourage it by creating significant barriers in terms of
> > software availability.  Just like we can't stop CIF1 users saving
> > files in JIS X 0208, but that doesn't happen at any level that causes
> > problems (if it happens at all, which I doubt).
> >
> > (John) Furthermore, no formulation of CIF is inherently reliable or
> > unreliable, because reliability (in this sense) is a characteristic of
> > data transfer, not of data themselves.  Scheme B targets the
> > activities that require reliability assurance, and disregards those
> > that don't.  In a practical sense, this isn't any different from
> > scheme A, because it is only when the encoding is potentially
> > uncertain -- to wit, in the context of data transfer -- that either
> > scheme need be applied (see also below).  I suppose I would be willing
> > to make scheme B a general requirement of the CIF format, but I don't
> > see any advantage there over the current formulation.  The actual
> > behavior of people and the practical requirements on CIF software
> > would not appreciably change.
> >
> > (James now) I would suggest that Scheme B does not target all
> > activities requiring reliability assurance, as it does not address the
> > situation where people use a mix of CIF-aware software and text tools
> > in a single encoding environment.
> >
> > The real, significant change that occurs when you accept Scheme B is
> > that CIF files can now be in any encoding and undecorated.
> > Programmers are then likely to provide programs that might or might
> > not work with various encodings, and users feel justifiably that their
> > undecorated files should be supported.  The software barrier that was
> > encouraging UTF8-only has been removed, and the problem of mismatched
> > encodings that we have been trying to avoid becomes that much more
> > likely to occur.  Scheme B has very few teeth to enforce decoration at
> > the point of transfer, as the software at either end is now probably
> > happy with an undecorated file.  Requiring decoration as a condition
> > of being a CIF2 file means that software will tend to reject
> > undecorated files, thereby limiting the damage that would be caused by
> > open slather encoding.
> >
> > (James)  Furthermore, given the ease with which files can be
> > transferred between users (email attachment, saved in shared,
> > network-mounted directory, drag and drop onto USB stick etc.) it is
> > unlikely that Scheme B or anything involving extra effort would be
> > applied unless the recipient demanded it.
> >
> > (John) For hand-created or hand-edited CIFs, I agree.  CIFs
> > manipulated via a CIF2-compliant editor could be relied upon to
> > conform to scheme B, however, provided that is standardized.  But the
> > same applies to scheme A, given that few operating environments
> > default to UTF-8 for text.
> >
> > (James now) That is my goal: that any CIF that passes through a
> > CIF-compliant program must be decorated before input and output (if
> > not UTF8).  What hand-edited, hand-created CIFs actually have in the
> > way of decoration doesn't bother me much, as these are very rare and
> > of no use unless they can be read into a CIF program, at which point
> > they should be rejected until properly decorated.  And I reiterate,
> > the process of applying decoration can be done interactively to
> > minimise the chances of incorrect assignment of encoding.
> >
> > (James)  And given how many times that file might have changed hands
> > across borders and operating systems within a single group
> > collaboration, there would only be a qualified guarantee that the
> > character to binary mapping has not been mangled en route, making any
> > scheme applied subsequently rather pointless.
> >
> > (John) That also does not distinguish among the alternatives before
> > us.  I appreciate the desire for an absolute guarantee of reliability,
> > but none is available.  Qualified guarantees are the best we can
> > achieve (and that's a technical assessment, not an aphorism).
> >
> > (James now) Oh, but I believe it does distinguish, because if CIF
> > software reads only UTF8 (because that is what the standard says),
> > then the file will tend to be in UTF8 at all points in time, with
> > reduced possibilities for encoding errors.  I think it highly likely
> > that each group that handles a CIF will at some stage run it through
> > CIF-aware software, which means encoding mistakes are likely to be
> > caught much earlier.
> >
> > (James) We would thus go from a situation where we had a single,
> > reliable and sometimes slightly inconvenient encoding (UTF8), to one
> > where a CIF processor should be prepared for any given CIF file to be
> > one of a wide range of encodings which need to be guessed.
> >
> > (John) Under scheme A or the present draft text, we have "a single,
> > reliable [...] encoding" only in the sense that the standard
> > *specifies* that that encoding be used.  So far, however, I see little
> > will to produce or use processors that are restricted to UTF-8, and I
> > have every expectation that authors will continue to produce CIFs in
> > various encodings regardless of the standard's ultimate stance.  Yes,
> > it might be nice if everyone and every system converged on UTF-8 for
> > text encoding, but CIF2 cannot force that to happen, not even among
> > crystallographers.
> >
> > (James now) You see little will to do this: but as far as I can tell,
> > there is even less will not to do it.  Authors will not "continue" to
> > produce CIFs in various encodings, as they haven't started doing so
> > yet.  As I've said above, CIF2 can certainly, if not force, encourage
> > UTF8 adoption.  What's more, non-ASCII characters are only gradually
> > going to find their way into CIF2 files, as the dictionaries and large
> > scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters
> > in names, and the users gradually adapt to this new way of doing
> > things.  I have no sense that CIF users will feel a strong desire to
> > use non UTF8 schemes, when they have been happy in an ASCII-only
> > regime up until now.  But I'm curious: on what basis are you saying
> > that there is little will to use processors that are restricted to
> > UTF8?
> >
> > (John) In practice, then, we really have a situation where the
> > practical / useful CIF2 processor must be prepared to handle a variety
> > of encodings (details dependent on system requirements), which may
> > need to be guessed, with no standard mechanism for helping the
> > processor make that determination or for allowing it to check its
> > guess.  Scheme B improves that situation by standardizing a general
> > reliability assurance mechanism, which otherwise would be missing.  In
> > view of the practical situation, I see no down side at all.  A CIF
> > processor working with scheme B is *more* able, not less.
> >
> > (James) I would much prefer a scheme which did not compromise
> > reliability in such a significant way.
> >
> > (John) There is no such compromise, because in practice, we're not
> > starting from a reliable position.
> >
> > (James now) I think your statement that our current position is not
> > reliable arises out of a perception that users are likely to use a
> > variety of encodings regardless of what the standard says.  I think
> > this danger is way overstated, but I'd like to see you expand on why
> > you think there is such a likelihood of multiple encodings being used
> >
> > (James) My previous (somewhat clunky) attempts to adjust Scheme B were
> > directed at trying to force any file with the CIF2.0 magic number to
> > be either decorated or UTF-8, meaning that software has a reasonably
> > high confidence in file integrity.
> >
> > An alternative way of thinking about this is that CIF files also act
> > as the mechanism of information transfer between software programs.
> > [... W]hen a separate program is asked to input that CIF, the
> > information has been transferred, even if that software is running on
> > the same computer.
> >
> > (John) So in that sense, one could argue that Scheme B already applies
> > to all CIFs, its assertion to the contrary notwithstanding.  Honestly,
> > though, I don't think debating semantic details of terms such as "data
> > transfer" is useful because in practice, and independent of scheme A,
> > B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to
> > choose what form of reliability assurance to accept or demand, if any.
> >
> > (James now) I was only debating semantic details in order to expose
> > the fact that data transfer occurs between programs, not just between
> > systems, and that therefore Scheme B should apply within a single
> > system, so therefore, all CIF2 files should be decorated.  As for who
> > should be demanding reliability assurance, the receiver may not be in
> > a position to demand some level of reliability if the file creator is
> > not in direct contact.  Again, we can build this reliability into the
> > standard and save the extra negotiation or loss of information that is
> > otherwise involved.
> >
> > (James) Now, moving on to the detailed contours of Scheme B and
> > addressing the particular points that John and I have been discussing.
> >  My original criticisms are the ones preceded by numerals.
> >
> > [(James now) I've deleted those points where we have reached
> > agreement.  Those points are:
> > (1) Restrict encodings to those for which the first line of a CIF file
> > provides unambiguous encoding for ASCII codepoints
> > (2) Put the hash value on the first line]
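
The two agreed points can be illustrated with a minimal Python sketch of a
"decorated" file. Everything about the layout here is an assumption made
for illustration: the thread does not give the first-line syntax, so the
`encoding=` and `hash=` fields, the choice of SHA-256, hashing the UTF-8
form of the body, and the names `decorate` and `read_decorated` are all
hypothetical. Only the idea (an ASCII-readable first line carrying the
encoding hint and the hash) comes from the discussion.

```python
import hashlib

MAGIC = "#\\#CIF_2.0"  # CIF2 magic number; the rest of the header is assumed


def decorate(body: str, encoding: str) -> bytes:
    """Prefix the body with a first line naming its encoding and a hash."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    header = f"{MAGIC} encoding={encoding} hash={digest}\n"
    return (header + body).encode(encoding)


def read_decorated(raw: bytes) -> str:
    """Decode using the declared encoding, then check the declared hash."""
    # Point (1): the first line must be readable as ASCII in every
    # permitted encoding, so it can be parsed before the encoding is known.
    header_bytes, _, payload = raw.partition(b"\n")
    fields = dict(
        item.split("=", 1) for item in header_bytes.decode("ascii").split()[1:]
    )
    body = payload.decode(fields["encoding"])
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != fields["hash"]:
        raise ValueError("hash mismatch: wrong encoding label or damaged file")
    return body


body = "_note 'caf\u00e9 \u20ac'\n"
raw = decorate(body, "iso-8859-15")
print(read_decorated(raw) == body)

# A wrong encoding label decodes without error but fails the hash check.
mislabelled = raw.replace(b"encoding=iso-8859-15", b"encoding=iso-8859-1")
try:
    read_decorated(mislabelled)
    caught = False
except ValueError:
    caught = True
print(caught)
```

The hash is what lets the reader check its decoding: the mislabelled file
decodes cleanly as ISO-8859-1, and only the mismatch reveals the error.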
> >
> > (James a long time ago) (4) Assumption that all recipients will be
> > able to handle all encodings
> >
> > (John) There is no such assumption.  Rather, there is an
> > acknowledgement that some systems may be unable to handle some CIFs.
> > That is already the case with CIF1, and it is not completely resolved
> > by standardizing on UTF-8 (i.e. scheme A).
> >
> > (James) There is no such thing as 'optional' for an information
> > interchange standard.  A file that conforms to the standard must be
> > readable by parsers written according to the standard. If reading a
> > standard-conformant file might fail or (worse) the file might be
> > misinterpreted, information cannot always reliably be exchanged using
> > this standard, so that optional behaviour needs to be either
> > discarded, or made mandatory. There is thus no point in including
> > optional behaviour in the standard. So: if the standard allows files
> > to be written in encoding XYZ, then all readers should be able to read
> > files written in encoding XYZ.  I view the CIF1 stance of allowing any
> > encoding as a mistake, but a benign one, as in the case of CIF1 ASCII
> > was so entrenched that it was the de facto standard for the characters
> > appearing in CIF1 files.  In short, we have to specify a limited set
> > of acceptable encodings.
> >
> > (John) As Herb astutely observed, those assertions reflect a
> > fundamental source of our disagreement.  I think we can all agree that
> > a standard that permits conforming software to misinterpret conforming
> > data is undesirable.
> >
> > Surely we can also agree that an information interchange standard does
> > not serve its purpose if it does not support information being
> > successfully interchanged.  It does not follow, however, that the
> > artifacts by which any two parties realize an information interchange
> > must be interpretable by all other conceivable parties, nor does it
> > follow that that would be a supremely advantageous characteristic if
> > it were achievable.  It also does not follow that recognizable failure
> > of any particular attempt at interchange must at all costs be avoided,
> > or that a data interchange standard must take no account of its usage
> > context.
> >
> > (James now) This is where we must make a policy decision: is a CIF2
> > file to be a universally understandable file?  I agree that excluding
> > optional behaviour is not an absolute requirement, but I also consider
> > that optional behaviour should not be introduced without solid
> > justification, given the real cost in interoperability and portability
> > of the standard.  You refer to two parties who wish to exchange
> > information: those parties are always free to agree on private
> > enhancements to the CIF2 standard (or to create their very own
> > protocol), if they are in contact.  I do not see why this use case
> > need concern us here.  Herbert can say to John 'I'm emailing you a
> > CIF2 file but encoded in UTF16'.  John has his extremely excellent
> > software which handles UTF16 and these two parties are happy.
> >
> > John mentions a 'usage context'.  If the standard is to include some
> > account of usage context, then that context has to be specified
> > sufficiently for a CIF2 programmer to understand what aspects of that
> > context to consider, and not left open to misinterpretation.  Perhaps
> > you could enlarge on what particular context should be included?
> >
> > (John) Optional and alternative behaviors are not fundamentally
> > incompatible with a data interchange standard, as XML and HTML
> > demonstrate.  Or consider the extreme variability of CIF text content:
> > whether a particular CIF is suitable for a particular purpose depends
> > intimately on exactly which data are present in it, and even to some
> > extent on which data names are used to present them, even though ALL
> > are optional as far as the format is concerned.  If I say 'This CIF is
> > unsuitable for my present purpose because it does not contain
> > _symmetry_space_group_name_H-M', that does not mean the CIF standard
> > is broken.  Yet, it is not qualitatively different for me to say 'This
> > CIF is unsuitable because it is encoded in CCSID 500' despite CIF2
> > (hypothetically) permitting arbitrary encodings.
> >
> > (James now)  The difference is quantitative and qualitative.
> > Quantitative, because the number of CIF2 files that are unsuitable
> > because of missing tags will always be less than or equal to the
> > number of CIF2 files that are unsuitable because of a missing tag or an
> > unknown encoding.  Thus, by reducing ambiguity at the lower levels of
> > the standard, we improve the utility at the higher levels.  The
> > difference is also qualitative, in that (a) if we have tags with
> > non-ASCII characters, they could conceivably be confused with other
> > tags if the encoding is not correct and so you will have a situation
> > where a file that is not suitable actually appears suitable, because
> > the desired tag appears. Likewise, the value taken by a tag may be
> > wrong.
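
To make the qualitative point concrete, here is a minimal Python sketch. The data name and the three legacy encodings are chosen purely for illustration; they are assumptions and are not taken from any CIF dictionary. The same bytes decode to three different, equally plausible data names depending on which single-byte encoding the reader assumes, so a lookup can silently miss the intended tag or hit an unintended one.

```python
# Hypothetical data name containing one non-ASCII byte (0xE9).
raw = b"_sample_\xe9_label"

# Each single-byte encoding maps 0xE9 to a different character.
as_latin1 = raw.decode("iso-8859-1")   # Western European reading
as_greek = raw.decode("iso-8859-7")    # Greek reading
as_cyrillic = raw.decode("koi8-r")     # Russian reading


def tag_matches(wanted: str, decoded: str) -> bool:
    """Return True if the decoded data name equals the one being looked up."""
    return wanted == decoded
```

A reader looking up the Latin-1 form of the name finds it only when the file is decoded as ISO 8859-1; under the other two assumptions the decode succeeds without error but yields a different name.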
> >
> > (James a long time ago) (iii) restrict possible encodings to
> > internationally recognised ones with well-specified Unicode mappings.
> > This addresses point (4)
> >
> > (John) I don't see the need for this, and to some extent I think it
> > could be harmful.  For example, if Herb sees a use for a scheme of
> > this sort in conjunction with imgCIF (unknown at this point whether he
> > does), then he might want to be able to specify an encoding specific
> > to imgCIF, such as one that provides for multiple text segments, each
> > with its own character encoding.  To the extent that imgCIF is an
> > international standard, perhaps that could still satisfy the
> > restriction, but I don't think that was the intended meaning of
> > "internationally recognised".
> >
> > (James now)  Indeed.  My intent with this specification was to ensure
> > that third parties would be able to recover the encoding. If imgCIF is
> > going to cause us to make such an open-ended specification, it is
> > probably a sign that imgCIF needs to be addressed separately.  For
> > example, should we think about redefining it as a container format,
> > with a CIF header and UTF16 body (but still part of the
> > "Crystallographic Information Framework")?
> >
> > (John) As for "well-specified Unicode mappings", I think maybe I'm
> > missing something.  CIF text is already limited to Unicode characters,
> > and any encoding that can serve for a particular piece of CIF text
> > must map at least the characters actually present in the text.  What
> > encodings or scenarios would be excluded, then, by that aspect of this
> > suggestion?
> >
> > (James) My intention was to make sure that not only the particular
> > user who created the file knew this mapping, but that the mapping was
> > publicly available.  Certainly only Unicode-encodable code points
> > will appear, but the recipient needs to be able to recover the mapping
> > from the file bytes to Unicode without relying on e.g. files that will
> > be supplied on request by someone whose email address no longer works.
> >
> > (John) This issue is relevant only to the parties among whom a
> > particular CIF is exchanged.  The standard would not particularly
> > assist those parties by restricting the permitted encodings, because
> > they can safely ignore such restrictions if they mutually agree to do
> > so (whether statically or dynamically), and they (specifically, the
> > CIF originator) must anyway comply with them if no such agreement is
> > implicit or can be reached.
> >
> > (James) Again, any two parties in current contact can send each other
> > files in whatever format and encoding they wish.  My concern is that
> > CIF software writers should not be drawn into supporting obscure or
> > ad hoc encodings.
> >
> > (John) B) Scheme B does not use quite the same language as scheme A
> > with respect to detectable encodings.  As a result, it supports
> > (without tagging or hashing) not just UTF-8, but also all UTF-16 and
> > UTF-32 variants.  This is intentional.
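
One way a reader might recognise the UTF-8, UTF-16 and UTF-32 variants without any tagging or hashing is sketched below in Python. The detection rules of scheme B are not reproduced in this message, so byte-order-mark sniffing with a strict UTF-8 fallback is an assumption made for illustration, and `sniff_unicode_encoding` is a hypothetical helper name.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_unicode_encoding(data: bytes) -> str:
    """Guess a Unicode encoding from a leading BOM (illustrative only)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: accept the bytes as UTF-8 only if they decode strictly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"
```

Plain ASCII content falls through to the UTF-8 branch, since every ASCII byte sequence is also valid UTF-8.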
> >
> > (James) I am concerned that the vast majority of users based in
> > English-speaking countries (and many non-English-speaking countries)
> > will be quite annoyed if they have to deal with UTF-16/32 CIF2 files
> > that are no longer accessible to the simple ASCII-based tools and
> > software that they are used to.  Because of this, allowing undecorated
> > UTF-16/32 would be far more disruptive than forcing people to use
> > UTF-8 only.  Hence my stipulation on maintaining compatibility with
> > ASCII for undecorated files.
> >
> > (John) Supporting UTF-16/32 without tagging or hashing is not a key
> > provision of scheme B, and I could live without it, but I don't think
> > that would significantly change the likelihood of a user unexpectedly
> > encountering undecorated UTF-16/32 CIFs.  It would change only whether
> > such files were technically CIF-conformant, which doesn't much matter
> > to the user on the spot.  In any case, it is not the lack of
> > decoration that is the basic problem here.
> >
> > (James now)  Yes, that is true.  A decorated UTF16 file is just as
> > unreadable as an undecorated one in ASCII tools.  However, per my
> > comments at the start of this email, I think an extra bit of
> > hoop-jumping for non-UTF8-encoded files has the desirable property of
> > encouraging UTF8 use.
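
The point about ASCII tools can be illustrated with a short Python sketch. The CIF fragment and the helper `ascii_tool_finds` are made up for the example, and a plain byte-substring search stands in for a simple ASCII-based tool such as grep. For ASCII-only content the UTF-8 bytes are identical to the ASCII bytes, whereas the UTF-16 bytes interleave a zero byte with every character, so the search fails.

```python
# Illustrative CIF fragment containing only ASCII characters.
text = "data_example\n_cell_length_a 5.43\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")


def ascii_tool_finds(needle: bytes, content: bytes) -> bool:
    """Mimic a byte-oriented ASCII tool searching for a data name."""
    return needle in content
```

With these definitions, `ascii_tool_finds(b"_cell_length_a", utf8_bytes)` succeeds and the same search against `utf16_bytes` does not, whether or not the file carries any decoration.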
> >
> > (John) C) Scheme B is not aimed at ensuring that every conceivable
> > receiver be able to interpret every scheme-B-compliant CIF.  Instead,
> > it provides receivers the ability to *judge* whether they can
> > interpret particular CIFs, and afterwards to *verify* that they have
> > done so correctly.  Ensuring that receivers can interpret CIFs is thus
> > a responsibility of the sender / archive maintainer, possibly in
> > cooperation with the receiver / retriever.
> >
> > (James) As I've said before, I don't see the paradigm of live
> > negotiation between senders and receivers as very useful, as it fails
> > to account for CIFs being passed between different software (via
> > reading/writing to a file system), or CIFs where the creator is no
> > longer around, or technically unsophisticated senders where, for
> > example, the software has produced an undecorated CIF in some native
> > encoding and the sender has absolutely no idea why the receiver (if
> > they even have contact with the receiver!) can't read the file
> > properly.   I prefer to see the standard that we set as a substitute
> > for live negotiation, so leaving things up to the users is in that
> > sense an abrogation of our responsibility.
> >
> > (John) That scenario will undoubtedly occur occasionally regardless of
> > the outcome of this discussion.  If it is our responsibility to avoid
> > it at all costs then we are doomed to fail in that regard.  Software
> > *will* under some circumstances produce undecorated, non-UTF-8 "CIFs"
> > because that is sometimes convenient, efficient, and appropriate for
> > the program's purpose.
> >
> > I think, though, those comments reflect a bit of a misconception.  The
> > overall purpose of CIF supporting multiple encodings would be to allow
> > specific CIFs to be better adapted for specific purposes.  Such
> > purposes include, but are not limited to
> >
> > () exchanging data with general-purpose program(s) on the same system
> > () exchanging data with crystallography program(s) on the same system
> > () supporting performance or storage objectives of specific programs or
> > systems
> > () efficiently supporting problem or data domains in which Latin text
> > is a minority of the content (e.g. imgCIF)
> > () storing data in a personal archive
> > () exchanging data with known third parties
> > () publishing data to a general audience
> >
> > *Few, if any, of those uses would be likely to involve live
> > negotiation.*  That's why I assigned primary responsibility for
> > selecting encodings to the entity providing the CIF.  I probably
> > should not even have mentioned cooperation of the receiver; I did so
> > more because it is conceivable than because it is likely.
> >
> > (James now) OK, fair enough. My issue, then, with the paradigm of
> > provider-based encoding selection is that it only works where the
> > provider is capable of making this choice, and it puts that
> > responsibility on all providers, large and small.  Of course, I am
> > keen to construct a CIF ecology where providers always automatically
> > choose UTF8 as the "safe" choice.
> >
> > (John) Under any scheme I can imagine, some CIFs will not be well
> > suited to some purposes.  I want to avoid the situation that *no*
> > conformant CIF can be well suited to some reasonable purposes.  I am
> > willing to forgo the result that *every* conformant CIF is suited to
> > certain other, also reasonable purposes.
> >
> > (James now) Fair enough.  However, so far the only reasonable purpose
> > that I can see for which a UTF8 file would not be suitable is
> > exchanging data with general-purpose programs that do not cope with
> > UTF8, and it may well be that with a bit of research the list of such
> > programs would turn out to be rather short.
> >
> >
> >
> > --
> > T +61 (02) 9717 9907
> > F +61 (02) 9717 3145
> > M +61 (04) 0249 4148