[Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line

Thu Aug 26 12:24:49 BST 2010

Um, but CIF1 is _not_ ASCII-only.  It is text in any acceptable local
encoding.

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya at dowling.edu
=====================================================

On Thu, 26 Aug 2010, James Hester wrote:

> Hi Simon and others,
>
> What Simon describes accords closely with my perception of the
> situation, except that your final point regarding CIF2 requiring users
> to abandon text editors will depend on how we resolve the encoding
> issue.  For me the logical conclusion from the points you make is to
> stick to UTF8-only encoding which will keep the large majority of
> users and developers happy.  Unfortunately others have the perception
> that UTF8-only will be overly restrictive, and lacking hard data we
> are having trouble deciding which of these two perceptions is
> correct.  Clearly UTF8-only is not overly restrictive *now* because it
> is *less* restrictive than the (de-facto) CIF1 situation of ASCII-only
> which has served us well.  UTF8 may be restrictive in the future when
> users of non-Latin-1 code points find that they don't know how to use,
> or can't use, their favourite text editors to put those code points
> into a CIF, but I'm not sure even the users themselves could answer
> the question now as to how likely that is going to be.
>
> What I would suggest as a cautious compromise is to leave the door
> open for adding non-UTF8 encodings in the future, but not to describe
> any scheme for doing this at present.  One way to leave the door open
> like this would be to declare that the first line of a CIF2 file is
> 'special', and is reserved for future expansion.  Our discussions on
> Scheme B are sufficiently far advanced to indicate that conventions
> relating to encoding schemes could be managed in the first line. The
> question of how strictly something like Scheme B should be applied
> remains open, and could be addressed once more in-field experience has
> been gained.
>
>
> On Thu, Aug 26, 2010 at 9:08 AM, SIMON WESTRIP
> <simonwestrip at btinternet.com> wrote:
>> Dear all
>>
>> Recent contributions have stimulated me to revisit some of the fundamental
>> issues of the possible changes in CIF2 with respect to CIF1,
>> in particular, the impact on current practice (as I perceive it, based on my
>> experience). The following is a summary of my thoughts, trying to
>> look at this from two perspectives (forgive me if I repeat earlier
>> opinions):
>>
>> 1) User perspective
>>
>> To date, in the 'core' CIF world (i.e. single-crystal and its extensions),
>> users treat CIFs as text files, and expect to be able to read them as such
>> using plain-text editors, and indeed edit them if necessary for e.g.
>> publication purposes. Furthermore, they expect them to be readable by
>> applications that claim that ability (e.g. graphics software).
>>
>> The situation is slightly different with mmCIF (and the PDB variants), where
>> users tend to treat these CIFs as data sources that can be read by
>> applications without any need to examine the raw CIF themselves, let alone
>> edit them.
>>
>> Although the above statements only encompass two user groups and are based
>> on my personal experience, I believe these groups are the largest when
>> talking about CIF users?
>>
>> So what is the impact on such users of introducing the use of non-ASCII text
>> and thus raising the text encoding issue?
>>
>> In the latter case, probably minimal, inasmuch as the users don't interact
>> directly with the raw CIF and rely on CIF processing software to manage the
>> data.
>>
>> In the former case, it is quite possible that a user will no longer be able
>> to edit the raw CIF using the same plain-text editor they have always used
>> for such purposes.
>> For example, if a user receives a CIF that has been encoded in UTF16 by some
>> remote CIF processing system, and opens it in a non-UTF16-aware plain-text
>> editor, they will not be presented with what they would expect, even if the
>> character set in that particular CIF doesn't extend beyond ASCII;
>> furthermore, even 'advanced' text editors would struggle if the encoding
>> were e.g. UTF16BE (i.e. had no BOM). Granted, this example is equally
>> applicable to CIF1, but by 'opening up' multiple encodings, the probability
>> of their usage increases?
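
To make the UTF16 example above concrete, here is a minimal Python sketch
(not part of any CIF specification or tool). The CIF-like sample text and
the helper name starts_with_utf16_bom are made up for illustration. It
shows that ASCII-only content has identical bytes in ASCII and UTF-8,
whereas the same content in UTF-16 takes two bytes per character, and that
the BOM-less UTF-16BE form carries no marker a program could use to
recognise the encoding.

```python
import codecs

# Hypothetical ASCII-only, CIF-like content (illustration only).
text = "data_example\n_cell_length_a 5.43\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")        # Python prepends a BOM here
utf16be_bytes = text.encode("utf-16-be")   # explicit byte order, no BOM


def starts_with_utf16_bom(data: bytes) -> bool:
    """Return True if the data begins with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
print(ascii_bytes == utf8_bytes)

# Only the BOM-carrying form can be recognised from its first bytes.
print(starts_with_utf16_bom(utf16_bytes))
print(starts_with_utf16_bom(utf16be_bytes))

# The first four characters ("data") as UTF-16BE: a NUL before each letter.
print(utf16be_bytes[:8])
```

A byte-oriented editor opening the UTF-16BE bytes would show a NUL before
every character, which is the "not what they would expect" case described
above, even though every character is plain ASCII.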
>>
>> So as soon as we move beyond ASCII, we have to accept that a large group of
>> CIF users will, at the very least, have to be aware that CIF is no longer
>> the 'text' format that they once understood it to be?
>>
>> 2) Developer perspective
>>
>> I believe that developers presented with a documented standard will follow
>> that standard and prefer to work with no uncertainties, especially if they
>> are unfamiliar with the format (perhaps they just need to be able to read a
>> CIF to extract data relevant to their application/database...?)
>>
>> Taking the example of XML, in my experience developers seem to follow the
>> standard quite strictly. Most everyday applications that process XML are
>> intolerant of violations of the standard. Fortunately, it is largely only
>> developers that work with raw XML, so the standard works well.
>>
>> In contrast to XML, with HTML/javascript the approach to the 'standard' is
>> far more tolerant. Though these languages are standardized, in order to
>> compete, the leading application developers have had to adopt flexibility
>> (e.g. browsers accept 'dirty' HTML, are remarkably forgiving of syntax
>> violations in javascript, and alter the standard to achieve their own ends
>> or facilitate user requirements). I suspect this results largely from the
>> evolution of the languages: just as in the early days of CIF, encouragement
>> of use and the end results were more important than adherence to the
>> documented standard?
>>
>> Note that these same applications that are so tolerant of HTML/javascript
>> violations are far less forgiving of malformed XML. So is the lesson here
>> that developers expect
>> new standards to be unambiguous and will code accordingly (especially if the
>> new standard was partly designed to address the shortcomings of its
>> ancestors)?
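
The contrast between strict XML and tolerant HTML handling described above
can be seen with Python's standard library. This is only a sketch:
html.parser is a tokenizer rather than a browser engine, so it merely stands
in for the 'forgiving' behaviour, and the malformed sample string, the
TagCollector class and the xml_accepts helper are hypothetical names chosen
for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical 'dirty' markup: the <b> element is never closed.
malformed = "<p>unclosed <b>bold</p>"


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_accepts(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TagCollector()
collector.feed(malformed)
collector.close()

print(xml_accepts(malformed))   # the XML parser rejects the input outright
print(collector.tags)           # the HTML tokenizer carries on regardless
```

The XML parser refuses the whole input because of a single mismatched tag,
while the HTML tokenizer reports both start tags without complaint.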
>>
>>
>> Again, forgive me if this all sounds familiar - however, before arguing one
>> way or the other with regard to specifics, perhaps the wider group would
>> like to confirm or otherwise the main points I'm trying to assert, in
>> particular, with respect to *user* practice:
>>
>> 1) CIF2 will require users to change the way they view CIF - i.e. they may
>> be forced to use CIF2-compliant text editors/application software, and
>> abandon their current practice.
>>
>> With respect to developers, recent coverage has been very insightful, but
>> just out of interest, would I be wrong in stating that:
>>
>> 2) Developers, especially those that don't specialize in CIF, are likely to
>> want a clear-cut universal standard that does not require any heuristic
>> interpretation.
>>
>> Cheers
>>
>> Simon
>>
>>
>>
>>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148