[Cif2-encoding] Drafting issues

Fri Oct 1 05:44:29 BST 2010

Before I post my revised text, I have only just realised (upon close perusal
of the two texts) that Herbert's motion is substantially the same as the
'Changes' document, just without the headings etc, so we are discussing
almost the same document.  My apologies for the confusion.

James.

On Fri, Oct 1, 2010 at 2:37 PM, James Hester <jamesrhester at gmail.com> wrote:

> Dear Group,
>
> As I think we have reached a consensus in principle, and are now moving
> into discussion of precise definitions, let us have wording arguments only
> once (that is, for a single document).  I think that our base document must
> be the one that the DDLm group agreed on - the link once again is
> http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf - simply because it will be unnecessarily confusing for the DDLm group to
> deal with two documents at once, and the 'Changes' document is admirably
> precise.  I reiterate once again that I am happy with the motion that
> Herbert presented, with the proviso that one paragraph is rewritten as I
> have recently proposed.  Herbert - if you would like to negotiate that
> paragraph with me by Skype, I'm happy to do that too.
>
> I have appended a text version of what I consider to be the relevant
> sections of the 'changes' document to this message.  I am happy to provide
> the complete document in OpenOffice format to anybody who would like it.
> Herbert - if you think any of the non-encoding discussion in your motion is
> not already covered in the 'Changes' document, please advise.
>
> I will be posting my own suggestion, largely based on parts of the motion
> that Herbert and I drafted yesterday, in a reply to this email.
>
> CIF - Changes to the specification 05 July 2010
>
> This document specifies changes to the *syntax* of CIF. We refer to the
> current syntax specification of CIF as CIF1, and the new specification as
> CIF2. To date all archival CIFs are CIF1.
>
> The changes to syntax are necessitated by the adoption of new dictionary
> functionalities that introduce several extensions, including new data types,
> and method definitions using dREL.
>
> It is assumed the reader has a thorough understanding of the CIF1
> specification.
>
> TERMINOLOGY
>
> Reference to *character(s)* means abstract characters assigned code points
> by *Unicode*. Specific characters are referenced according to Unicode
> convention, U+*xxxx*[*x*[*x*]], where *xxxx*[*x*[*x*]] is the four- to
> six-digit hexadecimal representation of the assigned code point. The
> designated character encoding for CIF2 is UTF-8.
>
> Reference to *ASCII* characters means characters U+0000 through
> U+007F, or, equivalently, the first 128 characters of the *ISO 8859-1*
> (*LATIN 1*) character set.
>
> Reference to *newline* or *\n* means the sequence that conventionally
> terminates a line record (which is environment dependent). *See Change 3.*
>
> Reference to *whitespace* means the characters ASCII space (U+0020),
> ASCII horizontal tab (U+0009) and the *newline* characters. Without
> regard to local convention, the various other characters that Unicode
> classifies as whitespace (character categories Zs and Zp) do not constitute
> *whitespace* for the purposes of CIF2.
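
To make the whitespace definition concrete, here is a rough Python sketch. The names are mine, not from the 'Changes' document, and I have assumed that the *newline* characters are LF and CR; Change 3 (not quoted here) is what actually settles that.

```python
# Illustrative sketch only; names are hypothetical, not from the 'Changes' document.
CIF2_INLINE_WHITESPACE = {"\u0020", "\u0009"}  # ASCII space, horizontal tab

# ASSUMPTION: newline is made up of LF and/or CR. Change 3 defines the real rule.
CIF2_NEWLINE_CHARS = {"\u000A", "\u000D"}


def is_cif2_whitespace(ch: str) -> bool:
    """Return True only for the characters CIF2 treats as whitespace."""
    return ch in CIF2_INLINE_WHITESPACE or ch in CIF2_NEWLINE_CHARS
```

Note that this is deliberately narrower than Python's str.isspace(), which also accepts characters such as U+00A0 (category Zs) and U+2029 (category Zp).
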
>
> PREAMBLE
>
> CIF2 significantly extends CIF1 functionality, primarily through new
> dictionary features. CIF2 is not fully backwards-compatible with CIF1: many
> files compliant with CIF1 are also compliant with CIF2, but some are not
> (see especially change 5, below). The CIF1 standard will continue to operate
> for the foreseeable future in parallel with CIF2.
>
> CHANGE 1 ‒ NEW (MAGIC CODE)
>
> A CIF2 file is uniquely identified by a required magic code at the
> beginning of its first line. The code is,
>
> #\#CIF_2.0
>
> followed immediately by *whitespace*.
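
As a sanity check on the wording of Change 1, a reader might test for the magic code roughly as follows. This is a sketch under my own assumptions: the function name is hypothetical, and I have taken "whitespace" to mean space, tab, LF or CR.

```python
# Illustrative sketch only; has_cif2_magic is a hypothetical helper name.
CIF2_MAGIC = "#\\#CIF_2.0"  # the literal characters  #\#CIF_2.0

# ASSUMPTION: whitespace here means space, tab, LF or CR.
_WHITESPACE_AFTER_MAGIC = (" ", "\t", "\n", "\r")


def has_cif2_magic(first_line: str) -> bool:
    """Check that the first line starts with the magic code plus whitespace."""
    if not first_line.startswith(CIF2_MAGIC):
        return False
    following = first_line[len(CIF2_MAGIC):len(CIF2_MAGIC) + 1]
    # A strict reading: the code must be followed by a whitespace character,
    # so a first line that ends right after the code (no newline) is rejected.
    return following in _WHITESPACE_AFTER_MAGIC
```

Under this reading "#\#CIF_2.01" is rejected, because the character after the code is not whitespace. Whether a file that ends immediately after the code should also be rejected is something the wording may need to spell out.
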
>
> CHANGE 2 ‒ NEW (CHARACTER SET)
>
> CIF2 files are standard variable length text files, which for compatibility
> with older processing systems will have a maximum line length of 2048
> characters. As discussed above and below, however, there are some
> restrictions on the character set for token delimiters, separators and data
> names.
>
> In keeping with XML restrictions we allow the characters
>
> U+0009 U+000A U+000D
> U+0020 – U+007E
> U+00A0 – U+D7FF
> U+E000 – U+FDCF
> U+FDF0 – U+FFFD
> U+10000 – U+10FFFD
>
> In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is
> any hexadecimal digit are disallowed. Unicode reserves the code points E000 –
> F8FF for private use. The IUCr and only the IUCr may specify what
> characters are assigned to these code points in the context of CIF2.
>
> *Reasoning: There is growing demand for the wider character set afforded
> by Unicode to be made available in applications, especially those where
> internationalisation is an issue.*
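
For anyone wanting to check the Change 2 ranges mechanically, the following Python sketch encodes them as a table. All names are hypothetical. I have assumed that the 2048-character limit counts characters (not bytes) and excludes the line terminator, which the text does not state.

```python
# Illustrative sketch only; names are hypothetical, not from the 'Changes' document.
ALLOWED_RANGES = [
    (0x0009, 0x000A),    # tab, line feed
    (0x000D, 0x000D),    # carriage return
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]
MAX_LINE_LENGTH = 2048


def is_allowed_char(ch: str) -> bool:
    """Return True if ch is permitted by the Change 2 character set."""
    cp = ord(ch)
    # U+FEFF and every U+xFFFE / U+xFFFF are disallowed. The mask also matches
    # U+FFFE, U+FFFF and U+10FFFE/F, which lie outside the ranges anyway.
    if cp == 0xFEFF or (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def find_disallowed(text: str) -> list:
    """List (index, character) pairs for every disallowed character in text."""
    return [(i, ch) for i, ch in enumerate(text) if not is_allowed_char(ch)]


def line_too_long(line: str) -> bool:
    # ASSUMPTION: length is counted in characters, excluding the terminator.
    return len(line.rstrip("\r\n")) > MAX_LINE_LENGTH
```

Writing the ranges as data makes it easy to compare them line by line against the list in the document, which is the point of the exercise.
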
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
