Accent escape sequences

Joe Krahn krahn at niehs.nih.gov
Sat Mar 3 21:01:45 GMT 2007


Brian McMahon wrote:
> Dear Joe
> 
> We have recently exchanged a few messages off-list, and it is
> clear that you have an interest in, and perhaps some time for,
> working on CIF-based applications. It would be great if you would
> introduce yourself to the list with a brief indication of your
> current interests.
Recently, I have been working on some tools for data management in 
macromolecular programming, with an interest in combining force-field 
development with crystallography. The software idea is to create a 
framework for modular programming. Most applications are tied together 
into one big package, which makes individual experimentation difficult 
without digging through a lot of source code. It also typically means 
that individual contributors may give up ownership, as when a lot of 
community effort on programs like CNS got sucked into Accelrys, where 
the scientific development pretty much dies. My 
plan involves an "in-memory database", where modular units access 
molecular data using memory pointer look-ups by name. Then, a module 
programmer can (for example) add atom properties without modifying 
compiled data structures in the core code. It should also provide a 
natural way to tie in to scripting tools.
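
As a rough illustration of the "in-memory database" idea, here is a 
minimal Python sketch. The class and method names (MoleculeStore, 
add_property, lookup) are hypothetical and chosen only for this example; 
the actual framework is not described in enough detail here to say how 
it does this.

```python
class MoleculeStore:
    """Hypothetical in-memory store of named per-atom property columns."""

    def __init__(self, n_atoms):
        self.n_atoms = n_atoms
        self._columns = {}  # property name -> list with one value per atom

    def add_property(self, name, default=None):
        # A module registers a new per-atom property at run time,
        # without any change to the core data structures.
        if name in self._columns:
            raise KeyError(f"property already defined: {name}")
        self._columns[name] = [default] * self.n_atoms
        return self._columns[name]

    def lookup(self, name):
        # Look the column up once by name; the caller keeps the reference.
        return self._columns[name]


store = MoleculeStore(n_atoms=3)
store.add_property("element", default="C")
charges = store.add_property("partial_charge", default=0.0)
charges[1] = -0.4
```

A module that looks up "partial_charge" later gets the same list object, 
so the name look-up happens once and later access is direct.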

As for the CIF format, it is a fairly good fit to the molecular database 
concept. I realized that there seem to be no decent Fortran tools. The 
available Fortran code seems to be mostly inflexible F77 spaghetti code. 
Also, most of the C/C++ code is oriented towards multi-structure 
databases. I also want to keep things very simple, where no CIF 
dictionary is needed, with float/int types automatically recognized and 
stored as such. So, I decided to implement my own Fortran 95 CIF library. 
In the process, I realized that some parts of CIF and mmCIF are a bit 
ill-defined. Now that many people have used CIF, this seems like a good 
time to work out some of the unfinished details.
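
To make the "no dictionary needed" idea concrete, here is a minimal 
Python sketch of recognizing int and float values from the bare token 
alone. The function name classify and the two patterns are assumptions 
made for this example. It deliberately leaves values with a standard 
uncertainty in parentheses, such as 1.234(5), as strings.

```python
import re

# Assumed patterns for this sketch only; not taken from the CIF specification.
_INT = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def classify(token):
    """Return an int, a float, or the unchanged string for a bare token."""
    if _INT.fullmatch(token):
        return int(token)
    if _FLOAT.fullmatch(token):
        return float(token)
    return token  # includes the special values "." and "?"


values = [classify(t) for t in ["42", "-1.5e3", "C12", ".", "1.234(5)"]]
```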

> 
> Regarding the untidy typographic markup conventions in CIF text
> fields, what we currently have arises from the pragmatic
> requirements of our early 1991 (prehistoric!) CIF-handling
> procedures in Acta Cryst. We used TeX as a formatter, so
> the markup (initially) was somewhat TeX-like; but there was
> pressure on us not to rely on TeX, especially as many of our
> authors would have no experience of it. Thus a minimal set
> of markup was devised, requiring very little learning from
> authors, that covered most markup that in practice we came
> across in Acta C papers (which have rather little
> mathematical content). Very few additional codes were
> introduced; and, for example, the relatively recent <i> and
> <b> markup for italic and bold was chosen because
> non-specialist authors were beginning to become familiar
> with such codes in HTML markup.
> 
> The current arrangement is, in my opinion, very inelegant,
> but it is supported by publCIF, the IUCr's own CIF editor,
> and is workable within that tool's reasonably user-friendly
> interface.
> 
> To provide better formatting abilities, I think it would be
> preferable to allow text fields to contain markup in various
> different standard formats, suitably identified, and to
> pass the fields to appropriate handlers. The simplest way to
> do so would be to have a 'magic number' introducing each text
> field. There's an undocumented example of this inasmuch as
> ciftex, the old cif->TeX translator, passes through unchanged
> any text field beginning 
> ;%T   (i.e. it treats it as containing pure TeX markup).
> The 'magic number' might be a simple character sequence
> (%T for TeX, %L for LaTeX, %H html, %R RTF, %U Unicode...)
> or could be a more general, but more verbose, signature
> involving MIME headers:
> ;
> Content-Type: application/tex
> (this mimics the approach for embedding binary data in imgCIF files).
Something along those lines sounds good. One problem with the current 
multi-line text is that the text fields are often indented, with one 
character fewer in the first line to offset the semicolon. I think the 
multi-line format would be much simpler if the begin and end semicolons 
were both required to be the only character on a line, i.e. the 
text-block delimiter is "<eol>;<eol>" instead of just "<eol>;". Also, a 
line starting with a semicolon within the multi-line text would then 
not be a problem. A content-type tag could be placed on the line with 
the starting semicolon. A multi-line pattern would then be:
<eol>;<content-type><eol><multi-line text><eol>;<eol>
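
A minimal Python sketch of a reader for this pattern might look like the 
following. The function name read_text_field is hypothetical, the input 
is assumed to be a list of lines with the end-of-line characters already 
removed, and the "%T" tag is borrowed from the ciftex example quoted 
above.

```python
def read_text_field(lines, start):
    """Read one text field whose opening line is lines[start].

    Returns (content_type, text, index_after_closing_line).
    """
    opener = lines[start]
    if not opener.startswith(";"):
        raise ValueError("text field must open with ';' in column 1")
    content_type = opener[1:].strip() or None  # None if no tag is given
    body = []
    for i in range(start + 1, len(lines)):
        if lines[i] == ";":  # closing delimiter: a lone semicolon
            return content_type, "\n".join(body), i + 1
        body.append(lines[i])  # lines merely starting with ';' are kept
    raise ValueError("unterminated text field")


sample = [";%T", r"$\alpha$-helix", ";not a terminator", ";"]
ctype, text, next_index = read_text_field(sample, 0)
```

Because only a line consisting of exactly ";" closes the field, the 
third line of the sample stays part of the text.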


> 
> There's nothing fundamentally wrong with extending the existing
> special character sequences, and I'm happy to consider a
> specific proposal in terms of whether we could easily provide
> publCIF support for it. The problem is that the more one offers
> to the author, the more the author will want to do, and the more
> unwieldy an ad-hoc markup will become. (And recall that even
> TeX, which is unparalleled for mathematics, does not offer as
> primitives anywhere near all the symbols that our authors do
> use.)
> 
I think the current set IS fundamentally flawed. Any proper set of 
'escape' codes should be able to display the escape characters 
literally. Currently, there is no rule for displaying a backslash or 
caret without it potentially being recognized as an escape-code 
character.

I thought that the CIF codes were rather ad hoc, but realized that 
similar code sequences have been used elsewhere. The advantage of the 
current codes is that they are simple enough to be read fairly well in 
plain text form. For an archival format, I think that is a good thing.

My proposal is not just to make a huge list of character codes, but to 
define some simple rules that keep things from getting ad-hoc. 
Personally, I would not have included <I> and <B>. It would be a better 
fit to use old-style /italics/ and *bold*, specifically because CIF 
markup is not HTML.

Here is my idea. Note that the second rule provides the unescaped form 
of any special character by using a blank second character.

special character sequence          result

\<alphabetic>                       Greek letter
\<not alpha><char>                  combination of 2 chars
\\<one or more alpha chars><space>  named code


style rules:
superscript text:  ^text^
subscript text:    ~text~
italic text:       /text/
bold text:         *text*
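
The three escape rules in the table above can be sketched as a small 
tokenizer in Python. This is only an illustration: the function name 
tokenize and the token labels are made up for the example, and it 
assumes that the named-code rule is tried before the other two. The 
style rules are not handled.

```python
def tokenize(text):
    """Split text into (kind, value) tokens under the three escape rules."""
    tokens = []

    def add_text(ch):
        # Merge consecutive plain characters into one "text" token.
        if tokens and tokens[-1][0] == "text":
            tokens[-1] = ("text", tokens[-1][1] + ch)
        else:
            tokens.append(("text", ch))

    i, n = 0, len(text)
    while i < n:
        if text[i] != "\\":
            add_text(text[i])
            i += 1
        elif text.startswith("\\\\", i) and i + 2 < n and text[i + 2].isalpha():
            # \\<alpha chars><space>: named code. The terminating space is
            # consumed if present; this sketch does not insist on it.
            j = i + 2
            while j < n and text[j].isalpha():
                j += 1
            tokens.append(("named", text[i + 2:j]))
            i = j + 1 if j < n and text[j] == " " else j
        elif i + 1 < n and text[i + 1].isalpha():
            # \<alphabetic>: Greek letter.
            tokens.append(("greek", text[i + 1]))
            i += 2
        elif i + 2 < n:
            # \<not alpha><char>: two-character combination; a blank
            # second character gives the first character literally.
            first, second = text[i + 1], text[i + 2]
            if second == " ":
                tokens.append(("literal", first))
            else:
                tokens.append(("combined", first + second))
            i += 3
        else:
            raise ValueError("incomplete escape sequence at end of text")
    return tokens


greek = tokenize(r"\a-helix")
accent = tokenize(r"\'e")
caret = tokenize(r"x\^ 2")
named = tokenize(r"\\simeq 1")
backslash = tokenize(r"\\ ")
```

With this ordering, a literal backslash is written as two backslashes 
followed by a space, which falls under the second rule.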

Some of the existing named 'by convention' rules might be better written 
with the combined-character trigraph:
\\leftarrow   to  \\<-
\\rightarrow  to  \\->
\\simeq       to  \\~=
\\square      to  \\[]

I also think that the bare codes should be changed. How do I write "---" 
and not mean single bond?
--    to  \\--
+-    to  \\+-
-+    to  \\-+
---   to  \\sb

Single bond could also be "\\--", but only if other bond types are also 
visual.

Also, the italic and bold style suggestion would interfere a bit with 
equations if not written with separating spaces. But the caret sequence 
also conflicts with its use as an exponentiation operator, and nobody 
seems to mind the lack of a caret escape.

Joe


More information about the comcifs mailing list