David Ness
Mind / Matter

MarkUp: Simpler is Better

By David Ness
Tuesday, February 19, 2002

The Problem of MarkUp

The purpose of this note is to develop a system for storing markup in text data bases. This paper starts by considering some of the design environments that we want our technology to be able to handle. These considerations then lead us to a set of test cases that can be used to validate and strengthen our design.

Some MarkUp Languages

Two straightforward, but nevertheless quite different, markup languages are TeX and HTML. One test of any conventions we develop in this paper will be to see how well they handle the problems that are created by these languages.

TeX

TeX is a language created by Prof. Donald Knuth as a vehicle for getting his manuscripts, which often contained a great deal of mathematics, typeset the way he wanted them. When mathematical equations were typeset by hand, experience showed that errors were commonly introduced, and errors are especially disconcerting in mathematics: `precise, but wrong' may well cause us more of a problem than `vague'.

TeX is a very elaborate language, and it allows very precise control over the placement of ink on paper. However, it also serves perfectly well as a conventional typesetter when we deal with simpler, plainer text.

HTML

HTML (HyperText Markup Language) is a widely shared specification of a language that forms the basis for most of the communication that takes place on the World Wide Web. HTTP (HyperText Transfer Protocol) servers on the Web deliver pages of HTML text, and it is up to the client browsers (Internet Explorer, Netscape, Opera, ...) to interpret the pages and show them on the computer screen.

HTML shares with TeX the characteristic that, for the most part, plain text will typeset reasonably well as is. Thus, without much trouble, one is able to get a decent display either on paper or on the screen. And if more elaborate presentation is desired, both languages (each in its own way, of course) allow it.

Identifying Scope

TeX and HTML each identify the scope of any operation in rather different ways. Understanding the difference is helpful in working with our problem of language design.

TeX Scope

In TeX, scope is generally indicated by surrounding the text an operation applies to with curly braces: {...}. Actually, in TeX this represents a concept of considerable generality. A `scope' defines a limited field of interest within which we can change our typesetting characteristics with the sure knowledge that when the scope closes, the characteristics which were in place when the scope opened will be restored.
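
The save-and-restore behaviour of a scope can be sketched with a stack of snapshots. This is only an illustration of the idea, not a description of how TeX is implemented; the class name, the method names, and the `face' and `size' settings are all assumptions made for the example.

```python
class ScopedState:
    """Typesetting settings with TeX-like scope semantics (illustrative only)."""

    def __init__(self, **initial):
        self.current = dict(initial)
        self._saved = []  # one snapshot per scope that is currently open

    def open_scope(self):
        # Plays the role of '{': remember the settings in force right now.
        self._saved.append(dict(self.current))

    def close_scope(self):
        # Plays the role of '}': restore what was saved at the matching '{'.
        self.current = self._saved.pop()

    def set(self, name, value):
        # A change made inside a scope lasts only until that scope closes.
        self.current[name] = value


state = ScopedState(face="roman", size=10)
state.open_scope()
state.set("face", "bold")
state.open_scope()
state.set("size", 12)
inner = dict(state.current)    # bold, size 12
state.close_scope()
middle = dict(state.current)   # bold, size 10 again
state.close_scope()
outer = dict(state.current)    # roman, size 10 again
```

Whatever is changed inside a scope, closing it puts back exactly the settings that were in force when it opened, which is the guarantee described above.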

HTML Scope

In HTML, scope is indicated by a fairly consistent symbology. If a symbol such as <B> causes us to begin a `mode', then </B> will mark the end of it. Some HTML is a little casual about these conventions; for example, some markup is not very careful to close paragraphs with `official' markers, relying instead on conventions like extra blank lines.
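
The pairing convention, and the casual markup that ignores it, can be made concrete with a small check that reports which opening tags were never closed. This is a rough sketch built on Python's standard html.parser module; the class and function names are made up for the example, and it makes no attempt to handle tags that legitimately have no closing form.

```python
from html.parser import HTMLParser


class OpenTagTracker(HTMLParser):
    """Remember every opening tag that has not yet been matched by a close."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Match the close against the most recent opener of the same name.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def unclosed_tags(html):
    """Return the (lower-cased) tags that were opened but never closed."""
    tracker = OpenTagTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.open_tags


print(unclosed_tags("<B>Bold</B>"))                  # []
print(unclosed_tags("<P>one\n\n<P>two <B>x</B>"))    # ['p', 'p']
```

The second call shows the casual style: both paragraphs are opened, neither is closed, and only the blank line separates them.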

Fonts

Fonts are an incredible nuisance in typesetting. They are truly difficult to `live with'. On the other hand, as the old saying goes, you `can't live without them' either. Some of the `best brains' that have worried about typesetting have tried to deal with these issues for years, and we are still far short of any completely satisfactory answers.

The complications of fonts generally arise from the various aspects of size, face, family and other characteristics. For example, while a font may be rendered in various sizes, in general the sizes are not simple magnifications of one another. The human eye plays all kinds of optical tricks, and lines that `look right' at one size may well look either too thick or too thin at other sizes.

And the relationships between the various faces in a particular font family are not necessarily very clear. Questions of design and taste often interfere with making this easy. An italic font is not just a simple slanting of a roman font. Bold is not just a thickening of the lines of the glyphs.

So, taken all together, this means that fonts aren't easy to handle. And thus it might come as no surprise that they are handled rather differently in different environments.

TeX Fonts

TeX makes a big deal out of fonts. Indeed, on the way to finishing up the TeX typesetting project, Knuth took a several-year detour to create Metafont, a language that he then used to express the mathematical characteristics of the individual glyphs that make up a font. References to fonts in TeX typically involve two stages:

  1. linking a symbol to a storage font; and
  2. using the symbol when needed.

Fonts are typically introduced in the context of a particular scope.

This way of handling fonts has some characteristics that make it difficult to capture cumulative face change effects. For example, in TeX a bold slanted font is a completely different referent than either a bold or a slanted/italic font.
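
One way to picture the difficulty is as a lookup table in which every combination of faces needs its own entry. The sketch below is only a model of the problem: the table, the function name, and in particular the symbol \bfsl are hypothetical, chosen to show that the combined font must be linked separately rather than derived from \bf and \sl.

```python
# Hypothetical table linking sets of faces to font symbols (illustrative only).
FONT_SYMBOLS = {
    frozenset(): r"\rm",
    frozenset({"bold"}): r"\bf",
    frozenset({"slanted"}): r"\sl",
    frozenset({"bold", "slanted"}): r"\bfsl",  # hypothetical: its own entry
}


def font_symbol(active_faces):
    """Return the symbol for exactly this combination of faces."""
    key = frozenset(active_faces)
    if key not in FONT_SYMBOLS:
        # Nothing is derived by combining entries: an unlinked mix is an error.
        raise KeyError(f"no font symbol linked for faces: {sorted(key)}")
    return FONT_SYMBOLS[key]


print(font_symbol(["bold"]))              # \bf
print(font_symbol(["bold", "slanted"]))   # \bfsl
```

Asking for a combination that nobody linked in advance, such as bold italic here, fails outright, which is the cumulative-effect problem in miniature.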

HTML Fonts

HTML files treat fonts somewhat more simply.

Text Emphasis

TeX Style

HTML Style

Approaches

The Markup Translator

Control Elements

Hierarchy

Links

Enumerations / Lists

Tables

We have not yet implemented tables.

Images

We have not yet had much experience with managing images in our text data bases, so this kind of object is left for later design consideration.

Logical MarkUp

Logical vs. Physical Distinctions

Why Logical?

Linking Logical and Physical

Overspecification

Saying What We Mean

The important characteristic of logical markup is that it allows us to say what we mean, without requiring us to say a lot about things that aren't the central focus of our concern.

Other Tries

The Markup Language

General Forms

Implementation

  • i [italics]
    {i This is italics.} causes text to appear in italics.
  • b [bold]
    {b This is bold.} causes text to appear in bold.
  • n [newline]
    {n} causes text that follows to appear on a new line.
  • c [curly braces]
    {c Text here} causes curly braces to surround `Text here'.
  • # [numbered list]
    {#} introduces a numbered list.
  • * [bullet list]
    {*} introduces a bulleted list.
  • _ [end list]
    {_} ends either a numbered or bulleted list.
  • e [enumerated item]
    {e Item} `Item' is displayed as a numbered or bulleted list element.
  • d [definition]
    {def Name=Value} defines `Name' to have the value `Value'.
  • r [recall]
    {r Name} recalls (and is replaced by) `Name' which was defined in an earlier definition command.
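
A translator for part of the command list above might look something like the following. It is a minimal sketch under several assumptions: the function names and the two output tables are invented for the example, the TeX line break and brace forms are just one plausible choice, only the i, b, n, c, definition and recall commands are handled (the list commands are left out), and no escaping of special characters is attempted.

```python
# Assumed output forms for each command; these tables are illustrative choices.
HTML = {"i": ("<I>", "</I>"), "b": ("<B>", "</B>"),
        "n": ("<BR>", ""), "c": ("{", "}")}
TEX = {"i": (r"{\it ", "}"), "b": (r"{\bf ", "}"),
       "n": (r"\hfil\break ", ""), "c": (r"\{", r"\}")}
TARGETS = {"html": HTML, "tex": TEX}


def parse(text):
    """Turn markup into a nested list of strings and (command, children) pairs."""
    pos = 0

    def parse_items(inside_braces):
        nonlocal pos
        items, buf = [], []

        def flush():
            if buf:
                items.append("".join(buf))
                buf.clear()

        while pos < len(text):
            ch = text[pos]
            if ch == "{":
                flush()
                pos += 1
                start = pos
                while pos < len(text) and text[pos] not in " {}":
                    pos += 1
                command = text[start:pos]
                if pos < len(text) and text[pos] == " ":
                    pos += 1  # one space separates the command from its text
                items.append((command, parse_items(True)))
            elif ch == "}":
                if not inside_braces:
                    raise ValueError("unmatched '}'")
                pos += 1
                flush()
                return items
            else:
                buf.append(ch)
                pos += 1
        if inside_braces:
            raise ValueError("unclosed '{'")
        flush()
        return items

    return parse_items(False)


def render(items, target, definitions):
    """Render parsed markup for one target, sharing one table of definitions."""
    table = TARGETS[target]
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)  # plain text passes through unescaped
            continue
        command, children = item
        body = render(children, target, definitions)
        if command in ("d", "def"):      # {def Name=Value} produces no output
            name, _, value = body.partition("=")
            definitions[name] = value
        elif command == "r":             # {r Name} is replaced by its value
            out.append(definitions[body])
        elif command in table:
            opener, closer = table[command]
            out.append(opener + body + closer)
        else:
            raise ValueError(f"unknown command: {command!r}")
    return "".join(out)


def translate(text, target):
    return render(parse(text), target, {})


print(translate("{b Bold} and {i slanted {b both}}.", "html"))
print(translate("{b Bold} and {i slanted {b both}}.", "tex"))
```

The same stored text yields <B>Bold</B> for one target and {\bf Bold} for the other, so the choice of output language is made only at translation time.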

Operation

David Ness' summary of work can be found at http://mywebpages.comcast.net/dness

Languages like HTML allow us to mark up our text directly with `tags': <B>Bold</B> produces bold text, for example. In TeX, a different markup language, you might use {\bf Bold} to accomplish the same thing.

When we store text, it would be nice if we didn't have to commit to one or another of these particular forms. This paper not only tries to deal with that fact, it also attempts to go somewhat further into this deceptively simple-looking problem.

 
