Very Short Papers

By Steven Orla Kimbrough and David Ness
Thursday, March 21, 2002

Very Short Papers: Introduction

3/21/2002 5:00 PM

`Very Short Papers' (VSPs) are a collection of papers that each express a simple idea and contain a little discussion of it. The fundamental notion is that a one-paragraph idea can be expressed in just one paragraph, without being surrounded by lots of supporting and integrating text. The discussions are afforded the luxury of being very preliminary, and thus are generally not required to be carefully thought out. We expect each idea expressed and discussed in a VSP either to be promoted into a more fully developed paper or to be consigned to the circular file.

VSP#1: Untitled Paragraphs

An open question is whether a text block can contain more than one paragraph. There are two possibilities:

all paragraphs are separate; or
paragraphs are allowed to have parts (separate paragraphs).

What They Are

Paragraphs that have titles are handled naturally. However, there are at least two ways that we could handle untitled paragraphs. First, untitled paragraphs could simply be concatenated onto the titled paragraph that they follow. In that case some form of paragraph-separating marker is needed. The most commonly used marker is probably a double carriage return, but this risks confusion with the role of single carriage returns. Second, untitled paragraphs could be treated just like titled paragraphs but with a null title. In that case there would have to be some way of linking the paragraphs to their parents.
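To make the two options concrete, here is a minimal sketch in Python. Nothing in it is a settled design: the `Block` record, its `parent` field, the function names, and the choice of a blank line as the separator are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a blank line stands in for the "double carriage return" marker.
SEPARATOR = "\n\n"


@dataclass
class Block:
    """Hypothetical text block: a title (None if untitled), text, and a parent link."""
    title: Optional[str]
    text: str
    parent: Optional[int] = None  # index of the titled block this one belongs to


def concatenated(title: str, paragraphs: List[str]) -> Block:
    """Option 1: fold the untitled paragraphs into the titled block's text."""
    return Block(title=title, text=SEPARATOR.join(paragraphs))


def linked(title: str, paragraphs: List[str]) -> List[Block]:
    """Option 2: one block per paragraph; untitled ones get a null title and a parent link."""
    blocks: List[Block] = []
    for i, paragraph in enumerate(paragraphs):
        if i == 0:
            blocks.append(Block(title=title, text=paragraph))
        else:
            blocks.append(Block(title=None, text=paragraph, parent=0))
    return blocks


paras = ["First paragraph.", "Second paragraph."]
one = concatenated("A titled paragraph", paras)
many = linked("A titled paragraph", paras)

print(one.text.split(SEPARATOR))            # paragraphs recovered via the separator
print([(b.title, b.parent) for b in many])  # untitled blocks point back to block 0
```

The first form needs only the separator; the second needs no separator but does need the parent link.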

Argument #1: Convenience

The convenience argument suggests that untitled paragraphs should appear as a part of the text of the titled paragraph that they follow. This structure implies a couple of things:

we don't need to invent a linkage mechanism;
untitled paragraphs are unlikely candidates for any use other than as a part of the titled paragraphs to which they belong; and
we need to implement a paragraph separator of some sort.
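The separator mentioned in the last item could be sketched as follows. This is only one possible convention, assumed here for illustration: a blank line ends a paragraph, while a single line break inside a paragraph is treated as ordinary whitespace. The function name `split_paragraphs` is hypothetical.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split a text block into paragraphs on blank lines.

    Assumed convention: a blank line ends a paragraph; a single line
    break inside a paragraph is treated as ordinary whitespace.
    """
    # Normalize the common line-ending styles to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse soft line breaks and extra spaces inside each paragraph.
    return [" ".join(part.split()) for part in parts if part.strip()]


sample = (
    "First line of the titled paragraph,\n"
    "continued on a second line.\r\n"
    "\r\n"
    "An untitled paragraph that follows."
)
result = split_paragraphs(sample)
for paragraph in result:
    print(paragraph)
```

Under this convention the potential confusion between single and double carriage returns is resolved by fiat: only the double form carries structural meaning.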

Argument #2: Paragraph Detection

The paragraph-detection argument suggests that we do not need to invent a logical paragraph marker if every paragraph is stored as a structurally separate unit, since the structure itself then marks the paragraph boundaries.

Decision

Not made yet.

VSP#2: Input Forms

There are a number of different input forms, which are determined by a combination of:

markup;
purpose; and
convenience of manipulation.

Issues: Markup

There are many candidate markup languages. Each of them has some advantages and some disadvantages. For the most part we will choose to implement our own standard form of markup, but we may have occasion to support other forms in some documents.

Issues: Purpose

Different input forms may prove to be convenient for different purposes. For example, a document that may be rendered in a number of different contexts may conveniently have one particular kind of representation, whereas a document that is going to be processed by some other sort of transformation may have a very different base representation. Perhaps the simplest contrast in input forms can be seen by comparing code with text. In most situations, we have considerable flexibility in the storage form for text. A paragraph, for example, could be represented by a single line of text, by a `column' of text with a single word on each line, or by anything comfortable in between. No important inferences about content are made from the form of the text in a paragraph, so long as nothing like an empty line appears in it. On the other hand, code may have a very restricted structure in which virtually everything counts. We are often unable to move anything around without possibly affecting its significance and meaning.
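The contrast between text and code can be illustrated with a small sketch. It assumes Python as the example of `code' (no particular language is specified here), and the helper name `compiles` is hypothetical. The same paragraph is stored three ways with identical word content, while the same reflow applied to a code fragment breaks it.

```python
import textwrap


def compiles(source: str) -> bool:
    """Return True if the string is syntactically valid Python."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False


prose = "A paragraph could be stored as one long line or as a narrow column of words"

single_line = " ".join(prose.split())  # one (long) line
column = "\n".join(prose.split())      # one word per line
wrapped = textwrap.fill(               # something comfortable in between
    prose, width=30, break_long_words=False, break_on_hyphens=False
)

# The three layouts differ as strings but carry the same sequence of words.
same_words = single_line.split() == column.split() == wrapped.split()
print(same_words)

# Code is different: layout is part of the meaning.
code = "total = 0\nfor i in range(3):\n    total += i\n"
reflowed_code = " ".join(code.split())
print(compiles(code), compiles(reflowed_code))
```

The reflowed code no longer parses, which is the sense in which virtually everything counts in code.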

Issues: Convenience

The convenience of manipulating input data may also make different representations appropriate. For example, some processing tasks are best served by a form in which each paragraph is stored as a single (long) line of text, yet this would prove to be a very inconvenient form for most writing and editing tasks.

Decision

Not made yet.

VSP#3: Lossy Mappings

Some of the transformations between different forms of storage may involve loss of information. For example, comments that do not influence the final output of the document may be important in some situations but quite inconvenient in others, so a transformation may drop them. This means that some of the transformations we make are lossy, in the sense that they are not directly reversible (i.e., no complete inverse transformation exists) without reference to some external information.
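A minimal sketch of such a lossy mapping follows. The comment marker `%` and the function names are assumptions chosen for illustration, since no comment syntax has been fixed. Stripping comments maps two different sources onto the same output, so it has no inverse by itself; keeping a side record of what was dropped supplies the external information needed to reverse it.

```python
from typing import Dict, List, Tuple

# Assumption: a line whose first non-blank character is "%" is a comment.
COMMENT_PREFIX = "%"


def strip_comments(lines: List[str]) -> Tuple[List[str], Dict[int, str]]:
    """Drop comment lines; also return what was dropped, keyed by original index."""
    kept: List[str] = []
    dropped: Dict[int, str] = {}
    for i, line in enumerate(lines):
        if line.lstrip().startswith(COMMENT_PREFIX):
            dropped[i] = line
        else:
            kept.append(line)
    return kept, dropped


def restore(kept: List[str], dropped: Dict[int, str]) -> List[str]:
    """Rebuild the original lines from the stripped form plus the side record."""
    remaining = iter(kept)
    total = len(kept) + len(dropped)
    return [dropped[i] if i in dropped else next(remaining) for i in range(total)]


doc_a = ["% draft note", "Some text.", "  % todo: expand", "More text."]
doc_b = ["Some text.", "% a different comment", "More text."]

kept_a, dropped_a = strip_comments(doc_a)
kept_b, dropped_b = strip_comments(doc_b)

print(kept_a == kept_b)                      # two sources, one stripped form
print(restore(kept_a, dropped_a) == doc_a)   # reversible only with the side record
```

Whether the side record should be stored with the document or kept elsewhere is left open.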

Issues

Alternatives

Decision

Not made yet.

VSP#4: CVS and CMS

The issues of version control apply to documents just as they apply to complexes of programs. For the most part the issues for documents are somewhat simpler than those associated with programs, but the two problems have many aspects in common. For example, a complex document and a program may well share:

multiple authorship;
complex version control; and
issues of security and access control.

Programming a Document

Decision

Not made yet.

VSP#5: Meta-Documents

Candidates

Outlines

Diary / Calendar

Serializer

Decision

Not made yet.

VSP#6: Common Tasks

Construction / Deconstruction

Spell Check

Object Recognition and Markup

Decision

Not made yet.

VSP#7: Handling Code and Other Special Forms

Decision

Not made yet.

David Ness' summary of work can be found at http://mywebpages.comcast.net/dness and Steven O. Kimbrough's can be found at http://opim.wharton.upenn.edu/~sok/cv/cv.html
The TextDrupelets Project has been commissioned to deal with the problems of representing some atomic structures that aggregate into the molecular structures of documents.
