Text Data Bases

By David Ness
Tuesday, January 29, 2002

This paper is in an early draft state. It should not be construed as a commitment yet, although some code has been written to support the structures that are presented.

Creating a Text Data Base

0. Introduction

10/19/2001 3:00 AM

This note describes a structure used for storing and manipulating text. In particular, the data here is structurally linked with the concept of a data Vault, as implemented by the Vault program (http://www.personalmicrocosms.com).

This program allows a particularly convenient visual interface to the kind of text data that is represented here.

1. Key Structural Element

The key structural element treated by this system is the NoteCard or Titled Paragraph(s), which consists of:

a title; and
some paragraphs of text.

We will (hopefully consistently) use NoteCard to represent this data structure, but notice that this does not properly carry with it the connotation of hierarchical structure which is well represented in the system here.

It remains to be investigated whether we want to support `invisible' titles, which exist but are not shown when the corresponding paragraphs are displayed.

2. Related Problems

Parts of the technology presented here are also used in dealing with other text problems. For that reason some of the facilities presented have a somewhat more general character than is really needed for the problem of representing information that is stored in Vaults.

3. Where Problem Arises

The problem arises in almost all forms of text management and processing.

4. Structure of Items

The items in a Vault can be `blown up' into two files. One of these files contains the structural information that represents the organization of the items in the Vault. The other file contains the base text that actually constitutes the body of the information in the Vault.

There is some purposeful redundancy in these files. This redundancy is created when a Vault is blown into its component parts. For the most part, however, the routines that re-constitute a Vault from its component parts do not demand that this redundancy be consistent when a Vault is `put back together'.

4.1. The Structure

The structure file describes the structure of the text data. It consists of one line for each NoteCard. Each line consists of some `comma separated' elements that represent three things:

the `title' of the NoteCard;
the level of the item; and
a `reference' to the NoteCard text if there is any.

The lines might look like this:

"Creating a Text Data Base",,,0,{ $Lab0},
,"Key Structural Element",,1,{ $Lab1},
,"Where Problem Arises",,1,{ $Lab2},
,"Structure of Items",,1,,
,,"The Structure",2,{ $Lab3},
,,"The Text",2,,
,"Characteristics of Structure",,1,,

The structure file is generally given an extension of .CSV because it has a conventional comma separated variable structure.

4.1.1. The Title The title portion of the structure line consists of the quoted title of the NoteCard, located at the appropriate place in the hierarchy by placing it between the commas that represent each level. For example, if a particular database has three levels then there will be three commas. A level 0 item would appear before the first comma. A level 1 item would appear between the first and second commas. A level 2 item would appear between the second and third commas.

The title itself is always enclosed in double quotes, so any commas that appear in the title are not confused with the level-indicating commas, which occur only outside of the quoted part.
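The placement rule above can be sketched in a few lines of Python. This is only an illustration, not the Vault program's own code: the function name format_structure_line and its parameters are hypothetical, and the doubling of embedded double quotes is the usual CSV convention, assumed here because this note does not say how such titles are handled.

```python
def format_structure_line(title, level, label=None, levels=3):
    """Build one structure-file line for a NoteCard (illustrative sketch)."""
    if not 0 <= level < levels:
        raise ValueError("level must be between 0 and levels - 1")
    # Assumption: embedded double quotes are doubled, as in ordinary CSV.
    quoted = '"' + title.replace('"', '""') + '"'
    # One title column per level; only the column for this level is filled.
    cells = [""] * levels
    cells[level] = quoted
    # A null reference is written as an empty field.
    reference = "{ $" + label + "}" if label else ""
    # Title columns, then the level, then the reference, then a closing comma.
    return ",".join(cells + [str(level), reference]) + ","


print(format_structure_line("Creating a Text Data Base", 0, "Lab0"))
print(format_structure_line("The Text", 2))
```

With levels=3 the two calls reproduce the first and sixth lines of the sample structure file.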

4.1.2. The Level The level indication is simply an integer. This field follows the title portion of the Structure line. It is followed by another comma.

The level could, of course, be inferred from the relative position of the title, but there are times when it is convenient to have direct access to the level information without having to make any inference.

4.1.3. The Reference The reference section follows the level indication. It is followed by a comma that also ends the structure line.

If there is no text in the section, the reference is null.

If there is text, it is indicated by a label reference which consists of a label surrounded by curly braces. The label must appear in the Base Text File that corresponds to the structure file containing the reference.

While label references actually have a somewhat more general structure, that generalization is not used in the context of Vault files, as all of the references are necessarily within the Base Text File.
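Reading a structure file back can be sketched with Python's standard csv module, which already understands quoted fields containing commas. The names StructureEntry and parse_structure are hypothetical. The sketch assumes every line ends with the closing comma described above, so that the number of title columns can be taken from the row itself, and it trusts the explicit level field rather than the position of the title, since the redundancy need not be consistent.

```python
import csv
import io
import re
from typing import NamedTuple, Optional


class StructureEntry(NamedTuple):
    title: str
    level: int
    label: Optional[str]  # None when the NoteCard has no body text


# A Vault-style reference such as "{ $Lab0}": the file part is elided.
LABEL_REF = re.compile(r"^\{\s*\$(\w+)\}$")


def parse_structure(text):
    """Parse the contents of a structure (.CSV) file into entries."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        # Row layout: title columns..., level, reference, empty trailing cell.
        *titles, level_cell, reference, _trailing = row
        title = next((cell for cell in titles if cell), "")
        match = LABEL_REF.match(reference.strip())
        label = match.group(1) if match else None
        entries.append(StructureEntry(title, int(level_cell), label))
    return entries


sample = (
    '"Creating a Text Data Base",,,0,{ $Lab0},\n'
    ',"Structure of Items",,1,,\n'
    ',,"The Structure",2,{ $Lab3},\n'
)
for entry in parse_structure(sample):
    print(entry)
```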

4.2. The Base Text

The Base Text File contains the body of the information in a Vault. This file is designed with two principal aspects in mind.

First, the data in the file should be easily and quickly available to any generalized process that might want to access the body of information. This means that the domain of usefulness for the text might well go beyond just using the text in Vaults.

Second, because the notions of the base text file might be used in many different circumstances, it is important that the file be allowed to become quite large (at least millions of bytes) without causing particular problems.

The Base Text File consists of two parts, a header and a block of text items. The Base Text File may or may not be `well formed'. We will first describe a well-formed file and then later discuss why we need to be able to deal with Base Files that are not well formed, and what we can do about them.

In a well-formed Base Text File there will be as many lines in the header as there are items in the block of text that follows. In addition the `pointers' in the header must `properly' point to the text. This will be discussed in detail in a moment.

Base Text Files are conventionally given the extension .TDB (for Text Data Base).

4.2.1. Header Structure The purpose of the `header' is to provide a means of directly accessing information in a large Base Text File without actually passing through the entire file.

Base Text Files (.TDB) serve several purposes. The files should be editable, and should place as few constraints as possible on the editing process.

This is the principal reason that the header is not counted when determining character counts, either in header tables or in external references within other files. Not counting the header also avoids the `race condition' problem that would occur if we were to count it: unless we fix the width of the beginning and ending character counts, the size of the header changes the counts, which in turn changes the size of the header, ad nauseam.

By not counting the header it is also possible to:

keep it in a separate file if that would be desirable in some contexts;
append items at the end of the file without disturbing the header or, alternatively, by adding lines to the header without changing other references; and
use a program (BLDTDB) on a file at any point to produce a new valid header, if we detect that the header is incorrect.
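The second point can be illustrated with a small Python sketch. It is not the actual implementation: append_item is a hypothetical name, and the sketch assumes that character positions are zero-based, that they are counted from the first character after the header, that EndChar is one past the last character of the item, and that a text leader has the form {Label=Level=Title}. The header is held as a list of lines and the text block as a single string.

```python
def append_item(header, body, label, level, title, text):
    """Append one item; existing header lines are returned unchanged."""
    if not text.endswith("\n"):
        text += "\n"  # keep the next text leader at the left margin
    leader = "{%s=%d=%s}" % (label, level, title)
    # Positions are counted within the text block only, never the header,
    # so adding a header line cannot shift any existing position.
    beg = len(body) + len(leader)
    end = beg + len(text)
    new_line = "%s|%d|%d|%s|%d|" % (label, beg, end, title, level)
    return header + [new_line], body + leader + text


header, body = append_item([], "", "Lab0", 0, "Creating a Text Data Base", "\nHello\n")
header, body = append_item(header, body, "Lab1", 1, "Key Structural Element", "\nMore\n")
print("\n".join(header))
```

The first header line is the same before and after the second call, which is the property the list above relies on.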

A typical header line has five elements, each of which is separated from the next by a vertical bar (`|'). The five fields are:

A label;
A beginning character position (BegChar);
An ending character position (EndChar);
The title of the NoteCard; and
The level of the item.

It might look like this:

Lab0|34|128|Creating a Text Data Base|0|
Lab1|159|519|Key Structural Element|1|
Lab2|548|623|Where Problem Arises|1|
Lab3|645|665|The Structure|2|

In a well-formed TDB, there is one header line for each existing item. The actual text of an item (not including the label) begins at the character in the BegChar position of the file and ends at the EndChar position. Expository text may appear but has no specific function in generic .TDB files.

The character position is measured from the beginning of the label of the first item. This has the important characteristic that it makes the character positions in the header independent of the size of the header itself.

It should be noted that the header does not have entries for lines of the outline that do not have any body text. (****Verify This****)
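A header line can be taken apart, and an item fetched directly, with a sketch like the following. HeaderEntry, parse_header_line and item_text are hypothetical names. The slicing convention is an inference rather than a stated rule: in the sample header above, the gap between one item's EndChar and the next item's BegChar equals the length of a leader of the form {Label=Level=Title}, which suggests zero-based positions with EndChar one past the last character of the text.

```python
from typing import NamedTuple


class HeaderEntry(NamedTuple):
    label: str
    beg: int
    end: int
    title: str
    level: int


def parse_header_line(line):
    """Split 'Label|BegChar|EndChar|Title|Level|' into its five fields."""
    label, beg, end, rest = line.rstrip("\n").split("|", 3)
    # rest looks like 'Title|Level|'; splitting from the right lets a
    # vertical bar inside the title survive.
    title, level = rest.rstrip("|").rsplit("|", 1)
    return HeaderEntry(label, int(beg), int(end), title, int(level))


def item_text(body, entry):
    """Fetch an item directly; body is the text block without the header."""
    # Assumption: half-open range [BegChar, EndChar).
    return body[entry.beg:entry.end]


sample = [
    "Lab0|34|128|Creating a Text Data Base|0|",
    "Lab1|159|519|Key Structural Element|1|",
    "Lab2|548|623|Where Problem Arises|1|",
    "Lab3|645|665|The Structure|2|",
]
entries = [parse_header_line(line) for line in sample]
# Compare each gap with the length of the leader that would fill it.
for previous, current in zip(entries, entries[1:]):
    leader = "{%s=%d=%s}" % (current.label, current.level, current.title)
    print(current.label, current.beg - previous.end, len(leader))
```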

4.2.2. Text Item The text item consists of two parts:

a Text Leader; and
Text.

4.2.2.1. Text Leader The text leader looks like {Label=Level=Expository Text}. Each text leader marks the beginning of a new item. The marker is only directly used to parse the file when a header structure is being rebuilt. If the text leader is well-formed then the header can be completely recreated by a one-pass process through the data file. Other than for this purpose the actual content of the text leader is generally not used, because the header structure is consulted instead.

4.2.2.2. Text The text leader is followed by the `ultimate object' of the whole data structure, namely the actual body text of the item. This text simply runs on with its own line structure until the next text leader (or the end of the file) is encountered.

This requires that all text blocks end with an end of line. As a result text leaders always occur at the left margin in the current implementation.
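Rebuilding a header from the text leaders alone might be sketched as follows. This is not the BLDTDB program itself: build_header is a hypothetical name, the leader is assumed to carry the title as its expository text, and positions are assumed to be zero-based with EndChar one past the last character of the item. The input is the text block only, with any old header already removed.

```python
import re

# A text leader at the left margin: {Label=Level=Expository Text}
LEADER = re.compile(r"^\{([^=}]*)=(\d*)=([^}]*)\}", re.MULTILINE)


def build_header(body):
    """Recreate header lines from the text leaders found in body."""
    leaders = list(LEADER.finditer(body))
    lines = []
    for i, match in enumerate(leaders):
        label, level, title = match.groups()
        beg = match.end()  # the text starts right after the closing brace
        # The text runs to the next leader, or to the end of the file.
        end = leaders[i + 1].start() if i + 1 < len(leaders) else len(body)
        lines.append("%s|%d|%d|%s|%s|" % (label, beg, end, title, level))
    return lines


body = (
    "{Lab0=0=Creating a Text Data Base}\n" + "x" * 92 + "\n"
    "{Lab1=1=Key Structural Element}\nbody\n"
)
print("\n".join(build_header(body)))
```

Because only leaders at the left margin are recognized, a brace expression in the middle of a line of body text is not mistaken for the start of a new item.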

4.3. External References

The data in .TDBs can be referred to in many different kinds of `higher level' formats. As a result the references are of a very general format.

Full Forms are:

{File$Symbol}
{File:Beg-End}

When processed, these forms are removed from the source file and replaced by the appropriate characters from the text data base.

First, if File is elided, then the source of the text is determined by the convention of the specific circumstances.

If a File is ever mentioned explicitly it is `sticky' and applies until there is a subsequent file reference to some other file. In the context of Vault files the assumed file name is the name of the .TDB that corresponds to the Vault.

{ $Symbol} references select text in the data base according to the symbol definitions in the file. The references can be resolved either via the header table in the file or by searching the file itself if there is no valid header.
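The second route, searching the file itself when no valid header is available, could look roughly like this. The function name find_by_symbol is hypothetical, and the sketch assumes that text leaders have the form {Label=Level=Expository Text} and always start at the left margin.

```python
import re

# Any text leader at the left margin.
ANY_LEADER = re.compile(r"^\{[^=}]*=\d*=[^}]*\}", re.MULTILINE)


def find_by_symbol(body, symbol):
    """Return the text of the item labelled symbol, or None if absent."""
    this_leader = re.compile(
        r"^\{" + re.escape(symbol) + r"=\d*=[^}]*\}", re.MULTILINE
    )
    match = this_leader.search(body)
    if match is None:
        return None
    # The item runs until the next leader, or to the end of the file.
    following = ANY_LEADER.search(body, match.end())
    end = following.start() if following else len(body)
    return body[match.end():end]


body = "{Lab0=0=First}\nalpha\n{Lab1=1=Second}\nbeta\n"
print(repr(find_by_symbol(body, "Lab1")))
```

This scan passes through the file, so it gives up the direct access that the header table provides; it is only a fallback.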

{:Beg-End} references select text in the data base by referring to the character positions beginning and ending the referred string. Notice that the beginning and ending references are to string locations after the header. Thus they are independent of changes made to the header, up to, and including, complete removal of the whole header.

A `special' case is {File:0:-1} which can be used to `introduce' a file (that will serve, for example, as a default file name for later references) without introducing any characters into the processed version of the document.
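The reference forms and the `sticky' file rule might be processed along these lines. Everything named here is hypothetical: expand_references, the databases mapping (file name to a pair of a symbol table and the text after the header), and the default_file parameter standing in for `the convention of the specific circumstances'. The sketch also assumes that Beg-End is a zero-based range with End one past the last character, that file names contain no `$', `:' or braces, and that the introducing form is written literally as :0:-1.

```python
import re

# {File$Symbol}, {File:Beg-End}, or the introducing form {File:0:-1}
REFERENCE = re.compile(r"\{([^{}$:]*)(?:\$(\w+)|:(\d+)-(\d+)|(:0:-1))\}")


def expand_references(source, databases, default_file=None):
    """Replace each reference in source by the text it selects."""
    current = default_file

    def replace(match):
        nonlocal current
        file_name = match.group(1).strip()
        if file_name:
            current = file_name  # an explicit file is sticky
        if match.group(5):
            return ""  # introduces a file without adding characters
        symbols, body = databases[current]
        if match.group(2):
            beg, end = symbols[match.group(2)]
        else:
            beg, end = int(match.group(3)), int(match.group(4))
        return body[beg:end]

    return REFERENCE.sub(replace, source)


databases = {"notes.tdb": ({"Lab0": (6, 11)}, "{L=0=}hello world")}
print(expand_references("A {notes.tdb:0:-1}B { $Lab0} C {:12-17}", databases))
```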

4.4. ISW File Relationships

The header structure of a .TDB file corresponds to a common format of the .ISW file structure system, and can therefore be manipulated by all of the software that deals with such data files.

4.5. Ill-Formed Files

4.6. Assumptions

Lines with {.*=d*=.*} assumption

Well-formed headers

5. Characteristics of Structure

Structure allows manipulations of Text Body

6. Deconstructing a Vault

Need to expand outline completely.

7. Reconstructing a Vault

BLDTDB builds a header for a .TDB file. In doing so any previous header is discarded* and a new header is constructed.

*BLDTDB will be given a mode that will cause it to match a new header to the previous header so that changes in items can be noted.
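One way such a matching mode might work is sketched below. It is a guess at the idea, not a description of BLDTDB: diff_headers is a hypothetical name, and items are matched by label and compared by length, title and level rather than by position, so that an item is not reported as changed merely because an earlier item grew. An edit that leaves the length of an item unchanged cannot be seen from the headers alone.

```python
def diff_headers(old_lines, new_lines):
    """Compare two headers and report added, removed and changed labels."""

    def summarize(lines):
        table = {}
        for line in lines:
            label, beg, end, rest = line.split("|", 3)
            # Length rather than position, plus the title and level fields.
            table[label] = (int(end) - int(beg), rest)
        return table

    old, new = summarize(old_lines), summarize(new_lines)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old = ["Lab0|34|128|A|0|", "Lab1|159|519|B|1|"]
new = ["Lab0|34|130|A|0|", "Lab1|161|521|B|1|", "Lab2|550|600|C|1|"]
print(diff_headers(old, new))
```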

8. Vault Output via REBOL

REBOL

8.1. What is REBOL

8.2. REBOL's MakeDoc Facility

Give credit to MakeDoc Author

David Ness' summary of work can be found at http://mywebpages.comcast.net/dness
