Diablog: Kimbrough and Ness

By Steven Orla Kimbrough and David Ness
Tuesday, February 26, 2002

Opening [2/26/02] Document Concepts [2/26/02]

Opening (SOK)

Intrigued by our recent conversations on blogs, I've decided to venture on an experiment. If you mind or object in any way, just give the word and I'll cease. No questions asked. (But I think you'll like this, and I think you'll agree it's a good thing to do.) Here then is the beginning of a diablog: a dialog blog between you and me. I'd like a single place to record our extended dialog. I'd like to "own" it for two reasons:

so I can do things with it, experiment, modify, and
so I can retrieve all of it when I want to.
I'm hoping you might want to make one yourself and link back to me, etc. Here is my first link to you: DNess at Antville and the second DNess Homepage at Comcast. Next, some further comments from me.

Document Concepts (SOK)

SOK: I want to record and credit you with two ideas/concepts (of course there are many others!), which I'll express in my own way (to see what you think). First, a document is a combination of text-Drupelets and structuring thereof. Not: text-Drupelets + structuring = document, but document = structuring(text-Drupelets). Indeed, a pregnant idea as I hope will be shown in the sequel (but not today). Second, your concept for the text-Drupelets has them consisting of three things:

A Drupelet of ordinary text about one paragraph long (possibly with blank lines separating subDrupelets, possibly with allowance for specialized markup);
An associated title, and
Metadata as appropriate, e.g., some sort of unique ID for the text Drupelet.
Date and author are also obvious possibilities. The idea, I think, is that there is at least a unique ID of some sort, and everything else is optional and very open. A text database is organized around storing and retrieving such text-Drupelets.
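
A minimal sketch of such a record in Python may help fix ideas. The field names (`drupelet_id`, `title`, `text`, `meta`), the use of a random UUID as the unique ID, and the `DrupeletStore` class are all assumptions made for illustration; none of them comes from an existing implementation.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class TextDrupelet:
    """One text-Drupelet: only the ID is required; everything else is optional."""
    drupelet_id: str = field(default_factory=lambda: uuid4().hex)  # hypothetical ID scheme
    title: str = ""
    text: str = ""
    meta: dict = field(default_factory=dict)  # open-ended: date, author, ...


class DrupeletStore:
    """A toy in-memory text database keyed by Drupelet ID."""

    def __init__(self):
        self._items = {}

    def put(self, drupelet):
        """Store a Drupelet and return its ID."""
        self._items[drupelet.drupelet_id] = drupelet
        return drupelet.drupelet_id

    def get(self, drupelet_id):
        """Retrieve a Drupelet by ID (raises KeyError if it is unknown)."""
        return self._items[drupelet_id]
```

Keeping everything but the ID optional mirrors the "very open" character of the metadata; a real store would of course persist to disk rather than to a dictionary.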

In play is the idea of seeing how these ideas (among others) might be played out with modern Web technology. Might we build a very useful and powerful text database management system? Might we use it to feed new kinds of editors and retrieval & exploration tools? Seems promising. My thought is to think, write, and brainstorm about it using the kinds of technologies we have in mind. More on that later.

Comment (DN)

2/26/2002 10:57 PM I have several comments, all favorable.

On Text Chunks

I like Text Drupelet so (for at least the short run) I'll change my references from `TextBits' to `TextDrupelets'.

Current Representation

On the issue of representation, my current implementations also have a concept of level that relates to the text-Drupelets. This is perhaps something to get out of (gracefully?) but it has been of some particular use when outlines are used as the mainstay of the text structure. Perhaps this is a simple misparsing of the problem on my part---having dealt so far largely with outline structures, where level (both absolute and relative) is a central concept.

The Role of Composition

I heartily agree that the functional form is better than the (pseudo-) additive form implicit in my earlier remarks. However, I wonder if a still better representation might not be Document=Compose(Structure,Text-Drupelets) to emphasize the separation of the structure, which we might want to explicitly represent as an entity, from the process that composes the structure and the text into a document. I would note, here, that Compose probably introduces the notion of Target, the language that is the representation for the document itself. The domain of Target is certainly at least

HTML; and
TeX
but we may want to broaden that right from the beginning.
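
The functional form with an explicit Target can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the structure is taken to be a list of (level, ID) pairs, the Drupelets a dictionary of (title, text) pairs, and the TeX output uses LaTeX-style sectioning commands without escaping special characters.

```python
from html import escape


def compose(structure, drupelets, target="html"):
    """Document = Compose(Structure, Text-Drupelets), rendered for a Target.

    structure: list of (level, drupelet_id) pairs in document order (assumed form).
    drupelets: dict mapping drupelet_id -> (title, text).
    """
    parts = []
    for level, drupelet_id in structure:
        title, text = drupelets[drupelet_id]
        if target == "html":
            parts.append(f"<h{level}>{escape(title)}</h{level}>")
            parts.append(f"<p>{escape(text)}</p>")
        elif target == "tex":
            # Only three heading depths are mapped; TeX special characters
            # in the text are NOT escaped in this sketch.
            command = {1: "section", 2: "subsection"}.get(level, "subsubsection")
            parts.append(f"\\{command}{{{title}}}")
            parts.append(text)
        else:
            raise ValueError(f"unsupported target: {target}")
    return "\n".join(parts)
```

The same structure and the same Drupelets yield two different documents depending only on the target argument, which is the separation the notation is meant to emphasize.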

Comment (SOK)

Comments on Ness's Comments

Basically I agree with all the points. In particular, the point that the functional form requires another argument (for style or mode of composition) is precisely the next point I was going to make. The wording and notation Ness uses, however, is better than what I was going to use, so let's stick with what's up there.

Three other brief points (much more to follow):

Besides HTML and TeX as targets, clearly XML has to be included, if not immediately then at least in the intermediate term.

How does this connect with the SGML/XML view of the world? As I see it, Ness made a key move in addressing, focusing on, (what we now call) text Drupelets. These are molecules, which are more or less directly composable into documents. SGML and XML focus on a finer-grained view of the world; atoms or atomic particles. Nothing wrong with that. The betting here, however, is that the present text-Drupelet focus will prove more immediately felicitous. Goldilocks: not too hot, not too cold, just right.

Hierarchy. Fine, but isn't that best seen as an optional feature held in the metadata?

Adding an Outline Process (DN)

2/26/2002 11:27 PM

This iteration (in the arcane terminology of Blogs it's called push-back) tries to do two things:

comment on SOK's start of the Diablog; and
explain what I have done in a superficial / overview kind of way.

Conversion to .TDB

I took the SOK original and put it into my .TDB format. The original document was pretty much contained in the first three entries of that DataBase. I then created a three line .CSV file to represent the outline for the document.

Putting the original into .TDB form wasn't difficult. Since there wasn't much font markup (italics and bold), a source text extraction from the original document was adequate to do most of the work.

All that needed to be done at this point was to place some TextDrupelet markers.

To Vault

This was then processed by already existing software into VAULT format. VAULT is a very simple program that makes handling outlines and text Drupelets very easy.

Adding Markup

In this form I then went through the document and converted what little markup there was into my generic form. This isn't because I have some particular commitment to my forms, but rather because they are at least a little more general than putting the text in with direct HTML markup or some other grounded form.

Adding Commentary and New Text

I then added both my commentary to SOK's original, and a new section explaining what I had done.

Conversion Back

The final task will be to convert back out from VAULT format to .TDB format for mailing back to SOK. I will also attempt to add the current state of things to my Web Page---which at this stage will represent a little hand work (but only a little, I believe).

Cleaning Some Code (DN)

2/27/2002 3:32 PM

The insight of the day was to recognize that one of the properties of the TextDrupelets is their Markup Language. This simple notion clears up some of the confusions that have been troubling me as we dig into this, and I am surprised that so many of the things that were blocking implementations seem to clear up if this view is made more explicit.

Source of Text

The fundamental bifurcation of the world with respect to text is that which describes where it comes from. Text comes from places:

Under our control:

  generated by hand; or
  generated by computer;

Not under our control.

When text is generated by hand then we often have quite a free decision about what original source form of markup is to be used. This raises the question of the input capture editor, which can range from something quite ASCII oriented all the way to some WYSIWYG facilities.

Even if we choose an ASCII orientation, we still have a choice about whether we are going to directly enter commands in something like HTML, or rather use some other language (DN Markup, for example) to handle them.

For text that comes in from somewhere else, we generally have little say. If, for example, it arrives as a .JPG or an ill-conceived .PDF file, then we may not be able to manipulate the actual text at all. If, on the other hand, it arrives as HTML we may be able to do much better deconstructing it into some useful form for longer run storage.

Markup Domains

There are currently five forms of markup in the domain of our interest:

HTML;
SGML;
XML;
TeX; and
DN Markup.

An open question (for me at least) is whether SGML and XML are both separable from HTML on our list of interest. From his remarks I think SOK thinks so, and I probably do too.

Externally Supported

HTML is the best known form of markup in use on the Web. This markup is used to display most of the text that we read on computer screens that are operating on the Net.

TeX is an important form of markup in use for printed material. A lot of text is written in LaTeX, a TeX correlate, and text in this form can generally be rendered easily in PostScript, PDF and other encodings which are particularly common in the hard-copy printing business.

AFT is Almost Free Text and is discussed in some detail at Todd Coram's web page. It is, in many ways, closely related to:

REBOL make-doc, which is some code written in REBOL and available on REBOL's site.

Within the Project

DN Markup is my set of conventions designed to deal with text which might reasonably eventually want to be rendered in either HTML, TeX or some of the other markup forms. It is generally the form I use to write original material.

CIN Markup is a closely related form to DN Markup, but it contains a representation for hierarchy that is not present in basic DN.

Markup Conversion Processes

The development of markup conversion processes is an interesting subject in itself. Markup languages have the property of being more or less rich, where richness may be thought of as opportunities for alternative forms of complex expression.

The richer a markup language is, the more we may be able to say about how things look. However, this, in and of itself, is not necessarily a good thing. Rich expressions require that the rendering technology be complex enough to render the richness, and that all of the environments through which we pass on the way to getting the image rendered are at least able to cope with passing the requests on in a significant way.

So, what we probably need is a markup language that provides support which is only about as rich as the problems that you are working on demand. Years of experience with TeX have allowed me to realize that having a very powerful expressive language at your disposal may be as much of a distraction as it is a source of benefit.

Irrelevant Sidebar

It might be worth noting that the discussion above is reasonably likely to be a good example of one which might well be useful in some other document in addition to this one. So having the paragraphs filed away in some useful and accessible way would be potentially productive.

Other Forms, Other Blogs (DN)

2/27/2002 3:37 PM

I am still unclear about how to handle the push back in this document, but---since our overall topic has something to do with how ideas are communicated---this very question is probably a reasonable one to discuss here as well.

The `Form' of the Diablog

So far I am assuming that it is appropriate to initial levels of the hierarchy of the outline for a document to establish inherited authorship. Thus if a particular level of outline is initialled, all sections below that are assumed to be authored by the author of the higher level until one is encountered with other initials. Then the same recursive rule applies.

Similarly for time. The technology that I currently use to manage the original source of these documents does not make it easy to include `time' in the outline, so I have the habit of beginning each major writing effort with a date time stamp. However, I don't really expect this to be very important, and if it is missing I wouldn't make much of it.
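
The inheritance rule for authorship can be sketched as a single pass over the outline with a stack of ancestors. The (level, title, initials) layout of an outline entry is an assumption made for this example, not the actual format of the outline files.

```python
def resolve_authors(outline):
    """Return the effective author of each outline entry.

    outline: list of (level, title, initials) in document order, where
    initials is None when the entry is not initialled (assumed layout).
    """
    stack = []      # (level, author) for the current chain of ancestors
    resolved = []
    for level, title, initials in outline:
        # Leave any sections that are not ancestors of this entry.
        while stack and stack[-1][0] >= level:
            stack.pop()
        # Own initials win; otherwise inherit from the nearest ancestor.
        author = initials if initials else (stack[-1][1] if stack else None)
        stack.append((level, author))
        resolved.append((title, author))
    return resolved
```

Because the stack is unwound whenever the outline comes back out, a section that follows an initialled sibling reverts to the author of its own parent, which is the recursive behavior described above.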

The `Structure' of the Diablog

There are some alternative base forms for the Diablog. Among them are:

Diary;
Wiki; and
Diary with `Snips'.
Each of these is worth consideration.

Chronological Blog - Diary

The Diary Blog is strictly temporal in structure, usually with the most recent entries occurring at the top---the conventional access point for a Blog.

Wiki

The Wiki is a linked structure of `pages'. Each page deals pretty much with one idea. References to other pages are enforced by some mechanism. For conventional Wikis this is an odd naming convention which allows indexed items to be easily recognized.
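
As an illustration of how a naming convention lets indexed items be recognized mechanically, here is a small sketch that assumes the convention is the run-together capitalized form (as in TextDrupelet). The pattern and the function name are choices made for this example only.

```python
import re

# Two or more capitalized word parts run together, e.g. TextDrupelet.
WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")


def find_wiki_words(text):
    """Return the distinct WikiWords in text, in order of first appearance."""
    seen = []
    for word in WIKI_WORD.findall(text):
        if word not in seen:
            seen.append(word)
    return seen
```

Ordinary capitalized words and all-capital acronyms such as HTML do not match, so only names written in the special form become page references.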

Diary with Snips

An alternative form, currently supported by systems like Vanilla, is generally a Diary but with some support for the automatic indexing that we would usually associate with a Wiki.

Consideration of Alternatives

There is a certain flow to a Diablog that simple Wikis would not easily capture. This is probably particularly important because in an ongoing project the time flow is particularly significant, and knowing what is new is central to keeping things moving forward.

In a Wiki, or even a structure like this document, it can be difficult to know what is new. For documents that are going to be displayed in a computer based medium, this may not be as serious a problem as it is when we are dealing with paper. This is because we can automate things like lists of recently edited parts of the document.

On the other hand, the gradual organization of material into connected thoughts is also an important part of the process of developing the material. This suggests that some form of capability to collect information by topic might be useful. There are at least three possible types:

unindexed Blogs;
hand-indexed Blogs; and
auto-indexed Blogs.

Antville Type Blogs

Antville is a nice example of a blog structure that has many desirable properties. For example, it provides for a nice simple structure to manage the information in each article that is in the blog. It also automatically provides recently edited indications so that it is easy to see what material has been recently modified, even if it is somewhat scattered throughout the document.

What isn't so easily done with Antville, at least out of the box, is the representation of any hierarchy associated with the ideas that are present. There is also no natural representation beyond a high level `break' into a topics field. Beneath that there is simply another time-sequenced blog.

Cross-referencing is also non-trivial. One can accomplish cross-references by combining the object structure that Antville supports with the linking facilities that are built into HTML.

Vanilla Blogs

Vanilla is another form of Diary-oriented Blog. What it brings to the table that is somewhat different from Antville is the ability to mark up the text in the Blog to create snips that can easily be cross-referenced to one another.

Auto-indexed Blogs

There is another alternative, but I don't know of any widely available Blog technology that supports it. This is a structure much like that supported by Vanilla but with an automated indexing of certain terms.
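
What automated indexing of certain terms might amount to can be sketched as follows. The inputs (a dictionary of entry titles to entry text, and a list of index terms) and the whole-word, case-insensitive matching rule are assumptions for illustration; no existing Blog software is being described.

```python
import re
from collections import defaultdict


def build_index(entries, terms):
    """Map each index term to the titles of the entries that mention it.

    entries: dict of entry title -> entry text; terms: iterable of index terms.
    Matching is case-insensitive and on whole words only.
    """
    index = defaultdict(list)
    for term in terms:
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        for title, text in entries.items():
            if pattern.search(text):
                index[term].append(title)
    return dict(index)
```

Unlike the snip approach, nothing in the text has to be marked up by hand; the price is that the list of terms has to be maintained somewhere.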

The Experiment

This section contains reports on the experiments which are currently in process. They consist of:

Our Own Dog Food

Our Own Dog Food

It is often suggested that any good programmer or program designer should spend some time eating their own dog food. The essential idea, here, is that any developing technology should be applied, if it can be, by those who are developing the technology. This is likely to give the developers / designers / workers a healthy respect for the problems that the ultimate end users are likely to encounter.

Another example of this approach was (and maybe still is) used in packing parachutes. During the Second World War it was not uncommon for those packing the parachutes to have to occasionally use some random sample of their own output for their own jump. The idea was that this would properly focus their energy during the packing process on doing a good job.

In the case at hand, we are using the technology that we are developing for managing this document itself. So far it has already taught us some lessons and, even more importantly, raised some questions about how some things should be handled.

Relative vs. Absolute Outlines

The initial technology brought into the product already had a developed notion of the relationship between a .CSV and .TDB file. In this formulation the .CSV file contained the important hierarchical information about the outline.

This information was stored in the .CSV file in two forms:

relative position of the title; and
an explicit level number.

This redundancy affords us the opportunity to use the explicit level number for relative level indication as well as (in the original implementation) absolute level number.

We will experiment with this implementation by providing an option where the explicit level number is indicated either by:

an absolute integer; or
a signed integer indicating relative level change.
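
A sketch of how the two encodings could be read back into absolute levels is given here. The string encoding (an unsigned number for an absolute level, a leading + or - for a relative change) is an assumption; the actual .CSV layout is not reproduced.

```python
def to_absolute_levels(entries, start_level=1):
    """Convert a mix of absolute and signed relative levels to absolute levels.

    entries: level strings such as "2" (absolute) or "+1" / "-2" (relative
    change from the previous entry); this encoding is an assumption.
    """
    levels = []
    current = start_level
    for entry in entries:
        entry = entry.strip()
        if entry[0] in "+-":
            current += int(entry)   # signed integer: relative change
        else:
            current = int(entry)    # unsigned integer: absolute level
        if current < 1:
            raise ValueError(f"level dropped below 1 at entry {entry!r}")
        levels.append(current)
    return levels
```

One consequence of this encoding is that staying at the same level has to be written as +0, since a bare 0 would be read as an absolute level.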

[Added 3/3/2002 1:29 AM] I think this discussion might be a mistake. A little experimentation with relative outlines suggested that managing an outline relatively isn't actually very easy. It would seem to be much easier to accomplish what is likely needed by cutting and pasting in the VAULT form of the data.

Standards

As Standards for the project develop, we need some place to record them. This is that place.

Storage Formats

The project supports a number of different file formats.

.VLT
.TDB
.CSV
.CIN
.HTM via Rebol's make-doc

Elements

There are some atomic elements that this project deals with.

Elemental Markup

The basic elements that can be marked up are:

Italics, used for marking emphasis in the text;
Bold, often also used for emphasis or for marking some particularly relevant nouns (names of companies in a financial discussion, names of programs in a computer discussion, etc.); and
Code, although nothing about code markup has yet been implemented in any of our environments.

Skins

Skins deal with the presentation of information on a screen. The presence of a computer in the live distribution channel makes the concept of applying a skin in the presentation process a real and effective one.

Fonts

Fonts are a very complicated problem. There are issues that deal with

font availability; and
character representation.

Process Descriptions

It is useful to have a language to describe the process by which we put a document together.

Other Document Considerations

There are other considerations in producing documents.

Module Names

When possible and meaningful, modules are named with conventional names of the form XXX2YYY for a process that converts files with extension .XXX into files with extension .YYY. Thus the CIN2HTM module translates .CIN files into .HTM files.

This convention does not work well when there is more than one file that serves as input to a process. For example, there are situations where both the .CSV and .TDB files are input. In this case we will generally choose to use the name of the most meaningful of the files, if that is easy to identify.

A slightly less good alternative is to use an unreal name (for example, OUT to represent CSV+TDB). This convention remains under consideration and is not yet built into any code.

Tasks in Process

Some tasks are already in process:

Outline Management; and
Deconstructors.

Outline Management

The outline manager allows size information to be passed into the outline, particularly so that the outline viewer will have size information on display by section.

Deconstructors

Several deconstructors need to be written. So far the list is:

CIN to Outline.

CIN to TDB Deconstructor

A .CIN to .CSV, .TDB deconstructor has been written without much difficulty.

Outline Size Program

The OUTLIN program manages the addition (and deletion) of size information in an outline.

The simple function of this program is to use the size information that is present in the TDB header to append a size notation to the lines of the outline for a document.

The added information has a straightforward form: [#1+#2]. This represents the fact that the text item corresponding to the header contains #1 characters, and all of the items below the current entry contribute #2 characters in addition.

An alternative form is available. The character counts can be shown as [#1+#2=#3], where the last number is the total number of characters in the current section plus those in sections below.
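
The computation behind the two numbers can be sketched as follows. This is not the OUTLIN program itself: the (level, title, character count) layout of an outline entry is assumed, and the counts are supplied directly rather than read from a TDB header.

```python
def annotate_sizes(outline, show_total=False):
    """Append a size note (own count + count below, optionally the total) to each title.

    outline: list of (level, title, own_chars) in document order (assumed layout).
    The first number is the entry's own character count; the second is the
    sum over all entries nested below it.
    """
    lines = []
    for i, (level, title, own) in enumerate(outline):
        below = 0
        for sub_level, _, sub_own in outline[i + 1:]:
            if sub_level <= level:
                break               # reached a sibling or a higher level
            below += sub_own
        note = f"[{own}+{below}={own + below}]" if show_total else f"[{own}+{below}]"
        lines.append(f"{title} {note}")
    return lines
```

An entry with nothing beneath it simply gets a second number of 0, so every line of the outline carries a note of the same shape.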

Format of Outline Entries

The Outline files conform to their usual format. However, they have the additional property, generated by this program, of having the [#+#] entries added.

OUTLIN Operation

Executing OUTLIN File Switch uses the information in File.TDB to update the entries in File.CSV to their proper (according to the .TDB file) values.

This is done if Switch is anything other than D (including null). If the switch value of D is given, then the size information is removed from the lines if it is currently there.

The Switch may also contain an S if the [#1+#2=#3] form of the outline results are desired.
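
The switch behavior for a single outline line can be sketched like this. The function name, the exact-match test for D, and the placement of the size note at the end of the line are assumptions; the real program may well differ in these details.

```python
import re

# Matches a trailing size note such as " [120+45]" or " [120+45=165]".
SIZE_NOTE = re.compile(r"\s*\[\d+\+\d+(?:=\d+)?\]\s*$")


def update_line(line, own, below, switch=""):
    """Apply the described switch behaviour to one outline line."""
    bare = SIZE_NOTE.sub("", line)          # drop any existing size note
    if switch == "D":
        return bare                         # D: remove size information
    if "S" in switch:
        return f"{bare} [{own}+{below}={own + below}]"   # S: include the total
    return f"{bare} [{own}+{below}]"        # default, including a null switch
```

Stripping any existing note before appending a new one makes the update repeatable: running it twice on the same line gives the same result as running it once.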

Document Log

Some elementary software has been constructed to start to build a document `log' of auxiliary information about a document.

This software currently only logs file size and date, but it will be expanded to take on whatever we find useful as the project evolves.

The DOCLOG Program

The DOCLOG program manages the DFX files that are the history, log and general control file for the document.

Format of DFX Files

DFX files consist of lines that are generally of the structure Key: Value.

Document: 3RD The base name of the document.

Title: The Third Degree The title of the document.

Principal: Ness The name of the principal author.

Seconds: Ness The name of the secondary author, or a (comma separated) list of authors.

Source: D:\DB\3RD.CIN The principal Source file that holds the document. This can be a (comma separated) list of source file locations. [Sidebar: In the current implementation, the disk drive is a conventional drive letter, assumed to be reachable from the machine that houses the .DFX file. It is possible that this should be changed to be a logical name instead.]

Type: Outline The type of the document file. At the moment the supported types are:

Outline

Markup: DN The type of Markup used in the source file. At the moment the supported markups are:

CIN
HTML

Level: 3 For an outline file, the base level of the outline.

#Hist: 0 The number of the last history entry in the .DFX file. The numbering (for better or worse) follows Perl's convention.

Hist[0]: VAIOD|\DB|10780|2002-03-02 0218 UTC| A history entry. The fields are delimited by the vertical bar, and the fields are

Logical Disk Name;
Directory (on that disk);
Size; and
Date and Time of the file.

#Log: 0 The number of the last log entry in the .DFX file.

Log[0]: 2002-03-06 1353 EST|Consolidate 3RD Source Documents into 3RD.ZIP| A log entry. It consists of:
the Date and Time; and
Logging Text.
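
Reading such a file can be sketched in a few lines. The sketch assumes that each line holds only the Key: Value pair (the descriptions above are commentary, not file content) and that the key ends at the first colon followed by a space; the function names and the dictionary keys returned by `parse_hist` are hypothetical.

```python
def parse_dfx(lines):
    """Parse 'Key: Value' lines into a dict, splitting only on the first ': '."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                # skip blank lines
        key, _, value = line.partition(": ")
        record[key] = value
    return record


def parse_hist(value):
    """Split a Hist[n] value into its four vertical-bar-delimited fields."""
    disk, directory, size, stamp = value.rstrip("|").split("|")
    return {"disk": disk, "directory": directory, "size": int(size), "stamp": stamp}
```

Splitting on the first colon-plus-space keeps a value such as D:\DB\3RD.CIN intact, since the colon after the drive letter is not followed by a space.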

DOCLOG Operation

H Function This function causes a `snapshot' of the source file (size, date and time) to be logged. This History function is incremental, and previous Hist[n] items are retained.

L "Text" This function logs an entry of the Text at the current date and time. This Log function is incremental and previous values of Log[n] items are retained.

Running DOCLOG File Cmd1 ... performs the indicated commands reading from File.DFX and (currently) writing into File.NFX.
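
The incremental character of the L function can be sketched on a DFX record held as a dictionary of keys to values. This is an illustration under assumptions, not the DOCLOG code: the function name is invented, the time zone label is omitted from the stamp, and a record with no #Log key is treated as having no entries yet.

```python
from datetime import datetime


def add_log_entry(record, text, now=None):
    """Append Log[n] to a parsed DFX record (a dict) and bump the #Log counter.

    Earlier Log[n] items are retained. '#Log' holds the index of the last
    entry, so a record with no '#Log' key is treated as having none (-1).
    """
    now = now or datetime.now()
    last = int(record.get("#Log", -1))
    stamp = now.strftime("%Y-%m-%d %H%M")   # time zone label omitted in this sketch
    record[f"Log[{last + 1}]"] = f"{stamp}|{text}|"
    record["#Log"] = str(last + 1)
    return record
```

The H function would follow the same pattern with Hist[n] and #Hist, taking the size and date from the source file instead of from the command line.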

Other Questions

In the process of working with all of this, questions will arise. This is a (dynamic) list of some of those that are currently unresolved.

Strict Outline Hierarchies

Conventional outline hierarchies have one strange aspect. We can sensibly go in only one level at a time, but we can come out any number of levels. In free flowing text this is usually indicated by the fact that levels of hierarchy are shown by some typographic artifice that suggests the hierarchy (the larger the type the higher the level of the item, for example). It is conventional for paragraphs to imply a continuation of the same item unless the typography suggests otherwise.

When doing long quotes, though, other typographic conventions apply. Long quotes are often shown by indentation, and in such cases we have a clear typographic indication when the quote ends that we are returning to the level that we were in before the quote commenced.

There needs to be some discussion of the role of typography in showing the structure of ideas.
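
The one-level-in, any-levels-out property noted at the start of this section can at least be checked mechanically. A minimal sketch, assuming an outline is given as a plain list of level numbers with the document root counted as level 0:

```python
def check_strict_outline(levels):
    """Return indexes where an outline goes in more than one level at once.

    Coming out by any number of levels is allowed; going in is limited to one.
    """
    problems = []
    previous = 0                    # treat the document root as level 0
    for i, level in enumerate(levels):
        if level > previous + 1:
            problems.append(i)
        previous = level
    return problems
```

An indented long quote would show up here as an ordinary step in by one level followed by a step back out, which is one way of seeing why the typographic convention for quotes fits the outline model.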

Word Counts / Character Counts

There are situations where it is nice to be able to get some kind of indication of how much text has been written. This is particularly true when looking at an outline for a document that is under development. Should this information be stored, or always generated on the fly?

It is worth noting that, as things already stand, the size of text fragments is directly computable from the header to .TDB files. These headers contain entries that mark the character position of the beginning and ending characters of any fragment, and so by just computing the difference between them we get a count of the number of characters in the fragment.

Object Markup

We need to experiment with object markup in our text. The basic notion is that objects might be given some form of logical markup that might have more than just typographical consequences.

Of course, this would raise the question of when the typography issues would get resolved in the chain of document production tasks.

Are Comments Special?

Some recent work, as well as some national controversy, raises an interesting issue about comments and source material. The question is whether such materials should have a special kind of role in our information complex.

It is clear that both of these types of items could just be treated as conventional material, but some of the recent scandals associated with the plagiarism of source material suggest that it is quite easy for historical tracks to get lost and for external source material to unintentionally become incorporated into documents without the appropriate quoting and referencing.

Incoming from SOK 2002-03-01 (SOK)

Google Compute + Misc.

Date: Fri, 01 Mar 2002 08:53:04 -0500
From: David Ness
To: Kimbrough
Subject: Have you seen . Ref to Doc It looks (vaguely) relevant.

SOK Comments

It is fascinating in any event. Cool. It's an idea that has been floating around a long time. (At least a couple of years in a practical sense.) SETI guys. Also, there have been commercialization ideas like this. Am continually impressed with the innovations coming out of Google.

On blogs, wikis and text databases generally, during the last few days I've gone from intrigued to seriously wanting them NOW. On various projects I'd like to set up a blog with controlled access and even within that further controls. E.g., everyone on the project can see most things, but individuals can put in documents that only they and I can see. Subgroups. In principle no problem on Unix. Am keen to do a first-cut design document. Unfortunately, as it were, I'm off to London for a week on March 8. Can take notes during that time. Let's book some serious brainstorming sessions after March 18.

David Ness' summary of work can be found at http://mywebpages.comcast.net/dness and Steven O. Kimbrough's can be found at http://opim.wharton.upenn.edu/~sok/cv/cv.html
This Diablog began on 26 February 2002.
