David Ness
Mind / Matter

Going to L: The Other Side of K

By David Ness
Thursday, December 6, 2001

K is an interesting language. It is distinguished by its derivation, look, performance and effectiveness.

K is an important language in certain problem domains largely, but not exclusively, associated with the management of `tick' data bases of stock market data. Actually, problems are usually tackled with KDB, a data base management language built on a base of K, a general purpose computational language that serves as its underpinnings.

My use of K is in a totally different problem domain. I use it for lots of much more conventional programming tasks where it serves as sort of a character-oriented immediate interpretation language, useful for manipulating text data in many different situations.

Brief History

K is the outgrowth of a series of developments that occurred in the Wall Street community in the past decade or two. APL had a reasonable footprint in this community as many of those trained in financially oriented MBA programs in the 1970s and early 1980s had some APL training. Good as it was, thoughts about APL continued to evolve and three separate threads began to spin out from the core:

  • A+
  • J
  • K

Arthur Whitney, drawing on the experience of developing A+, has shifted his attention to K, the subject of this note. J has a rather different orientation, and represents the direct evolution of the thoughts of APL's inventor Ken Iverson.

In this note, we will focus our attention on K.

Further, we focus our attention on that part of K that is often neglected by most of the practitioners of The Art of K Programming, namely its use in situations other than those that relate to large scale data bases and financial transaction information.

Little Data Bases

The first problem we are going to use as an example here is that of managing small data bases. I have found these to be very useful for storing a wide variety of different kinds of information. The principal characteristic of these problems is that we want to be able to manage the data base very simply and straightforwardly, often by directly manipulating the whole data base in a text editor directly.

Flat Files

The data is stored in what are commonly called flat files. This kind of file contains data on a one item per line basis. It is therefore only appropriate for very simple, flat, data. That is to say data which doesn't have complex relationships or any depth structure. The flat files used here have a very simple format.

First, there are three types of lines. Comment Lines begin with equal signs and simply contain comments. They therefore do not represent any data. There can be as many comment lines as are desired, and they can appear anyplace in the file, but in most cases there is at most one comment line, and if it appears it is usually the first line and serves as a sort of label for the data in the file.

A second kind of line is a Format Line. This line begins with an asterisk (*) and consists of the names of fields, which will be described in a moment. Field names consist of alphanumeric characters only. A format line must occur in the file before any data lines appear, but there may be comment lines ahead of the first format line. In general our files will have only one format line in the file.

After the Format Line appears, there may be as many lines of data as are necessary. Each data line consists of a number of fields. The number of fields must exactly match the number of fields that appear in the format line. Each field is interpreted as being the value of the current item that corresponds to the field that occurs in the format line in the same order. It makes sense to describe this data as rectangular because the fields may be thought of as columns, and each item is represented as a row that contains the same number of columns (fields). Thus the data forms a rectangular array.

Each field is followed by a separator character. Here we will generally use the vertical bar (|). There are two other common choices, that of a comma (,) and that of tab. When the comma is used, the files are often called CSV ("Comma Separated Variable") files.

I chose to use vertical bar because the comma may well occur in many kinds of common data fields. For example, book titles often have commas as do things as varied as many French addresses ("4, ave. de l'Opera") and some names ("John F. Kennedy, Jr."). In most CSV files this problem is handled by surrounding any fields that contain commas with quote marks. The problem with this strategy is that it very substantially complicates the code needed to deal with the problem.

I also generally avoid using tab as a separator because white space can be handled by some editors in ways that may end up confusing the data. And blanks and tabs may get mixed in the process, leading to misinterpretation of the data. This generally makes it wise, I think, to avoid using tabs for this kind of purpose.

By using the vertical bar as a separator these problems are largely avoided. Admittedly we make the sacrifice of preventing the vertical bar from actually occurring within any real data item, but in most situations this is not a very serious constraint, as most conventional data sources makes scant use of the vertical bar as a real character.

Perhaps the first few lines of a file that represents the Presidential Administrations of the U.S. might be illustrative:

=PRES.ISW A List of the Presidents and Vice Presidents
*NBR|PRES|VICEPRES|BEGINT|ENDT|
1|George Washington|John Adams|1789|1793|
1|George Washington|John Adams|1793|1797|
2|John Adams|Thomas Jefferson|1797|1801|

Notice that in their standard form the fields are not padded in any way in some attempt to reconcile lengths.

Some Code

Now let's start to build up some code to deal with this. We'll build up the code in a directory called ISW (For historical reasons of little interest or relevance here, files of this type have been called ISW files, and that tradition will be continued here).

First, we create a function that can eliminate comments from any list of lines. It is the `trim equals' function:

 te:{x[&~"="=*:'x]} 

does the job. Let's see how. The *:'x applies the `first' monad (* is either `first' as a monad or `times' as a dyad, and *: says to use the monad) to each (the ') of the items in the argument (x). These characters are compared to the "=" character, and the `not' (~) and equals (=) create a binary vector with ones where the lines do not begin with an equal sign. If we apply the `where' monad (&) to this we get a list of the line numbers of the lines that don't start with equal signs. If we use this to subscript the argument array (x[...]) we get a result of all of the lines that don't begin with an equal sign.

Given this function, we can now read the whole data base with a rather terse function which still manages to demonstrate all kinds of technique:

\d ISW
sep: "|"
ld:{r: -1_' 1_'' {(&:'sep=/: x) _' x}
          sep,'te[0: :["." _lin x;x;x,".ISW"]]
    r[0;0]:1_ r[0;0]
    . (`$r[0]),',:'+1_ r}
\d ^

where we (unfortunately) had to split the first line of the ld function to be able to display it well. The load function is called with either a filename or a filename and extension as an argument. The first thing done in the function is a conditional (:["." _lin x;x;x,".ISW"]) which returns the argument (if an extension was explicitly specified) or appends a ".ISW" to the filename if it wasn't. Then it reads in the whole of the data base (the 0: operation does this). The te function which we just developed above is given the file content as an argument, so it returns all of the lines in the file which do not begin with equal signs.

The `join-each' (,') puts a separator character on the front of each line, and the result is then cut up by {(&:'sep=/: x) _' x}. This function works by making a list of all of the vertical bars that appear in each line, and then it chops up each line at those character positions. This leaves a list of lists each element of which begins with the separator character. The 1_'' trims the first character off of each field (the separator), and then the -1_' throws away the last field on each line, which happens to be a null field that is an artifact of the way we have done the cutting. All of this is then stored away, temporarily, as r.

The second line of the function has a very simple function. It removes the asterisk that is the first character of the first field of the format line.

The third line of the function really does a rather startling amount of work. After removing the first (format) line of the array (1_ r) the result is transposed and each item is enlisted. Then a symbol corresponding the the names that appear in the format statement is prepended to the lists and the entire result is converted into a dictionary. This dictionary is rhen returned as the value of the funtion. Let's see how this function might be used, assuming that it has been defined in a directory ISW.

Dat: ISW.ld["PRES"]

If we then type

Dat.PRES

we will get

("George Washington"
"George Washington"
"John Adams")

Notice that all of the data is loaded in the form of strings, so if we wanted to do any computation with the dates we would need to do something like:

.:'Dat.BEGINT

to get a numeric list of the beginning dates of the Presidential administrations.

Auxillary Information

There is some software that produces what is called a key file from a data base. The key file for (the full version of) the PRES.ISW data base looks like this:

*LONG|SHORT|SIZE|JUST|TYPE|
President No|NBR|2|R|I|
President|PRES|25|L|S|
Vice President|VPRES|24|L|S|
Begin Term|BEGINT|4|R|I|
End Term|ENDT|4|R|I|

The information in this file consists of five basic facts, and each data item corresponds to a `column' of the original data base. These facts correspond to the long name (print name) of the field, the short name (name in the .ISW file, for example) of the field. The maximum size of any current item, the default justification of the item and the type of the item (currently either S for string or I for integer).

By current convention, the size field is given a correct initial value when the file is created by the software that currently exists for this purpose. A particular program using or manipulating the data is not, however, under any obligation to keep this up to date. If a program wants to `count' on this value, it should either arrange to call a checking program which will validate the items in the file, or it should load the data and measure it itself. For stable data bases this obviously does not need to be done very often.

The data in a key file, if one exists, can be loaded as easily as the data in the data base itself.

lf:{t:ld[x]
    t[`FMT]:ld[x,".KEY"]
    t}

The first line loads the data file. The second line arranges to load the key file---notice that it has a proper .ISW format itself---as a FMT item within the directory, and then the third line returns the whole directory that has been created as the result of the lf operation.

An Alternative Form

There is an alternative form for a simple data base. I call this the long form (.LNG). Sometimes this form is more convenient to work with, other times the .ISW form is easier to manage. By having the ability to simply move back and forth, we can choose whichever form suits our convenience at each point.

First, to define the form. .LNG files have one field on each line. The form of each line is quite simple:

Name of Field: Value of the Field

The only real constraint here is that the Name of Field cannot, itself, contain a colon, as it is the colon that marks the end of this name, and therefore the beginning (after a space) of the value of the field.

In this form, fields are listed one after the other, and items are separated from one another by a line containing an indeterminate number of hyphens (but at least more than 6). Fields are only present if their value is not null. At the moment we do not distinguish between a field being present but null and that field being absent.

By convention the first item in any file consists of a `defining item' which has the following properties:

  1. Every possible field is present; and
  2. The value of each field is the short name to be used for the field in an .ISW file.

Let's look at the first two items in PRES.LNG, the converted version of PRES.ISW that we presented above:

President No: NBR
President: PRES
Vice President: VPRES
Begin Term: BEGINT
End Term: ENDT
------------
President No: 1
President: George Washington
Vice President: John Adams
Begin Term: 1789
End Term: 1793
------------

At the moment, we will not worry about the issues associated with comments.

Writing a .LNG File

The terse code to write a .LNG file looks like this (with apologies because we have to break the lines so that they look ok here):

\d LNG
v:{,/((.k[Q][`FMT][`LONG],\:": ")
      ,'/:x),\:,"------------"}
wr:{Q::`$x
  r: +{t::!0; {t::t,,.k[Q][`$x]}'x; 
      :t}.k[Q][`FMT][`SHORT]
  (y,".LNG")0:v[,.k[Q][`FMT][`SHORT]]
      ,v[r][&~0=#:',/r,\:,"----------"]
}
\d ^

Here we have some code that may require some explanation, particularly since the role of the argument x can be a little obscure until you get used to reading it.

We'll get to the v function in a moment. There are two arguments to the wr function, the first is the name of a directory and the second is the file name to be assigned to the .LNG file that we are creating. The code displayed here is, as of this writing, sloppy in the sense that it assumes the data directory, which is going to be written out, hangs off of the root K directory (which is called .k). However that saves complexity in the explanations here at the cost of some generality (which we happen not to need much in the case of these demonstration programs).

The first line of the wr function makes the first argument (name of the directory to be written) into a symbol (Q) which can be used in the function. Then the first line makes use of this with a reference to .k[Q][`FMT][`SHORT] which is a list of the short names of the data items that comprise the data base. This list is fed item by item (.k[Q][`$x]) into a list (t) which is then transposed and saved as r. This means that r now contains the data base, item by item.

The second line is rather typical K, but not too difficult to understand, as the real complexity of what is going on is hidden in the v function (which we will get to in a moment). The key operation on the second line is an ASCII file write (0:) which will write the data (stuff to the right of the 0:) into the file named by the second argument with .LNG appended (y,".LNG"). What gets written is the concatenation of the translation of a header item (v[,.k[Q][`FMT][`SHORT]]) and the translation of the data v[r], with the data trimmed by removing the null data items [&~0=#:',/r,\:,"----------"] from the list. Tackling the latter, we expand the data base (in r) by adding a separator line (,"---------") and then `flattening' the file (,/). Next we count the length of each line, and use where (&) to make a list of the lines which are not null. This is used as a subscript of the translated data array (v[r]), thus giving us a list of the non-null lines.

That leaves us with only one mess, v[...] and it is quite a `typical' K function: obscure at first glance, but also basically understandable once we see how to break it down. So let's get to it.

It may be easiest to understand v by seeing its `big picture' first. We are clearly assigning a function to v, and there is one parameter, x. If we look inside see it has the form

,/(...),\:,"------------" 

where we have elided some inner complexity. Leaving aside what is done inside the parenthesis, this fragment has two main effects. First, a separator line (,"-----------") is placed after each item on whatever list is (presumably) generated by the stuff in the parenthesis. That's what the eachleft join (,\:) function does. Then the ,/ is used to flatten this list to a single level, producing lines that can be written directly into a file.

Now we are left with the messy inner parenthesis. To understand it in detail we need to know that the data we hand to the function consists of a list of items, each of which is itself a list of fields.

((.k[Q][`FMT][`LONG],\:": ") ,'/:x)

will iterate over each field within each item of the argument (/:x). What will be done with each field is to join it (,') with each element of a list that consists of the long names of each data field followed by a ": ". This produces a list of lists where each sublist has the usual name: value form that is characteristic of .LNG files.

Now, with this technology in place, solving the problem of converting an .ISW file into a .LNG file becomes trivial:

/ <$Code to Translate .ISW into .LNG$>
/ load ISW.k and LNG.k from appropriate lib
Data: ISW.lf["PRES"]
LNG.wr["Data";"PRES"]

does the job.

Now we can introduce a function to write out an .ISW file.

\d ISW
wr:{Q::`$x
   p: (. .k[Q])[;0]
   t:: +{t::!0; {t::t,,.k[Q][x]}'x; :t}
       r:p[& &/ ~p =/: (`FMT `LONGNAME `SHORTNAME `END)]
   (y,".ISW")0:(,"*",,/($ r),'sep),,/'t,''sep
}
\d ^

This code is a lot like that which wrote out .LNG files, except it is a bit less complicated. This is not a surprise because its structure is somewhat simpler than that used in .LNG files. The one task that is finessed here is writing out any comments into the output file. It would be easy to add some small code to write comments at the beginning of a file, but it is not regarded as worth the potential complications to attempt to restore comments to what might have been their original position, should they have occurrred elsewhere.

There is parallel code for loading a .LNG data file.

\d LNG
ld: {a:0:x,".lng"
     v:a _sm\: "------*"
     a[v]:(#v:&v)#,"End: END"
     cl: 0,/: cl ,' 2 + *:'cl: a _ss ": "
     db: (+cl _' a)[0 2;]
     tab: db[0;!1+v[0]]
     db[0;]: tab ?/: db[0;]
     qq:-1_ (0,1 + v) _ +db
     rs:{@[(#tab)#,"";qq[x;;0];:;qq[x;;1]]}'!#qq
     . +(,`LONGNAME,`SHORTNAME,`$rs[0]),
           (,(,tab),(,rs[0]),+1_ rs),,(2+#tab)#_n}
\d ^

The first three lines read the input file, locate the separator lines and then replace the lines in the original source file with some `end' markers that conform to the general format of the rest of the lines in the .LNG file. Then the cl list is constructed, marking the positions in the lines where the (first) ": " occurs. The lines are cut around each of these markers and the markers are dropped.

The information in the name column of the first item in the data base is stored in tab. Then the values of the name column for the entire data base are translated into indexes of the tab array

HTML Parsing

Another context where I find K quite useful is in making sense of HTML files. This also illustrates reading a file in `byte' mode as opposed to `line' mode.

/<$HTML Reader - Ver DN-1A(2) - [CWY-KZPW]$>
pl:{a::6:x
   {a::_ssr[a;x;" "]}'("\r\n";,"\n")
   "aaa.tmp"6:,/(1_ (&a="<")_ a::"<",a),\:"\r\n"
  `0:"Doing ",x,"->aaa.tmp"}
pl[_i[0]]
. "\\\\"

This code is quite straightforward, once you see the point of what is going on. The pl function reads the file who's name is mentioned on the command line. This file is read in straight ASCII bytes because HTML files are notorious about being cavalier with returns and line feeds. The first official act, in the code, is to get rid of returns and line feeds. Then the file is broken into lines each of which starts with a left-angle-bracket character. This places all HTML codes on separate lines, and follows them with any free text that may follow their corresponding right-angle-bracket. While this does not capture any information about the nesting of the HTML codes, it does make it easy for later processes (typically perl programs) to process the files without the potential confusion that is caused by odd placement of returns, line-feeds and other random `breakage' which has no effect on the meaning of the HTML but which makes the processing of it a real nuiscance.

Interaction with Perl

There are situations where it is particularly nice to be able to mix K with other languages. One place where I find this capability particularly useful is mixing K with perl, as I far prefer perl's regular expression string matching capability to the rather crude facilities which are available in K. In my case the data that I deal with is also generally of such a size that it does not cause me particular pain to write some data out, manipulate it, and then read it back.

I do this often enough with perl to have developed a very simple, but so far effective, procedure for handling this problem with some generality. And in developing this procedure, I wanted to solve one further problem, namely allowing the driving K file to contain whatever perl code and other supporting data that was necessary, in such a fashion that it is not necessary to manage more than one source file to accomplish the whole process.

My design for this facility is implemented in the next fragment, and then used in the program that follows it.

/ <$Allow K Files to have Wrapper - Ver DN-1A(1) - [XUQ-GNQX]$>
\d Wrap
a:{x[0] 0: 1_ x}'b: (&,/"="=(1#2_)'a)_ 3_'
     a: 0: z: MyName,:["." _in MyName;"";".K"]
\d ^

The fragment above performs a simple function. The file that is to be `unwrapped' is assumed to contain a number of K comment lines which have a particular format. They are either data lines (a slash followed by two blanks followed by arbitrary text) or name lines (a slash followed by a blank followed by an equal sign and a file name). The WRAP.K fragment, when executed, causes the block of data lines to be written into a file of the name indicated on the name line which preceedes it.

In the file which follows we use this device to write a perl program into a file passtrak.pl. This file is a legitimate perl program. Its particular function does not need to be discussed here. Suffice it to say that the overall effect of the whole program is to take the HTML that is delivered from UPS Web site concerning the delivery of a particular package. This code reads the file created by UPS and produces a little data base which contains information about the progress of delivery of a package.

/ <$UPS - TRACKING.CGI analysis - Ver DN-1A(1) - [SWF-TQUG]$>
if[()~C::_I;`0:"Track What?";."\\\\"]
Wrap.MyName: "UPS"
. "\\l ",_getenv["AltC"],":\\STORE\\K\\WRAP"
pl:{a::6:x<
    a::_ssr[a;"\r\n";""]
    a::_ssr[a;"\n";""]
    b::(0,&a="<")_ a
}
pl["TRACKING.CGI"]
"tmp.tmp" 0: b[&b _sm\: "<FONT*"]
\perl Passtrak.pl
\del tmp.tmp
\del passtrak.pl
\sort <UPS.ISW >A.TMP
`0:"TRACKING.CGI -> UPS.ISW\n\n"
a: 0: "A.TMP"
Date: "=",("-",$ _dj _ t:_T)[1 2 3 4 0 5 6 0 7 8] Date: Date," ",((":",5$10000 + (_ 60 * (t - _ t)) + 100 * (_ t: 24 * t - _ t))[2 3 0 4 5])," GMT" "UPS.ISW" 0: (,"=",C[0]),(,Date),a . "\\\\" / =PassTrak.pl
/ @ChristianShortName = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec); / @Mnth{@ChristianShortName} = (0...$#ChristianShortName); / open(In,"<TMP.TMP"); / open(Out,">UPS.ISW"); / print Out "*Day|Time|Place|Text|\n"; / while (<In>) / { / chomp; / next if (!/^<FONT /); / s/^<FONT [^>]*>//; / if (/(\S\S\S) (\d+), (\d\d\d\d)/) / { / $Mo = 1 + $Mnth{ $1}; / $Day = sprintf "%4d%02d%02d",$3,$Mo,$2; / } / elsif (/(\d+):(\d+) ([AP])\.M\./) / { / $T = $1; / if ($T == 12) { $T = 0;} / $Time = 100 * $T + $2; / if ($3 eq "P") { $Time += 1200;} / $Time = sprintf "%04d",$Time; / $Place = <In>; chomp($Place); / $Place =~ s/^<FONT [^>]*>//; / $Text = <In>; chomp($Text); / $Text =~ s/^<FONT [^>]*>//; / print Out "$Day|$Time|$Place|$Text|\n"; / } / } / close(In); / close(Out);

This results in a file (it conforms to ISW format, of course) that represents what I recall was a shipment of memory from Crucial.

*Day|Time|Place|Text|
20010522|1708|BOISE AIR SORT, ID, US|ORIGIN SCAN|
20010522|1902|US|PICKUP MANIFEST RECEIVED|
20010523|2055|PHILA AIR HUB, PA, US|LOCATION SCAN|
20010524|0051|WEST CHESTER, PA, US|ARRIVAL SCAN|
20010524|0318|PHILA AIR HUB, PA, US|DEPARTURE SCAN|
20010524|0408|WEST CHESTER, PA, US|ARRIVAL SCAN|
20010524|0559|WEST CHESTER, PA, US|DESTINATION SCAN|
20010524|1317|WEST CHESTER-MALVERN, PA, US|DELIVERY|

While this example is fairly complicated, I hope it gives some idea of the kind of capability that can be easily delivered by a mixture of K and perl.

 

David Ness' summary of work can be found at http://mywebpages.comcast.net/dness

K is Arthur Whitney's remarkable language.

K is a distant descendent of APL, and is used by several big-league financial clients, particularly for situations where fast deployment of systems that access huge data bases is the core problem.

K easily handles problems with data bases that store hundreds of millions of data items.

Home