David Ness
Mind / Matter

An Experiment in J and K: Part II

By David Ness
Sunday, February 11, 2001

Data Base

A previous note discusses developing simple programs to access information that is stored in my telephone data base. The programs worked well, and I have used them regularly over the past couple of months.

Indeed, the programs involved worked well enough that I expanded my phone book quite substantially by adding the names and locations of lots of the good restaurants in the cities that I often visit. I found lots of lists of restaurants on the net, and using J, K and perl, I managed to get them converted into a format that allowed them to be incorporated in my address book. In addition I figured out a way to get the latitude and longitude of these restaurants, and this meant that my GPS unit could become an increasingly useful appliance in my travels.

There was one undesirable, but obvious, side effect. My phone book increased in size quite dramatically, and it looks like it will continue to increase even more as I find more sources of good restaurant data.

The Problem: Speed

Since disk is cheap, size only becomes a problem if the time required to do something with a large data set becomes unacceptable. With my original data set, the J code responded quickly and the K code responded instantaneously. With the newer, larger data set, however, the delays with K became noticeable and those with J became intolerable---to the point of making the J solution not useful as a practical matter.

Formatting: The Drag Parachute

A brief view of the code, and a couple of timing experiments, suggested that most of the delay was associated with picking up the original data base and formatting it into an accessible form. A little further refinement of the experiments suggested that loading data which was already in the appropriate form for data base access might allow a substantial improvement in performance if it proved to be possible.

Live vs. Stable Data

In this problem, the data base is not very `live'. Phone numbers change, new people and places get added and old entries get deleted, but all-in-all not very often. Every few days, perhaps, a number will change or there will be an addition. However, numbers will be looked up many times for each change that gets made. This makes it quite plausible to consider separating the formatting stages of the data parsing task from the lookup process.

Using K's Data Format

Implementing this approach in K proved to be trivial. K supports a `data read' function, so splitting the problem into two parts

  • Reading the original data, understanding its format and writing out a `data file'; and
  • Loading the 'data file' to lookup and present an entry.
offers the prospect of an improvement in performance. The net effect of this approach is to save the cost of reprocessing the format information each time a number is looked up while incurring the cost---whatever it may be---of storing the data in data base form and loading it each time a number is to be retrieved.

The Results with K

It took about 10 minutes to implement the changes to the K version of the code. The original PHONE.K file was split into two parts, essentially around the dividing line separating the loading and formatting of the original data from the task of looking up the entries and displaying them on the screen. The conversion task was put into a file PHONECVT.K, while the lookup was left in PHONE.K.

The results were salubrious. The conversion was very rapid. And although the data base data file created by the process is fairly long, tit loads quickly enough in the second phase of the operation to result in a dramatic improvement in overall speed. There is a storage cost: the 362k original data file results in an 851k data base, and this is an `extra cost'. However, in the modern world of cheap storage, this is not a very substantial concern, unless it would cause the data base to `load' slowly in the second phase. Since this doesn't happen, the effect of the change is quite positive. The resultant code responds detectably more quickly with the new larger data base---and the new programming approach---than it did before the additions and conversion.

The K Code

The new K code is presented in two pieces:PHONECVT.K---responsible for converting the data into a data base file, and PHONE.Kwhich loads the data base and performs the lookup. This code is available on the web page.

/ <$Convert Phone .LNG into DB 
     - Ver DN-1A(4) - [KMW-WSPY]$>
/ Called from Console with `k phonecvt'
/ directory for data
d:_getenv["AltC"],":\\work\\phone" 
a:0:d,".lng" / get the data base
/ find lines that end entries
v:a _sm\: "----------*" 
/ make them `End: End'
a[v]:(#v:&v)#,"End: End" 
/ `l n' finds entry # that contains line n
l:{-1 # &(x>-1 _ -1,v)&x<1 _ v,1+#a}
/ `r n' dumps nth entry
r:{ x:,/x; a[b+!v[x[0]]-b:1+(-1,v)[x[0]]]} 
/ prepare to cut up lines
cl: 0,/: cp ,' 2 + cp: a _ss ": "
/ cut them and toss `: '
db: (+cl _' a)[0 2;]
/ get keys from 1st item
tab: db[0;!v[0]]
/ `key' the items
db[0;]: tab ?/: db[0;]
/ fn["Field Name"] crates a `column' 
     of that particular field
fn: {p: &db[0;] = (tab ?/: ,x)[0]; 
      h: (#v)#,""; h[,/l'p]: db[1;p]; h}
/ pickup the fields of interest in this example
Data: fn'("Home Phone";"Last Name";
     "Business Phone";"First Name";
      "Other Phone";"Mobile Phone")
/ Save Db, Searchable and Ends
d 1:.k[`Data `a `v]
. "\\\\" / Quit

Here is the code that is called each time we want to do a lookup.

/ <$Load Phone DB - Ver DN-1A(6) - [VKU-EGVH]$>
/ Called from PHONE.BAT with `k phone %*'
/ DB prepared by PHONECVT.K Fields: 
     Home, Last, Business, First, Other
\m f Lucida Sans Unicode-12
c:_i / save command args
/ location of file
d:_getenv["AltC"],":\\work\\phone" 
.k[`Data `a `v]:1:d / load the data
/ `l n' finds entry # that contains line n
l:{-1 # &(x>-1 _ -1,v)&x<1 _ v,1+#a}
/ find the items that match command arg, 
/ doing a `nub' in the process
/ display items that match argument 
/ somewhere, sort on last,first
ds:{m: & (!#v) _in\: ,/ 
       l'&a _sm\: "*",x,"*"
    q:Data[;m]
    `show $ +(6#-15) $/: 
         (,Data[;0]),+q[;<Q[1;],'Q[3;]]} 
ds[c[0]]

Trying the K Code

I've tested this code at three different sizes and I've compared the new approach to the old approach (where conversion and lookup was performed every time the program was invoked).
 

Item Original 3x Original 9x Original
Old Code 2.0 sec  45.7 sec  457.6 sec 
Conversion 2.5 sec  47.1 sec  472.8 sec 
New Lookup 0.3 sec   0.6 sec  2.0 sec 
Original Data 362k  1.1m  3.3m 
DB File 851k  2.5m  7.7m 

This rather clearly demonstrates the effect of separating the data interpretation and structuring process from the lookup process as the amount of data increases. The `old code' and `conversion' processes (which do essentially the same thing except for the fact that the old code adds a lookup) seem to increase with the square of the size of the data (roughly) while the lookup process is holding quite linear across the size scale studied here.

J Code

After a couple of false starts, and with the kind help of some of the regulars on the J Forum, I finally got J code that is quite parallel to the K code discussed above working. This code, like the K code, is now divided into two stages, one of which converts the data into a convenient storage format, and the other of which is responsible for looking up the appropriate entries.

NB. <$Convert PHONE.LNG to PHONE.IJF 
           - Ver DN-1A(4) - [APR-DOJQ]$>
NB.!Category: [PHONE;`LNG]
each =: &.>
Toss =: 26 13{a. NB. characters to throw away
s =: ' '"_NB. space and separator constants
c =: ': '"_
f =: (2!:5 'AltC'),':\db\phone'
NB. load the phone db, toss EOF/CR 
           and cut into lines
a =: ([: (< ;. _2) (] (#~) ([: -. 
           (Toss"_ (e.~) ])))) 1!:1 <f,'.lng'
NB. find the lines that are separators 
           and replace them with `End: End'
v =: (# i.@#) > ([: +./ '-----'"_ E. ]) each a
a =: (<'End: End') v}a
NB. build thd data base and a list of column titles
sl=:(c E. s,],c)(<;._2)(s,(],c))
tab =: 0{|:(>:0{v){.db =: (1&}.) 
           each"0 >sl each a
NB. replace labels with index numbers
db =: |: (<"0 (tab i. 0{|:db)) 0} |: db
NB. `l n' gives item number that contains line n
l =: 3 : '0{(# i.@#) (_1}.y.>_1,v) *. y. <: v'
NB. `fn text' yields the column of data named `text'
fn =: 3 : 0
   tt =. ~. > l each <"0 xx =. 
           (# i.@#) (tab i. <y.) = >0{"1 db
   (1{"1 xx{db) tt} (#v)$<''
)
NB. build a base of what we want to display
Data =: |:>fn each ('Business Phone';
           'Last Name';'Home Phone';'First Name')
NB. Save the relevant data for (J)PHONE
load 'jfiles'
jcreate f
(<Data) jappend f
(<a) jappend f
(<v) jappend f
2!:55''

Comment

NB. <$Load Phone DB 
     - Ver DN-1A(4) - [GFD-ZFLN]$>
NB. called by JPHONE.BAT containing:
NB. `e:\j406\j.exe e:\j406\user\dnn\phone.ijs %*'
s =: ' '" NB. space and separator constants
c =: ': '"_
sl =: (c E. s,],c)(<;._2)(s,(],c))
r =: 3 : '>,,.sl each a{~([ 
      (+i.) -~)/ 1 0 + (y. + 0 1){_1,v'
NB. `sh n' is a lot like `r n', 
      but I think it's prettier
sh =: 3 : 0
   xx =: >0{zz =: |:(1&}.) 
          each r y.
   (<>(-+/*./ \|.' '=|:xx)|. 
          each <"1 xx),.<>1{zz
)
load 'jfiles'
f =: (2!:5 'AltC'),':\db\phone'
Data =: >jread f;0
a =: >jread f;1
v =: >jread f;2
l =: 3 : '0{(# i.@#) (_1}.y.>_1,v) 
          *. y. <: v'

NB. find the items that have elements 
          that match the argument
m =: 3 : '~.>l each (# i.@#)> +/ 
          each y.&E. each a'

NB. display it
dc =: 3 : 0
   if. a: -: 0{y. do. y. =. 1}.y. end.
   if. '-' = >0{y. do. (>sh 
          each ;m each 1}.y.) 1!:2(2)
                   else. 
          ((0,;m each y.){Data) 1!:2 (2)
   end.
)
NB. get the (last) command argument
cmd =: (1 - a: = cmd) # cmd =: 1}.< 
          ;. _2 (11!:0 'qcmdline'),' '
dc cmd
q =: 3 : '2!:55 '''''

Comment

Trying the J Code

I've run the same experiment with J.
 

Item Original 3x Original 9x Original
Old Code 11.3 sec  129.5 sec  1276.1 sec 
Conversion 11.7 sec  117.0 sec  996.9 sec 
New Lookup 1.2 sec   3.8 sec  11.1 sec 
Original Data 362k  1.1m  3.3m 
DB File 665k  1.3m  1.6m 

This is worthy of a few comments. First, J is somewhat slower than K, something that seems to be true in more than just this one domain. Second, J seems dramatically more conservative of space than K, at least as far as the storage structures that happen to be used in this example indicate. Third, the same approach seems to produce (no surprise here) similar results in both of the languages.

David Ness' summary of work can be found at http://mywebpages.comcast.net/dness

This note continues the development of the story of managing a simple phone data base in J and K

Home