David Ness
Mind / Matter

An Experiment in J and K: Associative Arrays via Hashing

By David Ness
Monday, February 26, 2001

The Purpose

The purpose of this note is to continue to exercise my problem solving capabilities in J and K and to demonstrate some straightforward use of the J and K languages. It is not intended to present examples of particularly good code (it is too early in my use of J and K for that) or of particularly interesting techniques (the problems are chosen for expository reasons). In starting to learn J and K I found study of some of the existing code and documents, particularly the J Phrases (by Burke, Hui, KE Iverson, McDonnell and McIntyre) and the K Idioms by Eugene McDonnell to be of particular value in my learning process. It is my hope that these examples will be of some additional help to others who are learning these languages.

Associative Arrays and Hashes

It was only when I had nearly completed this note that I realized that I had never made the case there was a relationship between hashes and associative arrays. This is, at least to some extent, because the ideas are quite intimately tied to one another in certain contexts. However, each of the concepts can stand, quite well, on its own.

In this note we are primarily interested discussing an implementation of associative arrays. Associative arrays could be implemented many ways. However, if they are not implemented with hashes, then they tend to produce implementations which are both slow and which have bad scale characteristics. Here we will use hashes to effect good performance in our resulting code.

perl is introduced in this note because it is a language that makes considerable use of associative arrays. perl does many things well, and associative arrays are only one of them. However, as far as we are concerned here, associative arrays are one element of perl that is:

  • of real convenience;
  • rarely available in other languages; and
  • a useful adjunct to J and K.

As a result there is good reason to add this kind of capability to J and K. And therefore there is good reason to describe, at least in a rudimentary way, these capabilities in perl.

Associative Arrays in perl

Perl has associative arrays that prove to be an enormous convenience in handling certain classes of problems. Neither J nor K have similar facilities built-in (at least not yet), but both have powerful capabilities that allow perl's operations to be emulated in useful ways. Before discussing the implementation of J and K capabilities to parallel these perl functions, it may prove useful to have a short discussion of the way that perl's associative arrays work and how they are used in practice---at least by me.

Perl's associative arrays are a special case of regular arrays in that language. They are special, first, in that they represent n items with 2-by-n entries. Each entry is a `key/value pair'. In general the items are accessed by `key'. These keys behave as though they were text/object `subscripts' which lead to the values. In perl, keys can be pretty much any kind of object. Perl treats numbers as equivalent to the strings that represent them, so 1 and "1" are (pretty much) indistinguishable. So far in our description perl arrays are just a special restricted case of conventional arrays. What really makes them different, and this is the second thing special about them, is the fact that the keys are hash coded and this makes it possible to use them in circumstances where searching and matching the linear lists would take enough time to slow implementations significantly.

Effect of Hashing

Hashing dramatically impacts the time required to find items in a list. Looking up an item in an (unordered) list without special methods will generally require an amount of time proportional to the number of items in the list. If the list can be ordered, then we can apply a special method like binary search to the task, and the time required changes dramatically, and will typically end up requiring an amount of time proportional to the log (base 2) of the number of items in the list. Even if the mechanisms required to support he binary lookup happen to consume more time than the simple linear search (not unlikely, but also not likely to be a huge difference), as the scale of the problem increases the binary search technology will eventually end up by dominating the situation.

Just to emphasize what is probably a pretty obvious point, multiplying the number of items in a list by 1000 will multiply the time it takes to match an item using a linear search technique by the same amount. However, it will only multiply the time required for binary search by a factor of 10, not 1000, a substantially smaller amount. As the scale of the problem increases this difference is more and more important, and eventually the binary search technique will dominate for all sizes beyond some point.

Hashing employs rather different methodology. At the cost of some space inefficiency (generally, these days, not a very expensive commodity) search time can be made essentially independent of the number of  items in the list. Thus the arguments made above about scale apply even more strongly to this technology.

Of course nothing comes truly for free. There are some disadvantages to hash coding schemes, but in many circumstances these deficiencies are not particularly significant in many particular situations. One example which applies to this problem domain is the identification of near matches or word completion. With an ordered list, when we find an item which is not in the list, we also find items `near' the item being looked for. We may be looking for `dogs' only to find that `dog' is present, but `dogs' isn't. We may have typed `xeno' which isn't in the list, but in looking we encounter `xenophobia' .

In an ordered array, we can program to make use of the entries near our target if that is our wish. In a hash array, however, there is no concept of `near', i.e. entries for `dog' and `dogs' essentially have nothing to do with one another. Unless special methods are employed, finding one tells us nothing about the other. More important, if we don't find one it tells us nothing about the possible occurrence of the other or any related terms. We can't easily answer questions like `what entries begin with xeno...?' in any straightforward fashion, unless we have made special provision for them.

Effect on Implementation

The simplicity of perl's implementation leads to some very straightforward code to accomplish what might appear otherwise to be complex tasks. This is particularly true when items themselves can be used as keys. Perhaps a simple example helps. If I have a file full of URLs and I want to count how often they each appear `while (<>) { ++$URL{$ };}' does the whole job. This tiny fragment:

  • reads the lines; and
  • accumulates a count by each URL
I could complicate the example to handle lines that contain more than just the URL, but since the purpose of this note is to discuss J and K, not to teach perl programming, such elaboration would be pointless here. However, the fragment does emphasize the point of just how nice it is for me not to have to care about the content or structure of the URLs.

This is the central point of my attempt to exercise these sorts of capabilities in J and K. If such capabilities are available, then it becomes possible to use the languages to solve a broad class of  ad hoc problems on the spot.

Explicit vs. Implicit

In perl, implementation of associative arrays is implicit. Except for the fact that associative subscripts occur between curly braces while numerical subscripts occur between square brackets,  it is hard to tell one kind of reference from the other. Perl, by choice, makes it possible for a programmer to access the actual array which implements a particular hash, should they want to. But one rarely needs or wants to do this.

And the mechanisms which implement the concept are generally hidden from the programmer. While it is quite easy to find out how many elements are being used in any particular hash, it is not so easy to find out just how much storage is actually devoted to storing the information. For the most part the system itself can and will manage storage in such a way as to keep balance in the costs of storage and the time it takes to reference things.

In short, perl references associative information in a fashion very similar to conventionally stored linear information (i.e. it is implicit), and it hides the management of such from the user. Neither J nor K have such implicit facilities, although both now have enough capability to allow us to emulate this kind of activity with reasonable ease.

A Caveat

The descriptions that follow are based on my experience as a user of these systems, not from any understanding of the innards of either of the implementations. While I have tried to make the descriptions faithful to the languages, they represent only my own (limited) understanding of the situation and not any `official' description. I have attempted to have those better informed than me help catch errors, but at this stage I am fearful that some misunderstandings remain on my part.

K's Approach

The idea of hashes seems to be deeply embedded in K. But it is quite tied into the concept of directories. The management of information in directories is obviously a high priority item in K, as directories are responsible for a lot of the heavy lifting that makes the language so efficient and effective.

With K (and, as we will see in a moment, with J as well) there is an intermediate concept: the symbol. This is not a concept in perl, but has been a part of K since the beginning (at least as far as I know it) and has recently been introduced into J. A short detour into the nature of symbols is probably appropriate here, as it is relevant to our discussion of both J and K.

A sidebar: Symbols

Strings are a real nuisance when they have to be sorted and or searched for. In general strings are variable length, and if comparisons are made on a byte-by-byte basis, several comparisons may be needed to separate or order `consequential' and `consequence', for example. For this, and other, reasons it is useful to have something that is:

  • related to, or a lot like, a string; but
  • quite easy to manage; and
  • easy to search for and find.
In the context of J and K, symbols are related to strings but have special properties as well. These special properties relate to:
  • role;
  • use; and
  • scope.
Language Role Special Use Scope
J Just a Name None Global
K Names Space Variable Names Within Dictionary

In J, symbols are explicitly introduced, and once introduced are maintained in a global `tree'. Certain operations notice when they have symbol arguments and then operate much more quickly than they would if the arguments were, for example, strings.

In K there does not appear to be a global symbol table, but rather one associated with each dictionary. Of course there may be a global symbol table hidden behind all of this, but if so that is an implementation consideration of no particular consequence to the user of the system. [Later information leads me to believe that K does actually put the symbols into a global table, but nevertheless has a scope for the variables that is per dictionary.]

There is one particularly important difference between symbols in J and K. So far in my experiments I have been able to make symbols out of arbitrary strings in J and K, but if a set of K symbols is restricted to those who's strings are legal variable names, then the K symbols can be made into a dictionary and this may give them special properties. There is some indication that this may change in K4, where there seems to be some intention to broaden the domain of names to allow all strings.

In both languages, symbols are represented by a tick mark (`), and in both languages the symbols which would otherwise be legal variable names are represented by pre pending the tick as in `abc. However, the languages differ in their typography for some of the the other symbols. For example, the symbol generated from the string "123abc" is represented as
  `123abc
in J, while in K (where it can't be used as a variable name anyway) it is
  `"123abc"

Symbols play an important role in both of these languages. And the role is very similar, if not identical. Essentially symbols allow a fast form of comparison, searching and management.

Back to K

Other than dictionaries (and in the---largely---undocumented and unreferenced _hash command) , there are no explicit facilities for dealing with hashes in K. It was not immediately clear to me which K operators are particularly sensitive to hashing, but a little experience rapidly indicates that searches, such as those made with `?', are very sensitive from a timing standpoint to whether their arguments are strings or symbols. A recent experience with a half a million symbols indicated that searching symbols was about an order of magnitude faster than searching for their string equivalents.

I have tested string vs. symbol searching and selection. Symbols have far superior timing characteristics. Of course converting a series of strings into symbols takes a reasonable amount of time, and this should be factored into any design involving the use of symbols instead of strings. I have not yet had the time, or the circumstance, to study the differences between use of the normal lookup mechanisms and those which are supported in the dictionary process. Some problems come with a naturally restricted set of possible keys, and in such cases one could choose a dictionary oriented implementation as opposed to a more normal vector based implementation (such as the ones presented in the code samples that support this document).

Symbol subscripts are nice syntactic sugar in K. In a sense they (almost) make K look like perl in this regard. If Data is a directory, and abc is a symbol in that directory, then Data[`abc] will allow us to access the value of the symbol `abc. This is quite parallel to perl where a reference to $Data{"abc"} would be much the same sort of thing.

J's Approach

J handles this all somewhat differently. J has a symbol operator that converts strings into symbols. Once we have symbols we can put them into vectors that then become searchable very quickly. Obviously the interpreter recognizes when its arguments are lists of symbols as opposed to lists of strings, and in such cases invokes special methods that make the searches and lookups lightning fast.

In J a system operator (s:) operates on strings to convert them into symbols. Once they are in this form they can be used, much like boxed strings, as elements in lists or for comparison, lookup and sorting. It is not clear, as of this moment, how they might be used in `subscripting', but that topic was under active discussion in the J User Forum on the Internet (as of December 2001 such discussion is dead, nothing having been said in months). I remain unconvinced, at the present time, that J's approach to this problem will naturally lead to the opportunity for `subscripting', But while that issue may be unresolved at this time there is no question that J's hashing facility is very effective and useful.

Summary

The following table summarizes the major characteristics of these languages with respect to associative arrays:
 

Language Management How Referenced Explicit Symbols
perl hidden Normal Language No
K half hidden Available as Subscript  Yes
J open No special way Yes(new)

External Considerations

It is not clear what aspects of being a symbol, if any, can survive or be saved from one session to another. J makes some provision, in its documentation, for accomplishing this, but I have not yet had a chance to experiment with it, so I am unclear about just what aspects of `being a symbol' could be preserved and reinstated.

A sidebar on the `Phone' Problem

The `Phone Problem' was presented in another note. Since it may have been studied by some, it is perhaps worth considering why the technology presented here is not relevant to the kind of solution chosen there.

The principle characteristic that renders this associative technology inappropriate for the phone problem is that, perhaps without realizing it, the solutions presented there relied heavily on the notion of being able to search for data with relatively imprecise specifications. If we wanted to use the technology presented here, we would have to be much more careful about how we specified our keys.

As the code stands in the J and K solutions to the phone problem, the text of the data base is searched for matches to the `target' character string. If too many or too few matches are made it is generally easy to fine tune the searches by adding characters to or taking characters from the target. In this scheme there really are no preorganized keys. A search is made across the data each time we have a new target.

To use an associative scheme we would have to decide on what keys to associate with each data item. We could parse names and addresses into individual words, for example, and associate each of them with the items that they occur in, but then when we look for them we had better be very careful to ask about them quite precisely. Searches for `Mill' would not, for example, match `Millcreek' so we would have to know whether we were looking for `Mill Creek' or `Millcreek'. With the text based, non associative scheme that we implemented for this problem a search for `Mill' would have found it either way.

So the associative scheme is good when we can be quite precise about the keys. Other methods tend to be better when we can not.

The J Code

Comment

NB. <$Implement a Hash Coding Scheme for J 
          - Ver DN-1A(4) - [XDS-KNGR]$>
NB. DN 2001-02-24
NB. To broadly parallel perl associative arrays...
NB. Initial Hash	    h =: a:
NB. Storing data	    h =: h sto key;value
NB. Getting data	    h fet key
NB. Deleting an item	    h =: h del key
NB. Key exists? 	    h haskey key
NB. List of Key Symbols     keys h
NB. List of Boxed Values    vals h

sto =: 4 : 0
   'key value' =. y.
   loc =. (syml =. >0{x.) i. <sym =. s: ('/'),":key
   if. loc = #syml do.
	  (<syml,<sym) 0} x.,<<value
       else.
	  (<<value) (>:loc)}x.
       end.
)

fet =: 4 : 0
   loc =. (syml =. >0{x.) i. < s: ('/'),":y.
   if. loc = #syml do. 0 0#0 else. >>x.{~>:loc end.
)

del =: 4 : 0
   locs =. (# i.@#) -. (syml =. >0{x.) = 
          < s: ('/'),":y.
   syml =. locs{syml
   (<syml),x.{~>:locs
)

haskey =: 4 : 0
   -. (#syml) = loc =. (syml =. >0{x.) 
          i. < s: ('/'),":y.
)

keys =: 3 : '>>0{y.'

vals =: 3 : 'y.&fet each 4&s: each keys y.'

show =: 3 : '(''Keys'';''Values'')
          ,(<"1 keys y.),.vals y.'

NB. Test Values: Type Demo''
Demo =: 3 : 0
h =: a: sto ('01';'Here 1')
h =: h sto ('Mat';<6+i.2 3)
h =: h sto ('two';'Here 2')
h =: h sto ('boxes';<<"0( 2 3) $ 'abcdef')
h =: h sto ('1';'Different One')
h =: h sto ('two';'New two')
h =: h sto (1;'Same as string 1')
(h fet 'boxes') 1!:2 (2)
(h fet 'Not There') 1!:2 (2)
(h fet 'two') 1!:2 (2)
(show h) 1!:2 (2)
h =: h del 'two'
h =: h del 'Mat'
h =: h del '01'
h =: h del 1
h =: h del 'boxes'
show h
)

Comment

The K Code

Comment

/ <$Implement a Hash Coding Scheme for K 
     - Ver DN-1A(2) - [ECS-RDVH]$>
/
/  Note: this implementation parallels that 
/  chosen for directories so that (. +hash) 
/  is a valid directory IF all of the keys
/  are legal variable names. If it is known 
/  in advance that _all_ keys are legal variable 
/  names, it may be better to use the directory 
/  mechanisms directly.
/
/  Initialize Hash	 hash:B    / B is (0#`;())
/  Store value		 hash:sto[hash;key;value]
/  Fetch value		 fet[hash;key]
/  Delete value 	 hash:del[hash;key]
/  Key exists?		 haskey[hash;key]
/  List of Key Symbols	 keys[hash]
/  List of Bpxed Values  vals[hash]

B:(0#`;())
sto:{l: x[0] ? s:`$ $y
	if[l = #x[0]; x[0],:s
		      x[1],:,z
		      :x]
	x[1;l]:z
	x}

del:{x[;&1 - x[0] = s: `$ $y]}

fet:{l: x[0] ? s:`$ $y
     :[l = #x[0];0#""; x[1;l]]}

show:{`0:x
      `0:"\n"}

haskey:{~(#x[0]) = x[0] ? s:`$ $y }

keys:{x[0]}

vals:{x[1]}

/ NB. Test Values: Type Demo[]

Demo:{B::sto[B;"01";"Here 1"]
      B::sto[B;"Mat";6+2 3#6+!6]
      B::sto[B;"two";"Here 2"]
      B::sto[B;"boxes";2 3 # "abcdef"]
      B::sto[B;"1";"Different One"]
      B::sto[B;"two";"New two"]
      B::sto[B;1;"Same as string 1"]
      if[x=1;:0]
      show fet[B;"boxes"]
      show fet[B;"Not There"]
      show fet[B;"two"]
      show[5:+B]
      B::del[B;"two"]
      B::del[B;"Mat"]
      B::del[B;"01"]
      B::del[B;1]
      B::del[B;"boxes"]
      show[5:+B]
}

Comment

Comparisons

To be completed

Bibliography

McDonnell, Eugene K Idioms Located at Kx FTP Site Available as both HTML and .DOC file

Burke, Chris and Roger Hui, Kenneth E. Iverson, Eugene McDonnell, Donald McIntyre J Phrases (2nd Ed.)  Available at J Software

Other helpful documentation is available at both Kx Systems and J Software.
 

David Ness' summary of work can be found at http://mywebpages.comcast.net/dness

Associative arrays are an important feature of some languages, particularly perl. J and K do not directly posess such facilities but, as this paper hopefully demonstrates, they are easy to add.

Home