OOMK - An Object Oriented Tcl wrapper for Metakit
Feb 2003

INTRODUCTION

OOMK gives access to a wide range of operations in the Metakit embedded
database library which have not been exposed properly before. It does so
with an object-oriented interface, i.e. Tcl "command objects".

This wrapper is more a stopgap measure than a truly new direction for MK;
it merely tries to bring MK up to date, offering functions similar to those
that have been available for several years from Python and C++.

The wrapper is based on Will Duquette's "Snit" (Snit's Not Incr Tcl),
version 0.8 - see http://www.wjduquette.com/snit/ for details.

STATUS

This is version 0.1, the first public release. Simple things work, as
should all view / row operations described below.

Command clean-up is not terribly convenient yet: in many cases it is still
up to the caller to make sure objects get cleaned up again (using the usual
"rename $obj {}").

The API is rough, and most likely quite incomplete. You may well find that
some things can't be done. I hope that these holes will surface quickly, so
I can fix them and add any necessary extra glue.

I'll do my best to respond to questions and fix bugs promptly, preferably
through the Starkit mailing list (which is more Tcl-oriented than the MK
list) - see http://www.equi4.com/mailman/listinfo/starkit

CONCEPTS

This description assumes you are familiar with Tcl and its OO style. Snit
is used under the hood, so it will help to understand its details when
chasing problems. Fortunately, Snit has good documentation.

There are two object types: storages and views. A storage behaves like a
view, but understands a few more commands, such as "commit".

Views are the central concept in Metakit and in OOMK. They represent what
would be called "tables" in relational terms. For an overview of the basic
concepts of Metakit, see Mark Roseman's MK tutorial:
http://www.markroseman.com/tcl/

MK views are all about view operations.
You take one or more views, apply some operation to them such as sort,
select, or join, and you get a new view back representing the result. That
view is often just smoke and mirrors, i.e. it *looks* as requested, but
internally it plays all kinds of tricks, using the original data. This has
significant consequences for memory use and performance.

The current model is geared towards static use of views, even though some
of the operations in MK support dynamic updating (take a view, create a new
sorted one, and that "virtual" view then tracks changes in the original).
Don't expect too much from that right now.

In fact, this means that the current OOMK design is really mostly geared
towards reading, not writing. You can write (i.e. insert, modify, delete),
and then re-construct new results from scratch, but that is about it. For
use as a full-scale DB, the basic Mk4tcl commands are still the way to go
(or the OOMK equivalents).

The second important aspect of OOMK is actual access to data and the way
one iterates over views. This is completely new and differs from Mk4tcl.
Instead of using "mk::get" and "mk::set", access is done by defining a
cursor, which is a special case of a Tcl array. The cursor points to a
single row at any point in time, but can easily be re-positioned. The data
itself is fetched and changed by accessing array elements.

There are some "big" reasons for doing things this way, which will
eventually become apparent when performance issues are addressed in the
future.

GETTING STARTED

OOMK is a pure-Tcl extension, and so is Snit, so all you need to do is
download the sources and make them available as Tcl packages.

Because Metakit is required anyway, the easiest way to use this is with
Tclkit. In fact, I'm distributing OOMK as a Starkit (also for my own
convenience).

* grab a suitable build of tclkit for your platform from
    http://www.equi4.com/tclkit
* get SDX, the Starkit Developer eXtension, from
    http://www.equi4.com/sdx
    http://www.equi4.com/pub/sk/sdx.kit
* fetch "oomk.kit" using SDX:
    sdx update oomk.kit
  or if you prefer, as a normal download:
    http://www.equi4.com/pub/sk/oomk.kit

That's it. Now, to use OOMK, source this starkit in your code:

    source oomk.kit

Or if you prefer the traditional package approach, unpack it instead:

    sdx unwrap oomk.kit

Then, in your code, make sure the oomk and snit packages are found:

    lappend auto_path [file join [pwd] oomk.vfs lib]

A SIMPLE EXAMPLE

The following is a complete example, with comments added in between.

Let's create a datafile called "mydata.mk":

    source oomk.kit
    package require oomk
    mkstorage db mydata.mk

Set up a new "persons" view, with name, age, and shoesize fields:

    db layout persons {name age:I shoesize:F}

Append a few entries:

    [db view persons] as pv
    $pv append name me age 99 shoesize 6.5
    $pv append name someone age 39 shoesize 8
    $pv append name you age 21 shoesize 7
    #unset pv

In the above, a view object was created and used to append a few new rows
of data; normally the view object would then be dropped. For this example,
the cleanup is being postponed until later, which is why the "unset" is
commented out.

Changes are now in the storage object, but not on file - let's fix that:

    db commit

There is a useful debug tool in OOMK to display the contents of a view:

    $pv dump

The output is:

    name    age shoesize
    ------- --- --------
    me      99  6.5
    someone 39  8.0
    you     21  7.0
    ------- --- --------

SOME VIEW OPERATIONS

Ok, we have a small view with data. Let's try a few things:

    puts [$pv size]

Output will be "3", i.e. the number of rows.

    [$pv select -max age 65] as v
    $v dump
    unset v

Note how a new view gets created, used, and cleaned up.

Output:

    name    age shoesize
    ------- --- --------
    someone 39  8.0
    you     21  7.0
    ------- --- --------

Here is a different, more basic way to do exactly the same as above:

    set v [$pv select -max age 65]
    $v dump
    $v destroy

Stacking can be done, but must use separate steps to allow cleanup:

    [$pv select -max age 65] as v1
    [$v1 sort age] as v2
    $v2 dump
    unset v1 v2

Output:

    name    age shoesize
    ------- --- --------
    you     21  7.0
    someone 39  8.0
    ------- --- --------

OBJECT CLEANUP ISSUES

The advantage of using the "as" idiom instead of calling "destroy" is that
it works very well inside procedure bodies with local variables. With "as",
views are automatically cleaned up on procedure exit.

Object cleanup is, as mentioned before, not perfect. There are some ideas
on how to make this far more automatic, such that view objects can be
passed around at will, and automatically get dropped after the last
reference to the view goes away. This is not the same as Tcl's built-in
refcount mechanism (which applies to Tcl_Obj* internal objects).

To give you an idea of what this is all about, the following one-liner
could be used as a replacement for that last example above, were it not for
cleanup:

    [[$pv select -max age 65] sort age] dump

Alas, the above won't clean up properly. The closest for now is:

    [set v2 [[set v1 [$pv select -max age 65]] sort age]] dump
    $v1 destroy
    $v2 destroy

Which is "hacky" and hard to read, to say the least.

Another way to solve this is in my "VKIT" trial. It uses a trick of setting
a hidden var in the caller's context - http://wiki.tcl.tk/vkit

LOOPING

To iterate over all rows in any view, there is the "loop" command. It sets
up an array with magic entries:

    $pv loop c {
        puts "row $c(#): name = $c(name), age = $c(age)"
    }

The output is:

    row 0: name = me, age = 99
    row 1: name = someone, age = 39
    row 2: name = you, age = 21

Note that you can access the row position as the special field "#".

This access mechanism is quite efficient.
In the above example, the "shoesize" property is never used, and due to
MK's column-wise data model, it will in fact not be accessed at all
internally. This has a good effect on performance: iterating over a few
properties is very efficient, regardless of the size/complexity of other
parts of a view.

This same mechanism can be used to access as well as modify individual
items of a view:

    $pv cursor c
    set c(#) 1
    puts $c(name)
    set c(name) anyone
    unset c

Output will be "someone". Again, unset is used to clean up the cursor.

The above has in fact applied a change, as you can see by dumping the view:

    $pv dump

Output:

    name   age shoesize
    ------ --- --------
    me     99  6.5
    anyone 39  8.0
    you    21  7.0
    ------ --- --------

Making this change permanent will require a "db commit", as always.

The loop command is quite immature. It does not even handle "break" and
"continue" as it should. This will be addressed in a future revision.

HASHED ACCESS

One of the more interesting functions in the MK core is hashed access. One
of the main reasons for writing OOMK in the first place was to provide at
least a basic way to expose this mechanism to Tcl.

Hashing in MK can be introduced after the fact. It is somewhat unusual
because the data view itself is not affected. Hashed access can co-exist
with keeping data in a certain order, for example.

The way to use hashing is to define a secondary view, and set up a special
hashing view on top of all this. The hashing view must be set up each time
after open, but that in itself is instant.

Let's add hashing to the above view. First of all, hashing only works with
the leading properties in a view. Let's use the first, i.e. "name", as the
hash key in this example.

First, define the secondary view:

    db layout persons_h {_H:I R:I}

Then, create the new hash layer:

    [db view persons_h] as pv_h
    [$pv hash $pv_h 1] as hpv

The result is a view called $hpv, which for all practical purposes is
equivalent to $pv.
It does, however, differ in a few ways:

* a "select" or "find" on the "name" field will use hashing if possible
* all changes made to the view *must* be made to $hpv, *not* $pv

This means that if you set up $hpv when opening the datafile and use that
variable instead of $pv everywhere in your code, you'll get hashing for
free:

    puts [$hpv select name you]

The output is "2", but in this case it will be obtained through hashing.
This has O(1) performance, regardless of the size of the view.

BLOCKED VIEWS

A second function which OOMK now exposes is the ability to "block" data.
This can store large views (hundreds of thousands or millions of rows) in a
slightly different form, which avoids the linear performance slowdowns when
doing frequent insertions or modifications.

The representation on file is a view with several smaller subviews
containing the actual data. But logically, blocked views continue to appear
as one large view.

Unlike hashed views, blocked views have to be designed and set up as such
from the outset (this is a limitation of the current MK core). To switch
from a flat large view to a blocked view at a later stage, you will have to
do a conversion and copy the data over. This can all be done with a few
lines of Tcl, but it does require extra work to do the transition.

Let's convert the above "persons" view to blocked format:

    db layout persons_b {{_B {name age:I shoesize:F}}}

Note the brace levels: this is a view with one subview called "_B".

    [db view persons_b] as v1
    [$v1 blocked] as bpv
    unset v1

The above creates a new blocked view, ready for use. To copy data from the
original view to the new view, do this:

    $pv loop c {
        set cmd [list $bpv append]
        foreach {x y} [array get c] {
            if {$x ne "#"} { lappend cmd $x $y }
        }
        eval $cmd
    }

It's a bit contorted: it essentially constructs a new "$bpv append ..."
command, and then runs that to copy the data to the other view. It's also a
bit limited - this will not work if $pv has subviews, but in this example
that is not the case.

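The copy loop can also be wrapped in a small procedure. The following is
only a sketch: "copyrows" is a hypothetical helper, not part of OOMK, and it
assumes that "loop" evaluates its body in the calling scope (as the example
above suggests) and that both views share the same flat layout.

```tcl
# Hypothetical helper (not part of OOMK): copy every row of $src to $dst.
# Assumes both views have the same flat layout (no subviews), and that
# "loop" evaluates its body in the scope of the caller.
proc copyrows {src dst} {
    $src loop c {
        # start a "<dst> append" command, then add each property/value pair
        set cmd [list $dst append]
        foreach {prop value} [array get c] {
            # "#" is the row position, not a stored property
            if {$prop ne "#"} {
                lappend cmd $prop $value
            }
        }
        eval $cmd
    }
}
```

With such a helper defined, the copy step would read:

    copyrows $pv $bpv
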
Now, the original view is no longer needed:

    unset pv
    db layout persons ""

Time to commit the changes:

    db commit

From now on, $bpv can be used just as $pv was:

    $bpv dump

The output is omitted, it's the same as before.

Blocked views and hashed views can be combined in different ways: the data
view can be blocked and/or the hash map ("persons_h") can be. There are
several trade-offs, which are beyond the scope of this documentation.

Blocked views scale well beyond regular views. They trade this for slightly
slower positional access. Looping is currently also a bit slower, due to
some optimizations not yet in the MK core.

VIEW OPERATIONS

The OOMK layer exposes a lot more of the underlying MK engine than plain
Mk4tcl does. Not all of this is complete, nor is all of this even solid. In
fact, not all of these make sense in Tcl as is, probably...

Here's an attempt to briefly summarize what is available:

UNARY VIEW OPS

    [$v1 blocked] as v2
        described above - using subviews for scalability
    [$v1 clone] as v2
        create a view with same structure but without the data
    [$v1 copy] as v2
        create a new view with same structure and contents
    [$v1 readonly] as v2
        a derived view which does not allow modification
    [$v1 unique] as v2
        a derived view which has no duplicate rows

BINARY VIEW OPS

    [$v1 concat $v2] as v3
        all rows of v1, followed by all rows of v2
    [$v1 different $v2] as v3
        all rows in v1 or in v2, but not in both
    [$v1 intersect $v2] as v3
        all rows which are present in v1 and in v2
    [$v1 map $v2] as v3
        those rows of v1 which are listed (by row#) in v2
        the size of the result is the size of v2
    [$v1 minus $v2] as v3
        all rows of v1 except those also in v2
    [$v1 pair $v2] as v3
        horizontal concat: columns of v1, then columns of v2
    [$v1 product $v2] as v3
        cartesian product, all combinations of all rows
    [$v1 union $v2] as v3
        all rows in v1 or v2, duplicates are included once

UNARY VARARGS VIEW OPS

    [$v1 groupby subview prop ...] as v2
        a derived view, "grouped" on prop, each becoming a subview
    [$v1 indexed map unique prop ...] as v2
        similar to hashed, maintains a sorted index for binary search
    [$v1 ordered ?numkeys?] as v2
        keep $v1 sorted when modifying $v2, binary search if possible
    [$v1 project prop ...] as v2
        a derived view with only the specified properties, in order
    [$v1 range start ?limit? ?step?] as v2
        a derived view, set up as a "slice" over $v1
    [$v1 rename oprop nprop] as v2
        a derived view, with oprop renamed to nprop
    [$v1 restrict cursor pos count] as v2
        don't use this - it's mostly for search optimization

BINARY VARARGS VIEW OPS

    [$v1 hash $v2 ?numkeys?] as v3
        hashed view mapping, described above
    [$v1 join $v2 prop] as v3
        join $v1 and $v2 based on one or more common properties

Of all the above, the one most worth describing in more detail is "map",
which lets you do things like construct complex selections:

    [$v1 map $v2] as v3

The resulting view is virtual; it lists rows present in $v1. The $v2 view
must have an integer first property (its name is ignored). The values are
used as row numbers, and must be in the range of rows present in $v1.

For example, if $v2 contains rows "0 3 2 0", then the result consists of 4
rows: row #0 from $v1, then #3, then #2, then #0 again. By constructing a
view with the right numbers, this can be used to remap a view and implement
selection, repetition, reversal, etc.

The main comment with all these operations is, as before, that they are
rough and absolutely incomplete when compared to the vector primitives of,
say, the "APL" or "J" language. It's a start, it's there today, and it can
probably be tweaked quickly to become reasonably useful.

Having said that, several more or less extensive SQL implementations were
built on top of this (in Python and Lua, given that Tcl never exposed all
of this before). The most recent example is Gordon McMillan's "MkSQL II" in
Python, see http://www.mcmillan-inc.com/mksqlintro.html

Stay tuned...
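
As a footnote to the "map" description above, the row-number remapping can
be modelled in plain Tcl. This is only an illustration of the semantics, not
OOMK code: "maprows" is a made-up name, a list stands in for the rows of
$v1, and a list of integers stands in for the first property of $v2.

```tcl
# Plain-Tcl model of "map" (an illustration only, it does not use OOMK):
# "rows" stands in for $v1, "rownums" for the integer column of $v2.
proc maprows {rows rownums} {
    set result {}
    foreach n $rownums {
        # every row number must refer to an existing row of "rows"
        if {$n < 0 || $n >= [llength $rows]} {
            error "row number $n out of range"
        }
        lappend result [lindex $rows $n]
    }
    return $result
}

set rows {a b c d}
puts [maprows $rows {0 3 2 0}]    ;# selection with repetition: a d c a
puts [maprows $rows {3 2 1 0}]    ;# reversal: d c b a
```

The result always has as many entries as there are row numbers, which
matches the note that the size of the mapped view is the size of $v2.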