OOMK - An Object Oriented Tcl wrapper for Metakit

Feb 2003


INTRODUCTION

  OOMK gives access to a wide range of operations in the Metakit embedded
  database library, which have not been exposed properly before.  It does
  so with an object-oriented interface, i.e. Tcl "command objects".

  This wrapper is more a stopgap measure than a truly new direction for MK.
  It merely tries to bring MK up to date, offering functions similar to
  those that have been available for several years from Python and C++.

  The wrapper is based on Will Duquette's "Snit" (Snit's Not Incr Tcl),
  version 0.8 - see http://www.wjduquette.com/snit/ for details.


STATUS

  This is version 0.1, the first public release.
  
  Simple things work, as should all view / row operations described below.
  Command clean-up is not terribly convenient yet: in many cases it is
  still up to the caller to make sure objects get cleaned up again (using
  the usual "rename $obj {}").

  The API is rough, and most likely quite incomplete.  It may well be that
  you find that some things can't be done.  I hope that these holes will
  surface quickly, so I can fix them and add any necessary extra glue.

  I'll do my best to respond to questions and fix bugs promptly, preferably
  through the Starkit mailing list (which is more Tcl-oriented than the MK
  list) - see http://www.equi4.com/mailman/listinfo/starkit


CONCEPTS

  This description assumes you are familiar with Tcl and its OO style.
  Snit is used under the hood, so it will help to understand its details
  when chasing problems.  Fortunately, Snit has good documentation.

  There are two object types: storages and views.  Storages behave as a
  view, but understand a few more commands, such as "commit".

  Views are the central concept in Metakit and in OOMK.  These represent
  what would be called "tables" in relational terms.  For an overview of
  the basic concepts of Metakit, see Mark Roseman's MK tutorial:
        http://www.markroseman.com/tcl/

  MK views are all about view operations.  You take one or more views,
  apply some operation to them such as sort, select, or join, and you get a
  new view back representing the result.  That view is often just smoke and
  mirrors, i.e. it *looks* as requested, but internally it plays all kinds
  of tricks, using the original data.  This has significant consequences
  for memory use and performance.

  The current model is geared towards static use of views, even though some
  of the operations in MK support dynamic updating (take a view, create a
  new sorted one, that "virtual" view then tracks changes in the original).
  Don't expect too much from that right now.  In fact, this means that the
  current OOMK design is really mostly geared towards reading, not writing.
  You can write (i.e. insert, modify, delete), and then re-construct new
  results from scratch, but that is about it.  As a full scale DB, the
  basic Mk4tcl commands are still the way to go (or the OOMK equivalents).

  The second important aspect of OOMK is actual access to data and the way
  one iterates over views.  This is completely new and differs from Mk4tcl.
  Instead of using "mk::get" and "mk::set", access is done by defining a
  cursor, which is a special case of a Tcl array.  The cursor points to a
  single row at any point in time, but can easily be re-positioned.  The
  data itself is fetched and changed by accessing array elements.  There
  are some "big" reasons for doing things this way, which will eventually
  become apparent when performance issues are addressed in the future.
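
  To make the cursor idea concrete, here is a pure-Tcl sketch of an array
  that behaves like a cursor.  It is not OOMK's implementation: the "rows"
  list of dicts is a hypothetical stand-in for a view, "cursor" and
  "cursorTrace" are made-up names, and the code assumes Tcl 8.5 or later.
  It only shows how variable traces can turn array access into row access.

```tcl
# Hypothetical stand-in for a view: a list of rows, each row a dict.
set rows [list \
    [dict create name me age 99] \
    [dict create name someone age 39] \
    [dict create name you age 21]]

# Turn the array named by varName into a cursor positioned at row 0.
proc cursor {varName} {
    upvar 1 $varName c
    array set c {# 0}
    trace add variable c {read write} cursorTrace
}

# Called on every read or write of a cursor element.
proc cursorTrace {arr field op} {
    global rows
    # The position "#" is an ordinary element, nothing to fetch or store.
    if {$field eq "#" || $field eq ""} return
    upvar 1 $arr c
    set pos $c(#)
    if {$op eq "read"} {
        # Fetch the field of the current row just before it is read.
        set c($field) [dict get [lindex $rows $pos] $field]
    } else {
        # Store the value that was just written back into the current row.
        lset rows $pos [dict replace [lindex $rows $pos] $field $c($field)]
    }
}

cursor c
set c(#) 1                              ;# re-position to the second row
puts $c(name)                           ;# prints "someone"
set c(name) anyone                      ;# change the row through the cursor
puts [dict get [lindex $rows 1] name]   ;# prints "anyone"
```

  Only the fields that are actually read or written are ever touched, which
  is the property that makes this style of access attractive.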


GETTING STARTED

  OOMK is a pure-Tcl extension, and so is Snit, so all you need to do is
  download the sources, and make them available as Tcl packages.  Because
  Metakit is required anyway, the easiest way to use this is with Tclkit.
  In fact, I'm distributing OOMK as a Starkit (also for my own convenience).

   * grab a suitable build of tclkit for your platform from
        http://www.equi4.com/tclkit
   * get SDX, the Starkit Developer eXtension from
        http://www.equi4.com/sdx
        http://www.equi4.com/pub/sk/sdx.kit
   * fetch "oomk.kit" using SDX:
        sdx update oomk.kit
     or if you prefer, as a normal download
        http://www.equi4.com/pub/sk/oomk.kit
  
  That's it.  Now, to use OOMK, source this starkit in your code:
        source oomk.kit

  Or if you prefer the traditional package approach, unpack it instead:
        sdx unwrap oomk.kit
  Then, in your code make sure the oomk and snit packages are found:
        lappend auto_path [file join [pwd] oomk.vfs lib]


A SIMPLE EXAMPLE

  The following is a complete example, with comments added in between.
  Let's create a datafile called "mydata.mk":

        source oomk.kit
        package require oomk
        mkstorage db mydata.mk

  Set up a new "persons" view, with name, age, and shoesize fields:
        
        db layout persons {name age:I shoesize:F}

  Append a few entries:

        [db view persons] as pv
        $pv append name me age 99 shoesize 6.5
        $pv append name someone age 39 shoesize 8
        $pv append name you age 21 shoesize 7
        #unset pv

  In the above, a view object was created and used to append a few new rows
  of data.  Normally the view object would then be dropped, but for this
  example the cleanup is postponed until later, which is why the "unset" is
  commented out.
  
  Changes are now in the storage object, but not on file - let's fix that:

        db commit

  There is a useful debug tool in OOMK to display the contents of a view:

        $pv dump
  
  The output is:

        name     age  shoesize
        -------  ---  --------
        me        99       6.5
        someone   39       8.0
        you       21       7.0
        -------  ---  --------


SOME VIEW OPERATIONS

  Ok, we have a small view with data.  Let's try a few things:

        puts [$pv size]

  Output will be "3", i.e. the number of rows.

        [$pv select -max age 65] as v
        $v dump
        unset v

  Note how a new view gets created, used, and cleaned up.  Output:

        name     age  shoesize
        -------  ---  --------
        someone   39       8.0
        you       21       7.0
        -------  ---  --------

  Here is a different, more basic way to do exactly the same as above:

        set v [$pv select -max age 65]
        $v dump
        $v destroy

  Stacking can be done, but must use separate steps to allow cleanup:

        [$pv select -max age 65] as v1
        [$v1 sort age] as v2
        $v2 dump
        unset v1 v2

  Output:

        name     age  shoesize
        -------  ---  --------
        you       21       7.0
        someone   39       8.0
        -------  ---  --------


OBJECT CLEANUP ISSUES

  The advantage of using the "as" idiom instead of calling "destroy", is
  that this works very well inside procedure bodies with local variables.
  With "as", views are automatically cleaned up on procedure exit.
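
  One way such automatic cleanup can be achieved in plain Tcl is an "unset"
  trace on the variable that holds the object name.  The sketch below is an
  assumption about the general technique, not OOMK's actual code: "as" is
  written here as a stand-alone proc rather than a method, and "makeObject"
  and "dropCommand" are hypothetical helpers.

```tcl
# Delete a command; the trailing args are the name1/name2/op values that
# the variable trace appends when it fires.
proc dropCommand {cmd args} {
    if {[llength [info commands $cmd]]} {
        rename $cmd {}
    }
}

# Store the command name in the caller's variable and tie the command's
# lifetime to that variable.
proc as {cmd varName} {
    upvar 1 $varName v
    set v $cmd
    trace add variable v unset [list dropCommand $cmd]
}

# Hypothetical object factory: creates a uniquely named command.
set objCount 0
proc makeObject {} {
    global objCount
    set name ::obj[incr objCount]
    proc $name {args} {return "called with: $args"}
    return $name
}

proc demo {} {
    as [makeObject] o
    $o hello            ;# the object is usable while "o" exists
    return $o           ;# hand back the name for inspection only
}

set name [demo]
puts [llength [info commands $name]]   ;# prints 0: gone on procedure exit
```

  Local variables are unset when a procedure returns, so the trace fires
  without the procedure body having to do anything.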
  
  Object cleanup is, as mentioned before, not perfect.  There are some
  ideas on how to make this far more automatic, such that view objects can
  be passed around at will, and automatically get dropped after the last
  reference to the view goes away.  This is not the same as Tcl's built-in
  refcount mechanism (which applies to Tcl_Obj* internal objects).
  
  To give you an idea of what this is all about, the following one-liner
  could be used as replacement for that last example above, were it not for
  cleanup:
  
        [[$pv select -max age 65] sort age] dump

  Alas, the above won't clean up properly.  The closest for now is:

        [set v2 [[set v1 [$pv select -max age 65]] sort age]] dump
        $v1 destroy
        $v2 destroy

  Which is "hacky" and hard to read, to say the least.

  Another way to solve this is in my "VKIT" trial.  It uses a trick of
  setting a hidden var in the caller's context - http://wiki.tcl.tk/vkit


LOOPING

  To iterate over all rows in any view, there is the "loop" command.  It
  sets up an array with magic entries:

        $pv loop c {
          puts "row $c(#): name = $c(name), age = $c(age)"
        }

  The output is:

        row 0: name = me, age = 99
        row 1: name = someone, age = 39
        row 2: name = you, age = 21

  Note that you can access the row position as special field "#".

  This access mechanism is quite efficient.  In the above example, the
  "shoesize" property is never used, and due to MK's column-wise data
  model, it will in fact not be accessed at all internally.  This has a
  good effect on performance: iterating over a few properties is very
  efficient, regardless of the size/complexity of other parts of a view.
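
  The effect of a column-wise layout can be imitated in plain Tcl.  This is
  only a sketch under simplifying assumptions: each property is kept as a
  separate list in a hypothetical "col" array, and "fetch" records which
  columns were touched.  It says nothing about how MK stores data on file.

```tcl
# One list per property: a stand-in for column-wise storage.
array set col {
    name     {me someone you}
    age      {99 39 21}
    shoesize {6.5 8.0 7.0}
}

set touched {}

# Fetch one item and remember which column it came from.
proc fetch {prop row} {
    global col touched
    if {[lsearch -exact $touched $prop] < 0} {
        lappend touched $prop
    }
    return [lindex $col($prop) $row]
}

set n [llength $col(name)]
for {set i 0} {$i < $n} {incr i} {
    puts "row $i: name = [fetch name $i], age = [fetch age $i]"
}
puts "columns touched: $touched"   ;# prints "columns touched: name age"
```

  The "shoesize" list is never looked at, no matter how large it is.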

  This same mechanism can be used to access as well as modify individual
  items of a view:

        $pv cursor c
        set c(#) 1
        puts $c(name)
        set c(name) anyone
        unset c

  Output will be "someone".

  Again, unset is used to clean up the cursor.  The above has in fact
  applied a change, as you can see by dumping the view:
  
        $pv dump

  Output:

        name    age  shoesize
        ------  ---  --------
        me       99       6.5
        anyone   39       8.0
        you      21       7.0
        ------  ---  --------

  Making this change permanent will require a "db commit", as always.

  The loop command is quite immature.  It does not even handle "break" and
  "continue" as it should.  This will be addressed in a future revision.


HASHED ACCESS

  One of the more interesting functions in the MK core is hashed access.
  One of the main reasons for writing OOMK in the first place was to
  provide at least a basic way to expose this mechanism to Tcl.

  Hashing in MK can be introduced after the fact.  It is somewhat unusual
  because the data view itself is not affected.  Hashed access can co-exist
  with keeping data in a certain order, for example.

  The way to use hashing is to define a secondary view, and set up a
  special hashing view on top of all this.  The hashing view must be set up
  each time after open, but that in itself is instant.

  Let's add hashing to the above view.  First of all, hashing only works
  with the leading properties in a view.  Let's use the first, i.e. "name",
  as hash key in this example.

  First, define the secondary view:

        db layout persons_h {_H:I R:I}

  Then, create the new hash layer:

        [db view persons_h] as pv_h
        [$pv hash $pv_h 1] as hpv

  The result is a view called $hpv, which for all practical purposes is
  equivalent to $pv.  It has however a few differences:
  
   * a "select" or "find" on the "name" field will use hashing if possible
   * all changes made to the view *must* be made to $hpv, *not* $pv

  This means that if you set up $hpv when opening the datafile and use that
  variable instead of $pv everywhere in your code, you'll get hashing for
  free:

        puts [$hpv find name you]

  The output is "2", but in this case it will be obtained through hashing.
  This has O(1) performance, regardless of the size of the view.
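
  The principle of a separate hash map next to unchanged data can be shown
  in plain Tcl.  The sketch below is an analogy, not a description of MK's
  hash view format: "names" stands in for the data view, the "rowOf" array
  for the secondary view, and "findRow" and "appendRow" are hypothetical
  helpers.

```tcl
# The data keeps its own order; hashing does not rearrange it.
set names {me someone you}

# Secondary map from key to row number, kept apart from the data.
array set rowOf {}
set i 0
foreach n $names {
    set rowOf($n) $i
    incr i
}

# Look up a row number without scanning the data.
proc findRow {key} {
    global rowOf
    if {[info exists rowOf($key)]} {
        return $rowOf($key)
    }
    return -1
}

# Changes must update the data and the map together.
proc appendRow {name} {
    global names rowOf
    set rowOf($name) [llength $names]
    lappend names $name
}

puts [findRow you]       ;# prints 2
appendRow them
puts [findRow them]      ;# prints 3
puts [findRow nobody]    ;# prints -1
```

  Appending to "names" directly would leave the map stale, which is the
  same reason all changes have to go through $hpv rather than $pv.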


BLOCKED VIEWS

  A second function which OOMK now exposes, is the ability to "block" data.
  This can store large views (100,000's, or 1,000,000's of rows) in a
  slightly different form, which avoids the linear performance slowdowns
  when doing frequent insertions or modifications.

  The representation on file is a view with several smaller subviews
  containing the actual data.  But logically, blocked views continue to
  appear as one large view.
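
  As a rough picture of the idea, the following pure-Tcl sketch keeps rows
  in several small blocks while presenting one logical sequence.  The block
  contents and the "blockedSize", "blockedGet" and "blockedInsert" procs
  are invented for illustration; MK's real block management (splitting and
  balancing of blocks) is not shown.

```tcl
# Rows stored as a list of small blocks (hypothetical layout).
set blocks {{a b c} {d e} {f g h i}}

# Total number of logical rows.
proc blockedSize {blocks} {
    set n 0
    foreach b $blocks {
        incr n [llength $b]
    }
    return $n
}

# Positional access has to walk the blocks to find the right one.
proc blockedGet {blocks row} {
    foreach b $blocks {
        if {$row < [llength $b]} {
            return [lindex $b $row]
        }
        incr row -[llength $b]
    }
    error "row out of range"
}

# An insert only rewrites the one small block that holds the position.
proc blockedInsert {blocksVar row value} {
    upvar 1 $blocksVar blocks
    set i 0
    foreach b $blocks {
        if {$row <= [llength $b]} {
            lset blocks $i [linsert $b $row $value]
            return
        }
        incr row -[llength $b]
        incr i
    }
    error "row out of range"
}

blockedInsert blocks 4 X
puts [blockedGet $blocks 4]    ;# prints X
puts [blockedSize $blocks]     ;# prints 10
puts $blocks                   ;# prints {a b c} {d X e} {f g h i}
```

  The insert touches a block of three items instead of shifting everything
  after row 4, at the cost of a slightly slower lookup by position.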

  Unlike hashed views, blocked views have to be designed and set up as such
  from the outset (this is a limitation of the current MK core).  To switch
  from a flat large view to a blocked view at a later stage, you will have
  to do a conversion and copy the data over.  This can all be done with a
  few lines of Tcl, but it does require extra work to do the transition.

  Let's convert the above "persons" view to blocked format:

        db layout persons_b {{_B {name age:I shoesize:F}}}

  Note the brace levels: this is a view with one subview called "_B".

        [db view persons_b] as v1
        [$v1 blocked] as bpv
        unset v1

  The above creates a new blocked view, ready for use.  To copy data from
  the original view to the new view, do this:

        $pv loop c {
          set cmd [list $bpv append]
          foreach {x y} [array get c] {
            if {$x ne "#"} {
              lappend cmd $x $y
            }
          }
          eval $cmd
        }

  It's a bit contorted: it essentially constructs a new "$bpv append ..."
  command, and then runs that to copy the data to the other view.

  It's also a bit limited - this will not work if $pv has subviews, but in
  this example that is not the case.

  Now, the original view is no longer needed:

        unset pv
        db layout persons ""

  Time to commit the changes:

        db commit

  From now on, $bpv can be used just as $pv was:

        $bpv dump

  The output is omitted; it's the same as before.

  Blocked views and hashed views can be combined, in different ways: the
  data view can be blocked and/or the hash map ("persons_h") can be.  There
  are several trade-offs, which are beyond the scope of this documentation.

  Blocked views scale well beyond regular views.  They trade this for a
  slightly slower positional access.  Looping is currently also a bit
  slower, due to some optimizations not yet in the MK core.


VIEW OPERATIONS

  The OOMK layer exposes a lot more of the underlying MK engine than plain
  Mk4tcl does.  Not all of this is complete, nor is all of this even solid.
  In fact, not all of these make sense in Tcl as is, probably...

  Here's an attempt to briefly summarize what is available:

  UNARY VIEW OPS

        [$v1 blocked] as v2
                described above - using subviews for scalability

        [$v1 clone] as v2
                create a view with same structure but without the data

        [$v1 copy] as v2
                create a new view with same structure and contents

        [$v1 readonly] as v2
                a derived view which does not allow modification

        [$v1 unique] as v2
                a derived view which has no duplicate rows

  BINARY VIEW OPS

        [$v1 concat $v2] as v3
                all rows of v1, followed by all rows of v2

        [$v1 different $v2] as v3
                all rows in v1 or in v2, but not in both

        [$v1 intersect $v2] as v3
                all rows which are present in v1 and in v2

        [$v1 map $v2] as v3
                those rows of v1 which are listed (by row#) in v2
                the size of the result is the size of v2

        [$v1 minus $v2] as v3
                all rows of v1 except those also in v2

        [$v1 pair $v2] as v3
                horizontal concat: columns of v1, then columns of v2

        [$v1 product $v2] as v3
                cartesian product, all combinations of all rows

        [$v1 union $v2] as v3
                all rows in v1 or v2, duplicates are included once

  UNARY VARARGS VIEW OPS

        [$v1 groupby subview prop ...] as v2
                a derived view, "grouped" on prop, each becoming a subview

        [$v1 indexed map unique prop ...] as v2
                similar to hashed, maintains a sorted index for bin search

        [$v1 ordered ?numkeys?] as v2
                keep $v1 sorted when modifying $v2, bin search if possible

        [$v1 project prop ...] as v2
                a derived view with only the specified properties, in order

        [$v1 range start ?limit? ?step?] as v2
                a derived view, set up as a "slice" over $v1

        [$v1 rename oprop nprop] as v2
                a derived view, with oprop renamed to nprop

        [$v1 restrict cursor pos count] as v2
                don't use this - it's mostly for search optimization

  BINARY VARARGS VIEW OPS

        [$v1 hash $v2 ?numkeys?] as v3
                hashed view mapping, described above

        [$v1 join $v2 prop] as v3
                join $v1 and $v2 based on one or more common properties

  Of all the above, the one most worth describing in more detail is "map",
  which lets you do things like construct complex selections:

        [$v1 map $v2] as v3

  The resulting view is virtual: it lists rows present in $v1.  The $v2
  view must have an integer first property (name is ignored).  The values
  are used as row numbers, and must be in range of rows present in $v1.
  For example, if $v2 contains rows "0 3 2 0", then the result consists of
  4 rows, row #0 from $v1, then #3, then #2, then #0 again.

  By constructing a view with the right numbers, this can be used to remap
  a view and implement selection, repetition, reversal, etc.
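
  The semantics of "map" are easy to mimic with plain Tcl lists.  In this
  sketch "mapRows" is a hypothetical helper, "rows" stands in for the rows
  of the first view and the index list for the integer column of the second
  view; no MK views are involved.

```tcl
# Return the rows picked out by a list of row numbers, in that order.
proc mapRows {rows index} {
    set out {}
    foreach i $index {
        lappend out [lindex $rows $i]
    }
    return $out
}

set rows {r0 r1 r2 r3}

puts [mapRows $rows {0 3 2 0}]   ;# prints r0 r3 r2 r0
puts [mapRows $rows {1 3}]       ;# selection:  r1 r3
puts [mapRows $rows {2 2 2}]     ;# repetition: r2 r2 r2
puts [mapRows $rows {3 2 1 0}]   ;# reversal:   r3 r2 r1 r0
```

  The result always has as many rows as the index list, whatever the size
  of the original.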

  The main comment with all these operations is, as before, that they are
  rough and absolutely incomplete when compared to the vector primitives of
  say the "APL" or "J" language.  It's a start, it's there today, it can
  probably be tweaked quickly to become reasonably useful.

  Having said that, several more or less extensive SQL implementations were
  built on top of this (in Python and Lua, given that Tcl never exposed all
  of this before).  The most recent example is Gordon McMillan's "MkSQL II"
  in Python, see http://www.mcmillan-inc.com/mksqlintro.html

  Stay tuned...