Tcl Scanners

Development and usage of C, Java and Tcl based scanners in Tcl-applications

Dr. Detlef Groth and Alexander Straub, MPIMG Berlin, Germany, dgroth(at)molgen.mpg.de

Abstract: Although since many years many scanner generation frameworks like flex (C), jflex (java) and fickle or ylex (Tcl) are existing most of the scanners and parsers are still handwritten. This is this case although it is well proven that automatically generated scanners are easier to write and to maintain and are at least as fast as handwritten ones. Tcl lexers like fickle and ylex are Tcl-only implementations and are well suited for relative small files. In bioinformatics however we are working often with files several gigabytes in size. Here tcl-based scanners are offering only a poor performance if compared with C- or java based scanners. In order to overcome those limitations we were trying to utilise C-based scanners inside tcl-programs using the critcl package and java based scanners using the jacl/tcljava package with promising results.

The current scanner generator packages for tcl, ylex, fickle and tcLex have some serious drawbacks thereof limiting its usability. tcLex is a old C-extension and not actively maintained. ylex requires that the whole string which will be scanned must be in memory which is not suitable for large files. Both ylex and tcLex use more a tclish than a lex-like syntax to dynamically construct scanners. The scanners can be saved as tcl-code as well to generate scanners which can be used independently from the ylex-package. Fickle in contrast use lex-like input files and generates only static scanners which can be used independently from the fickle command line application. In order to overcome the global type of variables and procedures I recently developed ifickle which generates itcl-classes. Those scanners can be integrated into a larger framework/package like the biotcl-package. However due to its heavy use of the regexp command both ylex and fickle/ifickle are generating very slow scanners if the file size increases. Ifickle based scanners have a slightly larger startup overhead and are slightly less slow on larger files.

Table 1 compares the performance of various wc-scanners either hand crafted or generated with scanner generation frameworks like perl-parseLex, flex or re2c. As it visible performance of ylex and fickle / ifickle based scanners is not acceptable if larger files as often required if analysing biological data like genomes, proteomes.

In order to overcome the slow performance with scanners generated by tcl-based scanners generators, we were investigating the usability of C and Java based scanners for integration into tcl-programs. From the two C-based scanners, flex and re2c we choose re2c for an wc-implementation. The reason was, that in contrast to flex, re2c does not depend from external libraries and the code generated from re2c can thereof be easily integrated into tcl-programs using the critcl-package. Furthermore it seems that from our data and from the data of others re2c based scanners are 2 - 5 times faster than flex based scanners. Starting to use re2c was somehow difficult, but after fixing some initial issues with buffer filling and wrapping the re2c application into the critcl-application it was possible to use the re2c scanner either standalone or via the tcl application. The performance was comparable with the pure re2c application however the startup cost was with 1 second for small files quite high.

Because coding in C is for many people much more difficult than coding in java we were further investigating the possibility to utilise java based scanners inside tcl. There are two possible choices for communicating between the java machine and the tcl application. Recently there has been some effort to compile the tclblend library with stubs to utilise it inside starkits, however there is currently no starkit enabled tclblend library available. Tclblend would enable tcl programmers to directly communicate with java classes like with tcljava. In our study we were utilising a second approach, communication via sockets. The tcl application is looking for a free socket port and starts the java application. The java application sends it's parsing results via the socket connections to the tcl application. This approach has bigger startup penalty, because two interpreters must be started, but for large files which should be scanned, the startup is less import than the actual scanner performance. As can be seen from table 1 the scanning speed is comparable with a re2c based wc scanner. Both are about 3 orders of magnitude faster in comparison to tcl based scanners.

We were furthermore writing scanners with ifickle, re2c and jflex for biological data (Blast). Table 2 summaries our observations. The speed for tcl based scanners on larger files is very slow, however Coding and Integration with pure tcl based scanners is easier to accomplish in comparison with re2c or java based scanners.

Conclusions

re2c and jflex based scanners can be utilised for tcl-application thereof limiting the bad performance of current tcl based scanner generators. The big speed difference might be due to the fact that both jflex and re2c did not use any regexp library, but rather are building its own effective mechanism to translate regular expressions into a fast scanner. It might be therefore of interest to study the code generated by re2c and especially for jflex and trying to build a similar tcl based scanner generator. Using the critcl/re2c or socket/java approach gives some startup penalties but this is not of importance if large amount of data, as common in biology, needs to be scanned.

References

Tables

Table 1:

Mode/Size(byte)11010010001000010000010000001000000
tcl-hand0.1300.1200.2000.2900.1350.4893.877nd
tcl-yeti-dynam0.5310.5610.5740.6431.5389.76591.706nd
tcl-yeti-static0.4540.4660.4740.5511.4679.49189.334nd
tcl-fickle0.1290.2240.1330.1840.4873.46033.099nd
tcl-ifickle0.4390.4350.4510.4890.7132.80123.947nd
python-hand0.7240.2580.2600.2480.2600.8811.739nd
perl-hand0.0130.0120.0130.0120.0150.0430.321nd
perl-parseLex0.2130.2030.2290.2580.4372.35921.998nd
java-hand0.7700.4750.5020.4990.4966.1015.748nd
java-jflex0.5790.4820.5670.4880.5770.6610.6761.584
tcl-java-jflex1.0031.0011.0020.9960.9981.0141.1041.983
flex-wc0.0060.0050.0050.0060.0120.0410.3763.633
wc0.0050.0800.0060.0070.0090.0070.0200.148
re2c0.0110.0090.0090.0090.0110.0270.1850.618
tcl-critcl-re2c1.4891.4851.3161.3131.3961.3881.4211.524

Table 2:

-SpeedCodingIntegrationStartup
ifickle--+++++++
re2c++++++(+)
jflex+++++(+)

Source Code

iwc-fickle.fcl

Implementation of wc with ifickle.

%{
 #!/usr/bin/tclsh8.4
 public variable nline 0
 public variable nword 0
 public variable nchar 0
 %}
 %buffersize 1024
 %%
 \n { incr nline; incr nchar 2 ; }
 [^ \t\n]+ { incr nword; incr nchar $yyleng ;}
 . { incr nchar;}
 %%
 if {[llength $argv] == 0} {
    puts stderr "usage wc-fickle inputfile"
    exit 0
 }
 if {[catch {open [lindex $argv 0] r} yyin]} {
    puts stderr "Could not open [lindex $argv 0]"
    exit 0
 }
 set sc [iwcfickle \#auto -yyin $yyin]
 $sc yylex
 puts [format "%7d %7d %7d %s" \
     [$sc cget -nline] \
     [$sc cget -nword] \
     [$sc cget -nchar] [lindex $argv 0]]
 close $yyin

WC.flex

Implementation of wc with java-jflex
 /* WC.flex */
 %%
 %public
 %class Wc
 %standalone
 %unicode
 %{
  int nchars = 0;
  int nwords = 0;
  int nlines = 0; 
 %}
 %eof{
  System.out.println("   "+nlines+"\t"+nwords+"\t"+nchars);
 %eof}
 %%
 [\n] { nlines += 1; nchars += 1;  }
 [ \t]+ { nchars += yylength() ; }
 [^ \t\n]+ { nwords += 1; nchars += yylength(); }

wc-critcl.tcl

WC with re2c, tcl and the critcl package

 /* File: wc-critcl.tcl */
 source ./critcl.kit
 package require critcl
 set cfile [file rootname [info script]].c
 if [catch {open $cfile r} infh] {
    puts stderr "Cannot open $cfile : $infh"
    exit
 } else {
    set ccode [read $infh]
    close $infh
 }
 critcl::ccode $ccode
 critcl::cproc lines {} int { return numline; }
 critcl::cproc words {} int { return numword; }
 critcl::cproc chars {} int { return numchar; }
 critcl::cproc myscan {char* filename} void { readFile(filename); }
 if {[llength $argv] != 1 || ![file exists [lindex $argv 0]]} {
    puts "Usage: [info script] <filename>"
    exit 0 
 }
 myscan [lindex $argv 0]
 puts [format "%7d %7d %7d %s" [lines] [words] [chars] [lindex $argv 0]] 

wc-critcl.re

re2c wc application ready for usage inside tcl applications with the critcl package.

/* File: wc-critcl.re */
 #include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
 #include <fcntl.h>

 #define    EOI    319
 #define    BSIZE    8192
 #define    YYCTYPE        uchar
 #define    YYCURSOR    cursor
 #define    YYLIMIT        s->lim
 #define    YYMARKER    s->ptr
 #define    YYFILL(n)    {cursor = fill(s, cursor);}
 #define    RET(i)    {s->cur = cursor; return i;}
 typedef unsigned int uint;
 typedef unsigned char uchar;
 int numline, numchar,numword= 0;
 typedef struct Scanner {
    int            fd;
    uchar        *bot, *tok, *ptr, *cur, *pos, *lim, *top, *eof;
    uint        line;
 } Scanner;
 uchar *fill(Scanner *s, uchar *cursor){
    if(!s->eof) {
        uint cnt = s->tok - s->bot;
        if(cnt){
            memcpy(s->bot, s->tok, s->lim - s->tok);
            s->tok = s->bot;
            s->ptr -= cnt;
            cursor -= cnt;
            s->pos -= cnt;
            s->lim -= cnt;
        }
        if((s->top - s->lim) < BSIZE){
            uchar *buf = (uchar*) malloc(((s->lim - s->bot) + BSIZE)*sizeof(uchar));
            memcpy(buf, s->tok, s->lim - s->tok);
            s->tok = buf;
            s->ptr = &buf[s->ptr - s->bot];
            cursor = &buf[cursor - s->bot];
            s->pos = &buf[s->pos - s->bot];
            s->lim = &buf[s->lim - s->bot];
            s->top = &s->lim[BSIZE];
            free(s->bot);
            s->bot = buf;
        }
        if((cnt = read(s->fd, (char*) s->lim, BSIZE)) != BSIZE){
            s->eof = &s->lim[cnt]; *(s->eof)++ = '\n';
        }
        s->lim += cnt;
    }
    return cursor;
 }
 int scan(Scanner *s){
    uchar *cursor = s->cur;
 std:
    s->tok = cursor;
 /*!re2c
    [ \t]+ {
        numchar += YYCURSOR-s->tok;
        goto std;
    }
    "\n" {
        if(cursor == s->eof) RET(EOI);
        s->pos = cursor; s->line++;
        numline++;
        ++numchar;
        goto std;
    }
    [!-~]+ { ++numword; numchar += YYCURSOR-s->tok; goto std; }
    [\000] { RET(EOI) ;}
 */

 }
 int readFile (char* filename) {
    Scanner in;
    int t;
    memset((char*) &in, 0, sizeof(in));
    in.fd = open(filename, O_RDONLY);
    if (in.fd == -1) {
         printf("error reading file: %s ", filename);
         return 1 ;
    }  
    while((t = scan(&in)) != EOI){    }
    close(in.fd);
    return 0 ;
 }  
 int main(int argc, char *argv[]){
    int ret ;
    if(argc != 2) {
        printf("usage: %s <filename>", argv[0]);
        return 0 ;
    }  
    ret = readFile(argv[1]);
    printf("\t%i\t%i\t%i\t%s\n", numline,numword,numchar,argv[1]);
    return ret ;
 }

This re2c code needs to be compiled in the following way:

 $ re2c wc-critcl.re > wc-critcl.c
 # (optional) compile command line application
 $ gcc -o wc-critcl wc-critcl.c
 # (optional) run command line application
 $ ./wc-critcl wc-critcl.tcl
       21      106    694 wc-critcl.tcl
 # tcl application
 $ tclkit wc-critcl.tcl wc-critcl.tcl
       21     106     694 wc-critcl.tcl
 $ wc wc-critcl.tcl
       21     106     694 wc-critcl.tcl