OboScanners for GeneOntology Files

September 14, 2007 (dg)

Mode/Lines      1        10       100      1000     10000    100000   250000   Memory (Mb)
flex            0.001    0.001    0.001    0.004    0.015    0.182    0.313    2.13
java15          0.132    0.135    0.132    0.242    0.623    1.273    2.130    79
java15_x64      0.137    0.133    0.133    0.233    0.581    1.240    2.288    ND
java16          0.131    0.134    0.131    0.205    0.441    1.194    1.963    800
perl            0.031    0.031    0.031    0.032    0.066    0.384    0.961    7
plex2.0_32      0.001    0.001    0.002    0.011    0.109    1.011    2.765    0.8
plex2.2_32      0.001    0.001    0.003    0.012    0.123    1.136    3.136    1
plex2.2_64      0.002    0.001    0.003    0.016    0.155    1.479    3.808    1
tcl             0.090    0.089    0.092    0.123    0.439    3.392    8.724    6

Java-Memory Observations

September 13, 2007 (dg)

A problem with Java-based applications is the huge amount of memory required to run the scanner. Different settings of the maximum memory allocation pool were tested using the command-line option -Xmx.

Mode            1        10       100      1000     10000    Memory (Mb)
java            0.14     0.256    0.453    1.708    15.500   788
java -Xmx16m    0.174    0.260    0.436    1.728    15.598   290
java -Xmx32m    0.175    0.255    0.457    1.713    15.576   308
java -Xmx64m    0.175    0.254    0.456    1.713    15.460   339
java -Xmx128m   0.156    0.255    0.441    1.708    15.607   404
java -Xmx256m   0.156    0.253    0.421    1.714    15.520   532
java -Xmx512m   0.155    0.253    0.456    1.709    15.507   795
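
A series of runs with different -Xmx settings could be scripted roughly as follows. This is only a sketch: the class name WcScanner and the input file name in the comment are hypothetical, the original measurement procedure is not documented here, and the helper reports wall-clock time only, not memory use.

```python
import subprocess
import time


def xmx_variants(base_cmd, sizes_mb):
    """Return copies of base_cmd with -Xmx<N>m inserted after the executable."""
    return [[base_cmd[0], f"-Xmx{mb}m", *base_cmd[1:]] for mb in sizes_mb]


def time_command(cmd, repeats=3):
    """Run cmd several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard the scanner output so that terminal I/O is not timed.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical usage (requires a Java runtime and a compiled scanner class):
#   base = ["java", "WcScanner", "blast_1000.txt"]
#   for cmd in xmx_variants(base, [16, 32, 64, 128, 256, 512]):
#       print(" ".join(cmd), round(time_command(cmd), 3))
```

Taking the best of several runs reduces the influence of other processes on the timing; whether the figures in the table are single runs or averages is not stated.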

OboScanner

September 13, 2007 (dg)

Comparison of 64-bit and 32-bit scanners generated with the tply lexer for Pascal (Free Pascal) and with the Jflex lexer for Java. The Java programs require about 800 Mb of memory, whereas the Pascal programs require just 1 Mb. However, the Java programs were faster with the complete Gene Ontology OBO file (about 250000 lines).

Mode/Lines      1        10       100      1000     10000    100000   250000
obo-plex32      0.001    0.001    0.003    0.015    0.142    1.334    3.623
obo-plex64      0.001    0.001    0.003    0.015    0.160    1.448    4.038
obo-java15_x64  0.140    0.135    0.133    0.206    0.623    1.297    2.241
obo-java15      0.133    0.136    0.137    0.208    0.596    1.352    2.028
obo-java16      0.134    0.132    0.134    0.215    0.465    1.117    1.943

WC-Comparisons

September 12, 2007 (dg)

Again the same set of BLAST files was used to test a word-counting scanner. The Flex and Re2c based scanners again performed best.

Mode            1         10        100       1000      10000
wc-flex         0.003     0.011     0.102     1.083     12.459
wc-flexpp       0.026     0.169     1.940     21.193    244.294
wc-gcj-exe      0.097     0.123     0.441     3.934     42.928
wc-gcj          0.087     0.307     2.875     30.163    nd
wc-java14       0.153     0.259     0.481     1.748     15.965
wc-java         0.176     0.257     0.444     1.704     15.682
wc-javaip14     0.122     0.345     2.774     28.982    329.265
wc-javaip       0.120     0.345     2.771     28.769    335.123
wc-perl-hand    0.006     0.018     0.155     1.590     18.132
wc-perl-lex     0.164     0.872     9.108     97.561    nd
wc-plex64       0.008     0.044     0.476     5.106     58.264
wc-plex         0.006     0.043     0.433     4.589     55.773
wc-re2c         0.002     0.005     0.035     0.346     3.975
wc-tcl8532      0.257     1.625     17.440    190.076   nd
wc-tcl8564      0.183     1.071     11.550    126.854   nd
wc-tcl          0.401     2.327     25.212    274.777   nd
wc-unix         0.006     0.026     0.285     2.937     33.792
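
For readers unfamiliar with the task, a word-counting scanner in the style of the classic lex example can be sketched with three rules: a word, a newline, and any other character. The Python version below is only an illustration of the technique and is not one of the benchmarked programs; the function and pattern names are made up for this sketch.

```python
import re

# Three lex-style rules, tried in order at each position:
#   word    = a run of non-whitespace characters
#   newline = a line terminator
#   other   = any remaining single character (blanks, tabs, ...)
TOKEN_RE = re.compile(r"(?P<word>\S+)|(?P<newline>\n)|(?P<other>.)")


def word_count(text):
    """Return (lines, words, chars) for text, similar to the Unix wc tool."""
    lines = words = chars = 0
    for match in TOKEN_RE.finditer(text):
        chars += len(match.group())
        if match.lastgroup == "word":
            words += 1
        elif match.lastgroup == "newline":
            lines += 1
    return lines, words, chars


print(word_count("hello world\nfoo\n"))
```

Note that this sketch counts characters rather than bytes, so the third number can differ from wc for non-ASCII input.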

BlastParsers vs BlastScanners

June 26, 2007 (dg)

We recently compared our newly generated BLAST scanners with currently available BLAST scanners from the BioJava project [1], the BioPerl project [2] and with the Zerg BLAST parser [3]. Those parsers were compared with our scanners, created either with C-based scanner generators like Re2c [4] and Flex [5] or with the Java-based scanner generator Jflex [6]. Whereas the parsers mentioned above require source code editing for parsing and analysing BLAST files, our scanners emit SQL code. The BLAST results can afterwards be analysed with a high-level language (SQL). Please note that the BioJava scanner does not work with current BLAST versions. File sizes for the BLAST files were about 1 (small), 14 (medium) and 140 (large) Mb.

Mode              small     medium    large     memory (Mb)
blast-biojava     2.019     8.055     err       1054
blast-bioperl     4.026     47.822    nd        21
blast-flex        0.051     0.567     4.237     20
blast-jflex       0.607     1.906     8.532     863
blast-re2c        0.027     0.283     2.094     19.7
blast-zerg        0.017     0.185     1.331     6.5
blast-tclkit85    135.480   nd        nd        10.1
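
The idea of a scanner that emits SQL, with the analysis then done in SQL, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the input is a heavily simplified, made-up report rather than real BLAST output, and the table layout (hits with query, subject, score, evalue) is hypothetical and not the schema used by our scanners.

```python
import re
import sqlite3

# Made-up, simplified report: a "Query=" line followed by hit lines
# of the form "<subject> <score> <e-value>". Not real BLAST output.
SAMPLE = """\
Query= seq1
hitA  250  1e-60
hitB  48.5  0.002
Query= seq2
hitC  120  3e-25
"""

QUERY_RE = re.compile(r"^Query=\s*(\S+)")
HIT_RE = re.compile(r"^(\S+)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?(?:e[-+]?\d+)?)$")


def sql_quote(text):
    """Quote a string as an SQL literal (single quotes doubled)."""
    return "'" + text.replace("'", "''") + "'"


def emit_sql(report):
    """Scan the report line by line and yield SQL statements."""
    yield "CREATE TABLE hits (query TEXT, subject TEXT, score REAL, evalue REAL);"
    query = None
    for line in report.splitlines():
        match = QUERY_RE.match(line)
        if match:
            query = match.group(1)
            continue
        match = HIT_RE.match(line)
        if match and query is not None:
            subject, score, evalue = match.groups()
            yield "INSERT INTO hits VALUES (%s, %s, %s, %s);" % (
                sql_quote(query), sql_quote(subject), float(score), float(evalue))


def significant_hits(report, max_evalue=1e-10):
    """Load the emitted SQL into an in-memory database and query it."""
    db = sqlite3.connect(":memory:")
    db.executescript("\n".join(emit_sql(report)))
    rows = db.execute(
        "SELECT query, subject FROM hits WHERE evalue <= ? ORDER BY query, subject",
        (max_evalue,)).fetchall()
    db.close()
    return rows
```

With this split the scanner itself stays fixed; changing the analysis, for example the e-value cut-off, only means changing the SELECT statement.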

Comparison of several scanner generators for a simple BLAST scanner

April 10, 2007 (dg)

Sample: BLAST file with 1 to 10,000 result items

Mode        1         10        100       1000      10000
flex        0.003     0.015     0.199     1.979     25.548
flex-tcl    0.005     0.018     0.167     1.775     20.701
gcj         0.099     0.149     0.784     7.536     83.421
gij         0.109     0.482     4.917     51.939    nd
java        0.228     0.348     0.753     3.048     27.806
javaip      0.180     0.517     4.715     49.852    nd
plex        0.011     0.082     0.856     9.565     107.987
perl        0.031     0.050     0.235     2.280     23.916
re2c        0.004     0.012     0.076     0.765     8.438
tcl         1.702     12.524    140.249   nd        nd

The Re2c based scanner is the fastest, but its setup and coding are more complicated than for the other scanners. Flex-based scanners are 2-3 times slower than Re2c based scanners, regardless of whether there is an embedded Tcl interpreter for better string handling (flex-tcl). Jflex code (java) executed with the Sun Java Hotspot virtual machine (1.5), Jflex code compiled to machine code (java-gcj) and Plex (sbs-plex = Pascal lex) based scanners are about 5 to 10 times slower than Re2c based scanners. Interpreted Java code, executed either with the Sun interpreter (java-ip = “java -Xint”) or with the GNU interpreter (java-gij), is about 50 times slower than Re2c code. The Tcl based scanner is about 1000 times slower than the Re2c based one. The Perl scanner is a line-based scanner and is therefore not able to do complicated scanning with more than two states or patterns on the same line.
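
The slowdown factors relative to Re2c can be recomputed from the timings in the table above. The sketch below copies those timings (None stands for "nd"); the function name and the choice of the 100-item column as default are assumptions made for this example.

```python
# Timings copied from the table above; columns are 1, 10, 100, 1000 and
# 10000 result items, None marks a value reported as "nd".
COLUMNS = [1, 10, 100, 1000, 10000]
TIMINGS = {
    "flex":     [0.003, 0.015, 0.199, 1.979, 25.548],
    "flex-tcl": [0.005, 0.018, 0.167, 1.775, 20.701],
    "gcj":      [0.099, 0.149, 0.784, 7.536, 83.421],
    "gij":      [0.109, 0.482, 4.917, 51.939, None],
    "java":     [0.228, 0.348, 0.753, 3.048, 27.806],
    "javaip":   [0.180, 0.517, 4.715, 49.852, None],
    "plex":     [0.011, 0.082, 0.856, 9.565, 107.987],
    "perl":     [0.031, 0.050, 0.235, 2.280, 23.916],
    "re2c":     [0.004, 0.012, 0.076, 0.765, 8.438],
    "tcl":      [1.702, 12.524, 140.249, None, None],
}


def slowdown(baseline="re2c", items=100):
    """Return each scanner's time divided by the baseline time for one column."""
    col = COLUMNS.index(items)
    base = TIMINGS[baseline][col]
    return {
        name: round(times[col] / base, 1)
        for name, times in TIMINGS.items()
        if times[col] is not None  # skip scanners without a measurement
    }


for name, factor in sorted(slowdown().items(), key=lambda item: item[1]):
    print(f"{name:10s}{factor:10.1f}")
```

The factors depend on the input size chosen: for small inputs the start-up cost of the Java virtual machine dominates, so the rounded factors in the text are best read as rough estimates.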

Initial Setup of the Bioscanners and Bioparsers Webpage

March 13, 2007 (dg)

Project Aim

Write parsers for biological data based on scanner generators like Flex (C), Re2c (C), Jflex (Java) and Ifickle (Tcl). These scanner generators provide easier maintenance and development, and higher speed, than hand-written scanners.