Language Comparison: Entity Scanning

Introduction and Details

Implementation and timing of an HTML character entity scanner in multiple languages. C, Eiffel and Java implementations are currently available.

This started when I was writing some Eiffel code for a university assignment. I had just finished implementing the character entity scanner for the project and was running the program over some test data, and I noticed it was hideously slow compared to Netscape Communicator at displaying files filled with character entities.

Character entities are the pieces of HTML that allow you to display characters that could not otherwise be easily typed, or that are part of HTML syntax itself, such as the European language characters with umlauts and acutes, and characters such as ampersand, greater than and less than. These entities take the form &string; or &#num;. The string is an identifier such as amp or gt that parsing programs recognise, and num is the numeric character code for that character. I don't know exactly how Unicode entities are encoded but I suspect it could be done similarly (the HTML 4.0 spec has some minimal details on this). For the purposes of this test, although some of the named string entities are recognised by some of the implementations, only the numeric entity encoding is tested.
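
As an illustration of what the scanners have to do with a numeric entity (this is just a minimal sketch in C, not code from any of the implementations in the tarball), decoding &#num; amounts to reading the digits between &# and ; and emitting the character with that code:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch only: decode one numeric entity such as "&#65;" into the
     * character code it names.  No named (&amp; style) entities and no
     * error handling beyond returning -1. */
    static int decode_numeric_entity(const char *s)
    {
        char *end;
        long code;

        if (s[0] != '&' || s[1] != '#')
            return -1;                    /* not a numeric entity */
        code = strtol(s + 2, &end, 10);
        if (end == s + 2 || *end != ';')
            return -1;                    /* no digits, or no closing ';' */
        return (int)code;                 /* "&#65;" -> 65, i.e. 'A' */
    }

    int main(void)
    {
        printf("%c\n", decode_numeric_entity("&#65;"));   /* prints: A */
        return 0;
    }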

There are three languages I have implementations of the scanner written in so far (plus a simple Perl regexp implementation to show off differences). There are two different algorithms used, labelled slow_entity and fast_entity: slow_entity is the algorithm I came up with for the uni project, and fast_entity is the one Martijn came up with. The C and Java implementations of fast_entity were written by Martijn; I wrote the Eiffel implementation of fast_entity, and also the Eiffel and C slow_entity.

The idea of this code is to test the speed of the different implementations in the different languages. Correctness is not the focus, though I have checked that each implementation produces correct output for the small and large files.

If anyone wishes to implement either algorithm in a language not yet covered, please do and I will include it here. The Perl implementation doesn't count, as it does not implement either algorithm, and I doubt I would get a speed increase in Perl by implementing either algorithm rather than using the simple regexp.

Downloading and running the code to perform tests

Contents

The code is provided in a tarball which untars to a directory entity. This directory contains the subdirectories c, Eiffel, java and perl, along with a makefile, the test data (buttload 4 release 8) and the testing program run.pl.

Requirements

There is a makefile for each implementation language directory. The C implementation assumes gcc is present and working. The Eiffel was written with SmallEiffel -0.77; I have not tested it with other environments. You can download this release of SmallEiffel from the SmallEiffel web site or any of its mirrors mentioned on the site. The Java implementation assumes a JDK environment is available and working, with the commands javac and java available; it was written and tested using the Blackdown jdk-1.2.2-RC4-linux-i386-glibc-2.1.2 JDK. All running and testing so far has been on Debian potato Linux systems.

The above details are what was used here; you may be able to use different versions or systems for all such things.

The run.pl script assumes perl is /usr/bin/perl and depends on Time::HiRes being available to do the high resolution timing for the testing. I looked at using Benchmark.pm for this, however it does not provide fine enough timing resolution. If you don't have Time::HiRes installed on your system, simply install it with the CPAN shell (perl -MCPAN -e shell; then type "install Time::HiRes" at the prompt) or by any other method you wish.

Results and Experiences

Initially I started writing more than just the Eiffel because I had performance concerns about the Eiffel; later it became a question of how easy it was to cleanly and correctly implement each algorithm in each language. Are there features of any of the languages that made it easier to write, or easier to prove correct (or at least be damn sure the thing is correct)? However, this is still mostly a performance exercise; the program developed is not large enough to serve as a real study of how easy each language is to use for programming and correctness.

I plan soon to write a medium sized project (possibly a graphical X game of some sort) in all three languages, such that the interface and use of the game is similar or identical (GTK in C or possibly C++, vegtk or similar in Eiffel, and Swing in Java). I expect the project will be over 5000 lines for each implementation. This will further test the usefulness of each language and allow me to make more observations about the different elements and features of each language with regard to developing in it compared to the others. It is worth noting that I am mostly a C developer and thus still find it the easiest language to use; because of this I will for now concentrate only on performance and the issues involved for this software.

Optimisations

What was slow about the original? Or: learn what compiler optimisations are available.

That the Eiffel code in slow_entity.e seemed so slow interactively is what spurred me to do all this, and I was of course pleasantly surprised when the C version was so fast. I came to the instant conclusion that Eiffel is a joke: how could anyone use it in a production system when it is so slow? I tested the C code and the Eiffel code on a 10 MB entity file and found that slow_entity.c processed the file in about 3.4 seconds, while the Eiffel slow_entity took about 18 minutes for the same file. (You see the discrepancy? :)

However, I was not quite ready to write Eiffel off as dead; I could not believe anyone could create a language with any following that was this slow. Of course, you will remember that Eiffel, as well as compiling in the Eiffel run time environment, garbage collector and other items, has some neat features such as stack trace generation on errors, contract assertion checking, invariants on loops, and so on. Every function in the standard library performs checks and similar things. This is all compiled in and run when you compile Eiffel plain.

Well duh, of course it is slow if it has to drag all that around through every function call. There is, however, a -boost option to the compile command, and what magic this option performs....

With -boost applied to slow_entity.e, instead of taking 18 minutes it took 53 seconds. A good improvement.

Initially I was not using any optimisations on the C code, and now I am only using -O2, with no overly noticeable performance difference. This is not surprising, as the C is already fast and fast_entity.c especially is nearing disk speed in its processing of entities. I don't know if there are any optimisation flags you can give to javac or java to increase the speed. For the purposes of this test I don't currently need the Java running faster (although I probably should at least search briefly for options).

Basically, the point here is that you should know how the language you are using works internally and how to use the tools you develop with properly. During the writing and testing of software you would be insane to use -boost, as it turns off debugging, assertions, and all the other neat things in Eiffel. For a production system, however, you have in theory tested properly and the software is supposedly bug free (this is of course a myth, software is never bug free; I refer you to the fvwm1 man page entry on Bugs, quoted below), so you will not need the excess baggage in your production binary.
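
The rough C analogue of the same trade-off, as a sketch only (this is not Eiffel contract checking, just the closest everyday C equivalent): assert() checks are compiled in by default and disappear when the code is built with -DNDEBUG, much as -boost strips the checking out of a production Eiffel binary.

    #include <assert.h>
    #include <stdio.h>

    /* Sketch only: the check below is compiled in and executed in a
     * normal build, and compiled out entirely with cc -DNDEBUG, much
     * as SmallEiffel's -boost removes the contract checking. */
    static int divide(int a, int b)
    {
        assert(b != 0);     /* a precondition, checked in debug builds */
        return a / b;
    }

    int main(void)
    {
        printf("%d\n", divide(10, 2));
        return 0;
    }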

Algorithms

It will be drilled into the head of any student of computer science that the algorithm used to solve a problem is still the most important component of the solution. You should choose a simple, obvious and fast algorithm and then write it well, with appropriate variable names and the like, so the code is self explanatory. You will often find that the simpler algorithm is also the faster one, as is the case here.

The slow_entity spends a lot of time copying characters and data from one place to another, and does so often. The fast_entity scans through and does any copying only on a definite match. The slow_entity works and is straightforward to a degree; however, it is not obvious just from reading it what is happening at each stage, why there are so many small loops, or what all the variables do.

The fast_entity is a simpler algorithm and yet is twice the speed. It is always worth your time ensuring you have found the best algorithm for solving a given problem.
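
To make the difference concrete, here is a rough sketch in C of the fast_entity idea, based only on the description above and not on the actual fast_entity.c: walk the input once, pass plain characters straight through to the output, and emit a substitution only once a complete &#digits; sequence has been seen; a false start is written out untouched.

    #include <ctype.h>
    #include <stdio.h>

    /* Sketch only, not the real fast_entity.c: a single pass over the
     * buffer that copies each character at most once.  No overflow
     * checks on the numeric code, no named entities. */
    static void scan_entities(const char *in, size_t len, FILE *out)
    {
        size_t i = 0;

        while (i < len) {
            if (in[i] == '&' && i + 2 < len && in[i + 1] == '#'
                && isdigit((unsigned char)in[i + 2])) {
                size_t j = i + 2;
                int code = 0;

                while (j < len && isdigit((unsigned char)in[j]))
                    code = code * 10 + (in[j++] - '0');
                if (j < len && in[j] == ';') {
                    fputc(code, out);     /* definite match: one copy */
                    i = j + 1;
                    continue;
                }
            }
            fputc(in[i++], out);          /* everything else passes straight through */
        }
    }

    int main(void)
    {
        const char text[] = "fish &#38; chips &#169; 2000";

        scan_entities(text, sizeof text - 1, stdout);   /* "fish & chips (c) 2000" in Latin-1 */
        fputc('\n', stdout);
        return 0;
    }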

Speed

The results speak for themselves: C is still the fastest, though this may not matter if you wish to use the abilities and features of Eiffel. On Java I don't think I can draw a definite conclusion yet, as I don't know if it can be further sped up with optimisation flags to the compiler or similar.
$ /usr/bin/time ./run.pl 
Testing Fast C
Small Data Set
   0.0191 s 2643.1923 KB/s
   0.0181 s 2787.7864 KB/s
   0.0180 s 2802.2012 KB/s
Average 0.0184 s 2744.3933 KB/s
Large Data Set
   1.8707 s 8082.4776 KB/s
   1.8651 s 8106.7668 KB/s
   1.8718 s 8078.0256 KB/s
Average 1.8692 s 8089.0900 KB/s

Testing Slow C
Small Data Set
   0.0271 s 1856.6415 KB/s
   0.0272 s 1853.7052 KB/s
   0.0279 s 1809.5139 KB/s
Average 0.0274 s 1839.9535 KB/s
Large Data Set
   4.5196 s 3345.4621 KB/s
   4.5372 s 3332.4849 KB/s
   4.5161 s 3348.0705 KB/s
Average 4.5243 s 3342.0058 KB/s

Testing Fast Eiffel
Small Data Set
   0.0399 s 1262.9145 KB/s
   0.0391 s 1289.3423 KB/s
   0.0387 s 1301.9320 KB/s
Average 0.0392 s 1284.7296 KB/s
Large Data Set
   7.9375 s 1904.8950 KB/s
   7.9594 s 1899.6578 KB/s
   7.9295 s 1906.8077 KB/s
Average 7.9421 s 1903.7868 KB/s

Testing Slow Eiffel
Small Data Set
   0.2451 s 205.6001 KB/s
   0.2441 s 206.5099 KB/s
   0.2453 s 205.4660 KB/s
Average 0.2448 s 205.8586 KB/s
Large Data Set
   68.8581 s 219.5839 KB/s
   69.1463 s 218.6686 KB/s
   69.2737 s 218.2662 KB/s
Average 69.0927 s 218.8396 KB/s

Testing Fast Java
Small Data Set
   0.5626 s 89.5843 KB/s
   0.5609 s 89.8558 KB/s
   0.5612 s 89.8110 KB/s
Average 0.5616 s 89.7504 KB/s
Large Data Set
   20.3712 s 742.2313 KB/s
   20.3176 s 744.1898 KB/s
   20.4146 s 740.6508 KB/s
Average 20.3678 s 742.3573 KB/s

Testing Perl Regexp
Small Data Set
   0.1110 s 453.9799 KB/s
   0.1049 s 480.2965 KB/s
   0.1047 s 481.3285 KB/s
Average 0.1069 s 471.8683 KB/s
Large Data Set
   25.4552 s 593.9884 KB/s
   25.5550 s 591.6686 KB/s
   25.4648 s 593.7655 KB/s
Average 25.4917 s 593.1408 KB/s

385.19user 4.36system 6:30.92elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (28494major+13398minor)pagefaults 0swaps

Of course I run each test three times and average the results, as this gives a better number that won't be as influenced by startup costs or the data not being in memory. This is standard practice for speed testing on computers.

These results were obtained running the test on a system with dual PII 350 CPUs, using only one CPU for the processing.


From man fvwm for fvwm release 1.24r.
BUGS

       As of fvwm 0.99 there  were  exactly  39.342  unidentified
       bugs.  Identified  bugs  have  mostly  been fixed, though.
       Since then 9.34 bugs have been fixed.  Assuming that there
       are  at  least  10  unidentified bugs for every identified
       one, that leaves us with 39.342 -  9.32  +  10  *  9.34  =
       123.402  unidentified bugs. If we follow this to its logi-
       cal conclusion we will have an infinite number of  uniden-
       tified  bugs before the number of bugs can start to dimin-
       ish, at which point the program will be  bug-free.   Since
       this  is  a  computer program infinity = 3.4028e+38 if you
       don't insist on double-precision. At the current  rate  of
       bug  discovery  we  should expect to achieve this point in
       3.37e+27 years. I guess I  better  plan  on  passing  this
       thing on to my children....

Steven Hanley <sjh@wibble.net>
Last modified: Mon May 1 23:48:07 2000