cmph-bench – A Benchmark for Minimal Perfect Hashing in C

mph-diagThere is an open source software package on Sourceforge that includes implementations in C of several minimal perfect hashing algorithms. It has been uploaded to GitHub at https://github.com/zvelo/cmph. The code is licensed under your choice of LGPL (Lesser GNU Public License) 2 or MPL (Mozilla Public License) 1.1. This is important because MPL has fairly free terms of use, whereas GPL can give people the right to reverse engineer your code if you include libraries licensed (exclusively) with it.

You can download and build the package as follows:

  1. git clone https://github.com/zvelo/cmph
  2. cd cmph
  3. configure
  4. make
  5. make check

There is a sample program in the README. I copied it to a sibling directory (cmph-bench), calling the file cmph-bench.c. I then built it using my baker build tool. Note that the latest version of baker from 2017-06-25 is required, as cmph uses libm.a (the math library) and baker wasn’t resolving symbols in libraries under /usr/lib prior to this version.

After building, the progam cmph-bench is dropped in the target directory and can be run with: target/cmph-bench. I copied the file /usr/share/dict/words, a corpus of almost 100000 unique words used in the Easy Perfect Minimal Hashing benchmark. I then modified the program to create a MPH from the corpus, reload it from disk, and look up every word in the corpus in it. I added code to benchmark each of these phases. Running with the BRZ algorithm:

Constructed MPH in 0.381116 seconds
Loaded MPH in 0.000689 seconds
Looked up 99171 keys MPH in 0.058402 seconds

I added commands to allow any of the supported algorithms to be bench-marked. Running without an algorithm gives you the list of supported algorithms:

usage: cmph-bench (bdz|bdz-ph|bmz|bmz8|brz|chd|chd-ph|chm|fch)

Finally, I added code to track the amount of memory being allocated using the linux function mallinfo. This produced some interesting results. E.g., when I ran: target/cmph-bench chd

Constructed MPH in 0.379223 seconds, memory 20528
Loaded MPH in 0.000653 seconds, memory 182960
Looked up 99171 keys in MPH in 0.059219 seconds, corpus size 938848

Note that the corpus size is the actual size of the corpus including just the words separated by newlines. That means that the memory allocated for the MPH does not include the keys. So how does cmph know that when there’s a hit, its actually a match for the same key? I’m going to have to delve into the code of cmph to figure that out. For now, I’ve pushed the code of cmph-bench to github: https://github.com/jimbelton/cmph-bench. Feel free to download it (or fork it) and play.

My friend Simon did so and sent me some comments:

Problem with the benchmark is that 100k words is way too small. The entire output set probably sits in cache memory and so the timing might not reflect reality for bigger sets. Instead of words then just try using consecutive big numbers as keys, e.g. 1,000,000,000 onwards…
Also, when looking up the keys then look them up in a random order so CPU cache effects do not skew the results.

The randomization would be an interesting extension to the test. The larger data set can be achieved by changing the corpus-words, as Simon shows in his last comment.

Baker is not working for me:

simon@precision-5520:~/cmph/cmph-bench$ ../baker/baker

/usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp’ with pointer equality in `../../../../usr/lib/x86_64-linux-gnu/libc.a(strcmp.o)’ can not be used when making an executable; recompile with -fPIE and relink with -pie
collect2: error: ld returned 1 exit status
fatal: Failed to link C program with: cc target/cmph-bench.o -o target/cmph-bench ../cmph/src/.libs/libcmph.a ../../../../usr/lib/x86_64-linux-gnu/libm.a ../../../../usr/lib/x86_64-linux-gnu/libc.a

Interestingly, if I just do the last link step manually instead of baker with -lm -lc instead of those long paths then everything seems to work.

simon@precision-5520:~/cmph/cmph-bench$ cc target/cmph-bench.o -o target/cmph-bench ../cmph/src/.libs/libcmph.a -lm -lc

I’m not sure why baker is having trouble with libc.a. The -lc flag should not be required. If you run into this problem, try Simon’s work around.

Hmmm… if I try with a 1M or 10M word corpus then it always asserts 😦 — Simon

simon@precision-5520:~/cmph/cmph-bench$ target/cmph-bench brz
cmph-bench: brz.c:417: brz_gen_mphf: Assertion `nkeys_vd < brz->size[cur_bucket]’ failed.Aborted (core dumped)

Not sure why the BRZ algorithm has this limitation. It’s also fairly slow. I will likely do a future post with some graphs of the various algorithms with various sized key sets.

If I change the algo in the bench mark program then I can avoid the assert and even 10 M keys works 🙂 However, the construction time and lookup time do not seem to be linear going from 10M keys to 100M keys 😦 And I didn’t try to randomize the keys lookups yet…

simon@precision-5520:~/cmph/cmph-bench$ perl -e ‘foreach(0..10_000_000){printf qq[%u\n],$_;}’ > corpus-words
simon@precision-5520:~/cmph/cmph-bench$ target/cmph-bench bdz
debug: Open the dump file and get the time and memory stamps
debug: Create minimal perfect hash function using the chosen algorithm
Constructed MPH in 9.087200 seconds, memory 3459552
Loaded MPH in 0.000649 seconds, memory 3458976
Looked up 10000001 keys in MPH in 1.362376 seconds, corpus size 78888899

simon@precision-5520:~/cmph/cmph-bench$ perl -e ‘foreach(0..100_000_000){printf qq[%u\n],$_;}’ > corpus-words
simon@precision-5520:~/cmph/cmph-bench$ target/cmph-bench bdz
debug: Open the dump file and get the time and memory stamps
debug: Create minimal perfect hash function using the chosen algorithm
Constructed MPH in 259.054525 seconds, memory 3843904
Loaded MPH in 0.006719 seconds, memory 34593344
Looked up 100000001 keys in MPH in 30.801986 seconds, corpus size 888888900

Advertisements

About jimbelton

I'm a software developer, and a writer of both fiction and non-fiction, and I blog about movies, books, and philosophy. My interest in religious philosophy and the search for the truth inspires much of my writing.
This entry was posted in c programming and tagged , , , , , , , . Bookmark the permalink.

One Response to cmph-bench – A Benchmark for Minimal Perfect Hashing in C

  1. Pingback: CMPH Benchmark Bar Chart | Programming with Jim

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s