cmph-bench – A Benchmark for Minimal Perfect Hashing in C

mph-diagThere is an open source software package on Sourceforge that includes implementations in C of several minimal perfect hashing algorithms. It has been uploaded to GitHub at https://github.com/zvelo/cmph. The code is licensed under your choice of LGPL (Lesser GNU Public License) 2 or MPL (Mozilla Public License) 1.1. This is important because MPL has fairly free terms of use, whereas GPL can give people the right to reverse engineer your code if you include libraries licensed (exclusively) with it.

You can download and build the package as follows:

  1. git clone https://github.com/zvelo/cmph
  2. cd cmph
  3. configure
  4. make
  5. make check

There is a sample program in the README. I copied it to a sibling directory (cmph-bench), calling the file cmph-bench.c. I then built it using my baker build tool. Note that the latest version of baker from 2017-06-25 is required, as cmph uses libm.a (the math library) and baker wasn’t resolving symbols in libraries under /usr/lib prior to this version.

After building, the progam cmph-bench is dropped in the target directory and can be run with: target/cmph-bench. I copied the file /usr/share/dict/words, a corpus of almost 100000 unique words used in the Easy Perfect Minimal Hashing benchmark. I then modified the program to create a MPH from the corpus, reload it from disk, and look up every word in the corpus in it. I added code to benchmark each of these phases. Running with the BRZ algorithm:

Constructed MPH in 0.381116 seconds
Loaded MPH in 0.000689 seconds
Looked up 99171 keys MPH in 0.058402 seconds

I added commands to allow any of the supported algorithms to be bench-marked. Running without an algorithm gives you the list of supported algorithms:

usage: cmph-bench (bdz|bdz-ph|bmz|bmz8|brz|chd|chd-ph|chm|fch)

Finally, I added code to track the amount of memory being allocated using the linux function mallinfo. This produced some interesting results. E.g., when I ran: target/cmph-bench chd

Constructed MPH in 0.379223 seconds, memory 20528
Loaded MPH in 0.000653 seconds, memory 182960
Looked up 99171 keys in MPH in 0.059219 seconds, corpus size 938848

Note that the corpus size is the actual size of the corpus including just the words separated by newlines. That means that the memory allocated for the MPH does not include the keys. So how does cmph know that when there’s a hit, its actually a match for the same key? I’m going to have to delve into the code of cmph to figure that out. For now, I’ve pushed the code of cmph-bench to github: https://github.com/jimbelton/cmph-bench. Feel free to download it (or fork it) and play.

My friend Simon did so and sent me some comments:

Problem with the benchmark is that 100k words is way too small. The entire output set probably sits in cache memory and so the timing might not reflect reality for bigger sets. Instead of words then just try using consecutive big numbers as keys, e.g. 1,000,000,000 onwards…
Also, when looking up the keys then look them up in a random order so CPU cache effects do not skew the results.

The randomization would be an interesting extension to the test. The larger data set can be achieved by changing the corpus-words, as Simon shows in his last comment.

Baker is not working for me:

simon@precision-5520:~/cmph/cmph-bench$ ../baker/baker

/usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp’ with pointer equality in `../../../../usr/lib/x86_64-linux-gnu/libc.a(strcmp.o)’ can not be used when making an executable; recompile with -fPIE and relink with -pie
collect2: error: ld returned 1 exit status
fatal: Failed to link C program with: cc target/cmph-bench.o -o target/cmph-bench ../cmph/src/.libs/libcmph.a ../../../../usr/lib/x86_64-linux-gnu/libm.a ../../../../usr/lib/x86_64-linux-gnu/libc.a

Interestingly, if I just do the last link step manually instead of baker with -lm -lc instead of those long paths then everything seems to work.

simon@precision-5520:~/cmph/cmph-bench$ cc target/cmph-bench.o -o target/cmph-bench ../cmph/src/.libs/libcmph.a -lm -lc

I’m not sure why baker is having trouble with libc.a. The -lc flag should not be required. If you run into this problem, try Simon’s work around.

Hmmm… if I try with a 1M or 10M word corpus then it always asserts 😦 — Simon

simon@precision-5520:~/cmph/cmph-bench$ target/cmph-bench brz
cmph-bench: brz.c:417: brz_gen_mphf: Assertion `nkeys_vd < brz->size[cur_bucket]’ failed.Aborted (core dumped)

Not sure why the BRZ algorithm has this limitation. It’s also fairly slow. I will likely do a future post with some graphs of the various algorithms with various sized key sets.

If I change the algo in the bench mark program then I can avoid the assert and even 10 M keys works 🙂 However, the construction time and lookup time do not seem to be linear going from 10M keys to 100M keys 😦 And I didn’t try to randomize the keys lookups yet…

simon@precision-5520:~/cmph/cmph-bench$ perl -e ‘foreach(0..10_000_000){printf qq[%u\n],$_;}’ > corpus-words
simon@precision-5520:~/cmph/cmph-bench$ target/cmph-bench bdz
debug: Open the dump file and get the time and memory stamps
debug: Create minimal perfect hash function using the chosen algorithm
Constructed MPH in 9.087200 seconds, memory 3459552
Loaded MPH in 0.000649 seconds, memory 3458976
Looked up 10000001 keys in MPH in 1.362376 seconds, corpus size 78888899

simon@precision-5520:~/cmph/cmph-bench$ perl -e ‘foreach(0..100_000_000){printf qq[%u\n],$_;}’ > corpus-words
simon@precision-5520:~/cmph/cmph-bench$ target/cmph-bench bdz
debug: Open the dump file and get the time and memory stamps
debug: Create minimal perfect hash function using the chosen algorithm
Constructed MPH in 259.054525 seconds, memory 3843904
Loaded MPH in 0.006719 seconds, memory 34593344
Looked up 100000001 keys in MPH in 30.801986 seconds, corpus size 888888900

Advertisements
Posted in c programming | Tagged , , , , , , , | 1 Comment

Minimal Perfect Hashing

mphfA perfect hash is a one-to-one map. A key is taken and hash (into an integer) and looked up in a single location in the map. If it is in the map, it will be found in that location. This makes perfect hashes incredibly fast and very light on the CPU cache and the virtual memory system of the OS.

A minimal perfect hash is the smallest perfect hash that can be found for a set of keys. Determining the minimal perfect hash by brute force is time consuming. The paper http://cmph.sourceforge.net/papers/esa09.pdf describes an algorithm for generating a minimal perfect hash in linear time. That means that the time to determine the minimal perfect hash is directly proportional to the number of keys.

I found a sample implementation of the algorithm in python in the article Throw away the keys: Easy, Minimal Perfect Hashing. This program can be run as a benchmark, and takes about 1.7s to run on my laptop. The article points to another implementation in python, this time of the MOS Algorithm II CACM92, by the scientist who discovered the algorithm. When I ran this version, which is supposed to have a compacter representation, the program never came back.

Update: I looked into why the MOS II algorithm went ballistic. It is not a linear algorithm. The part of the code it gets stuck in is a nested loop of O(n2). I left it running, and six hours later, it was still in this loop (the inner loop of which is labelled with the comment “Step 3: rotate patterns and search for suitable displacement”). It’s author claims “This code runs fast linearly”. I beg to differ.

Posted in Uncategorized | Leave a comment

CLI, Directories, and Gob in Go

go_lang1Building on my last post, Playing with Go (AKA golang), I’m going to hack a command line interface (CLI), do some directory parsing, and add some code to read back the note I serialized to a file using gob. To read what was typed on the command line, import the “flag” package. In main(), you can then:

flag.Parse()
fmt.Println(“arguments: “, flag.Args())

I implemented a couple of useful functions that are missing from go: die, which prints an error message and exits with status 1, and assert, which tests a condition and, if false, dies with a message. They both use go’s bizarre “varargs” syntax, … interface{}, to allow me to pass a format string and an arbitrary list of arguments to a printf like formatter.

The code from the last post now implements the “new” command. I’ve implemented two additional commands, “list”, which lists all the notes dates in order, and “show”, which displays the body of the most recent note. Both the new commands use ioutil.ReadDir to get the list of the notes files in sorted order. The show command uses a gob decoder to read in and decode the note in the latest note file.

Here’s the complete program. Cut and paste it into your editor, save it, and go build it, then play with it. If you have any questions about it, feel free to post a comment.

package main

import "encoding/gob"
import "flag"
import "fmt"
import "io/ioutil"
import "os"
import "strings"
import "time"

type Note struct {
    Body []byte
    Created time.Time
}

func die(format string, args ...interface{}) {
    line := fmt.Sprintf(format, args...)
 
    if (!strings.HasSuffix(line, "\n")) {
        line += "\n"
    }
 
    fmt.Fprint(os.Stderr, line)
    os.Exit(1)
}

func assert(condition bool, format string, args ...interface{}) {
    if (!condition) {
        die(format, args...)
    }
}

func usageIf(condition bool) {
    if (condition) {
        die("usage: gnosis (new|list|show)")
    }
}
 
func main() {
    flag.Parse()
    args := flag.Args()
 
    usageIf(len(args) == 0)
 
    if (args[0] == "new") {
        var note Note
        note.Body, _ = ioutil.ReadAll(os.Stdin)
        note.Created = time.Now().UTC()

        os.Mkdir("notes", 0700)
        name := note.Created.Format("notes/2006-01-02T15:04:05.999999999.gob")
        file, err := os.OpenFile(name, os.O_CREATE | os.O_WRONLY, 0600)
        assert(err == nil, "Error '%v' creating file %v", err, name) 
        encoder := gob.NewEncoder(file)
        err = encoder.Encode(note)
        assert(err == nil, fmt.Sprintf("Error '%v' encoding note %v", err, note))
        file.Close()
        os.Exit(0)
    }
 
    if (args[0] == "list") {
        files, err := ioutil.ReadDir("notes")
        assert(err == nil, "Error '%v' reading directory 'notes'", err)

        for _, file := range files {
            stamp := file.Name()
            stamp = stamp[0:len(stamp) - 4] // Strip away .gob suffix
            fmt.Print(stamp + "\n")
        }
 
        os.Exit(0)
    }

    if (args[0] == "show") {
        files, err := ioutil.ReadDir("notes")
        assert(err == nil, "Error '%v' reading directory 'notes'", err)
        name := "notes/" + files[len(files) - 1].Name()
        file, err := os.Open(name)
        assert(err == nil, "Error '%v' opening file %v", err, name)
        decoder := gob.NewDecoder(file)
        var note Note
        err = decoder.Decode(&note)
        assert(err == nil, "Error '%v' decoding note", err)
        file.Close()
        fmt.Print(string(note.Body))
        os.Exit(0)
    }
 
    usageIf(true)
}

 

Posted in golang programming | Tagged , , , , , , , | Leave a comment

Playing with Go (AKA golang)

golangTo installing Go on debian Linux:

sudo apt-get install golang

Check your version like this:

go version
> go version go1.3.3 linux/amd64

Using you favorite text editor (IDEs are for the weak), create a file with the .go extension. I’ll call mine gnosis.go. Code a minimal go program like this (from go’s Getting Started page):

package main

import "fmt"

func main() {
 fmt.Printf("hello, world\n")
 }

Compile it like this (run from the directory where you save the program):

go build

Finally, you can run it, like this:

./gnosis
> hello, world

I want to make it do something interesting, so I will start with a simple note taking app. First, the gnosis program must read the note from standard input, which by default is the keyboard.

package main

import "io/ioutil"
 import "fmt"
 import "os"

func main() {
    note, _ := ioutil.ReadAll(os.Stdin)

    fmt.Print(string(note))
 }

After compile and running it, I typed:

This is a note
Line 2 of the note
[Ctrl-D]

In response, the program printed:

This is a note
Line 2 of the note

Some interesting things to note: ioutil.ReadAll reads the entire contents of a file (in this case, the standard input), and returns the content as a (byte-array, error) ordered pair. Specifying the error be stored in ‘_’ rather than a named variable tells Go to throw it away. The string function is used to convert the byte array into a string of characters. Finally, I typed [Ctrl-D] because in Linux, that is the way to simulate the end of file from the keyboard.

I would like to have a time-stamp associated with each note. To do this, I imported “time” and added this code right before the previous call to fmt.Print:

import "time"

timestamp := time.Now().UTC().Format("2006-01-02T15:04:05.999999999")
 fmt.Print(timestamp + "\n")

Compiling and running the program now causes the time-stamp for the current time to be output in the format shown on a line before the text of the note. But this leaves one more problem: I want to remember what I wrote in my note, so instead of just printing it out, I store it in a structure and write it to a file:

package main

import "encoding/gob"
import "fmt"
import "io/ioutil"
import "os"
import "time"

type Note struct {
    Body    []byte
    Created time.Time
}

func main() {
    var note Note
    note.Body, _ = ioutil.ReadAll(os.Stdin)
    note.Created = time.Now().UTC()

    os.Mkdir("notes", 0700)
    name := note.Created.Format("notes/2006-01-02T15:04:05.999999999.gob")
    file, _ := os.OpenFile(name, os.O_CREATE, 0400)
    encoder := gob.NewEncoder(file)
    encoder.Encode(note)
    file.Close()
}

There’s a lot more going on here. First, I’ve created a type called Note that contains a body (what you type in) and a creation time. Next, I read the body, as before, and capture the creation time (without formatting it as a string this time).

NOTE: This took me a while to figure out: If the structure member’s names don’t begin with an upper case letter, they are not serialized. This is a very annoying quirk of the gob package that is not documented.

Next, I write the note out as a GOB, which can later be read back in and used. If there isn’t already a notes subdirectory, I create one. Then, I create a file whose name is the string formatted time-stamp of the note, followed by the extension .gob. Finally, I encode the note into the file as a GOB.

The GOB encodes the note in binary form. If you cat the file, it’s contents are quite unreadable, as parts of it are encoded. Try dumping it a command like the following one, with the time-stamp of the filename replaced with the one just created:

 od -c notes/2017-06-15T02\:14\:53.864365634.gob

You should see output like this:

od-gob

The first part of the file is a description of the type (Note) and its members (Body and Created). This is followed by the value, which includes the string you entered as the body and a binary representation of the time-stamp. This value can be decoded into a variable of type Note. I’ll discuss doing so in a future post.

Posted in golang programming | Tagged , , , , | 1 Comment

Installing Linux Mint Debian Edition 2

My previous article on this subject was actually a guide to Configuring Linux Mint Debian Edition 2 (Betsy). In this article, I’ll talk about the actual installation process. First, download the installation ISO:

  1. Browse to https://www.linuxmint.com/download_lmde.php
  2. Select your version at the bottom. I recommend Cinnamon [64-bit]
  3. Pick a mirror and download the file lmde-2-201503-cinnamon-64bit.iso
  4. Run: cd ~/Downloads

To put it on a thumb drive (the first two steps, from Creating a bootable USB drive from an ISO image, may not be required):

  1. Run: sudo apt-get install syslinux-utils
  2. Run: isohybrid lmde-2-201503-cinnamon-64bit.iso
  3. Before inserting your thumb drive, run: lsblk
  4. Insert the thumb drive and wait for Linux to automount it, then rerun: lsblk
  5. The new device that didn’t show up the first time you ran lsblk is the thumb drive.
  6. Unmount it. E.g.: umount /media/jim/USB STICK
  7. Copy the ISO to the thumb drive. E.g.: sudo dd if=path/to/image.iso of=/dev/sdb

If your laptop won’t boot off a USB thumb drive (like my piece of crap Toshiba Satellite L850D), you can burn the ISO to a DVD-RW using brasero. Make sure to burn the image to disk, not the ISO file as data.

Caveat #1

The longstanding bug where, after you sudo apt-get update and sudo apt-get upgrade, cinnamon is broken when you reboot, is still in full force. If you run into this, boot into the backup GUI, access Menu/Administration/Synaptic Package Manager, and update cinnamon.

Caveat #2

I ran into this totally weird problem: on boot, the system would no longer accept my password. Entering it using the mouse keyboard widget worked. When I checked in Menu/Preferences/Keyboard, the language had been set to Cherokee.

Posted in systems administration | Tagged , , , , , , , , | Leave a comment

Why is Booting From USB on a Toshiba L850D Laptop so Difficult?

So far, trying to boot from a USB thumb drive on my Toshiba L850D has been an abject failure. The steps I followed here have leave the laptop in a state where it still won’t boot off the thumb drive, but also fails to boot Windows off the hard disk. It is possible to boot off the DVD drive.

The Toshiba L850D BIOS access is truly horrible. The process to have the laptop boot off USB that I’ve pieced together, which has left the laptop in a disfunctional state, is this:

  1. Power the laptop off.
  2. Hold down the Fn and F2 keys
  3. Power the laptop on and hold the keys for a second or so. The machine will then boot to an Automatic Repair menu.
  4. Select See Advanced Repair Options
  5. Select Trouble Shoot
  6. Select Advanced Options
  7. Select UEFI Firmware Settings
  8. Select Restart. The laptop should reboot into the BIOS settings screen
  9. Select Boot on the top menu (using the left arrow key)
  10. Select USB (using the down arrow)
  11. Move USB up in the boot order by pressing the F6 key
  12. Select Security on the top menu
  13. Select Secure Boot using the down arrow and press F5 to toggle its value to Disabled
  14. Select Advanced on the top Menu
  15. Select >System Configuration using the down arrow, the press Enter
  16. Select Boot Mode using the down arrow and press F5 to toggle its value to CSM Boot
  17. Press F10 to Save and Exit and Enter to confirm

Here is one source I followed: http://www.tomshardware.com/forum/64755-63-toshiba-satellite-boot

I’ve read that you can completely reset the BIOS by removing the CMOS backup battery. This did not work for me. After trying this, the laptop still refused to boot from the hard drive. I believe it is because the CSM Boot is still enabled.

Note: After setting it to boot in CSM mode, you can no longer enter the BIOS with Fn-F2. Instead, press F12 at boot and you’ll get a boot menu with an option to change the BIOS configuration.

I finally gave up and booted the laptop off DVD by selecting the correct option from the boot menu. Note that after booting the LMDE2 installer, the track-pad and keyboard did not work unless boot mode was set to CSM Boot.

Posted in systems administration | Tagged , , , , , | Leave a comment

Enabling Rsync to the WD My Cloud

wd-my-cloudLinux’s rsync utility is a simple way to do incremental backups.This saves a lot of time, because rsync will only copy files that have changed. If you have a large collection of MP3s, for example, rerunning an rsync backup only copies the new files (or modified files if you’ve renamed or retagged them) to the backup.

Here’s how to set up the WD My Cloud to be able to back up to it and restore from it using rsync from Linux:

  1. Run: arp -n
  2. If the MAC address of your WD My Cloud doesn’t show, try: sudo arp-scan -l
  3. Copy the IP address that maps to the MAC address of your WD My Cloud
  4. Paste it into the address bar of your browser
  5. After logging in to your WD My Cloud, you should probably change the root password
  6. You can create users and shares from the GUI
  7. Shares will be stored in the /shares directory and are owned by root
  8. From the GUI, click Settings (you may need to scroll the top menu to the left to see this button)
  9. Select Network from the menu
  10. Set SSH to ON

Some important points:

  1. The default ssh user sshd is an alias for root. To log in to the box, run (substituting the correct IP, determined with arp): ssh sshd@192.168.0.11
  2. Do not try to back up to the root account. It is on a tiny filesystem. Instead, specify the full path to one of the shares you’ve created
  3. You can “dry run” any rsync command to see what it would do by including the -n option, but it doesn’t work well, IMO.

Example backup command:

rsync -ahv Documents root@192.168.0.11:/shares/jim

Restore with the same command, switching the source and destination:

rsync -ahv root@192.168.0.11:/shares/jim/Documents ~

You can use the file browser to look at what’s stored in your share.

Note: I followed the first two parts of How to create your own private DropBox with WD MyCloud to set a shared directory and the SSH daemon on the WD My Cloud. Later steps that in that document that show how to set up an rsync daemon do not work for me. It looks as though WD has removed the ability to modify the software using apt.

Here’s a knowledge base article on using rsync that I found helpful: Using rsync to Transfer and Synchronize Local and Remote Systems

Posted in systems administration | Tagged , , , , | Leave a comment