Given the extremely low overhead of the minimal perfect hashes generated by cmph (on the order of 1/2 a byte per key), I was fairly certain they would generate false positives. That is, if you searched for a key that isn’t in the hash, cmph would tell you it was. As I suspected, this is the case. This has big implications for using minimal perfect hashes as sets.
When you look up any key with cmph_search, it returns an index. To determine whether that index refers to the same key as the one you looked up, you need to keep an array of offsets. Assuming you have more than 65000 keys, that means you will have a 4 byte overhead per key. Once you have the offset, you then need to compare the key at that offset to the one you looked up.
Update (2017-08-30): I have verified that the indexes returned by cmph_search are random. This means you need to create the cmph, then walk back across it looking up each of the strings you loaded it with, and save the string’s offset at the returned index in your array of offsets.
The code to look up completely random keys in a cmph hash is now available in cmph-bench on GitHub. The next step in the benchmark will be to add the index and the keys to the algorithm, and to add the ability to benchmark other solutions, including binary searching and using a dynamic hash like the ones bench-marked in hash-table-shootout.