So, for long involved reasons, I’m dealing with hash tables again- and discovering (again) that I don’t like them. Hash tables are beloved by programmers who don’t know their sharp corners and failure modes. “They’re fast!” I hear people say, “Constant time insert, find, and remove- what’s not to like?” A lot, as it turns out. To the point where my default is use a balanced binary tree instead of a hash table unless analysis shows that a hash table is inevitablity better, and even then I’m cautious.
The problem with hash tables is those pesky little foot notes that tend to get glossed over when introducing them to fresh faced programmers who have yet to learn the wisdom of Tom Waits, that the large print giveth and the small print taketh away. This is exactly the case with hash tables- those glossed over foot notes will savage you in the tender bits sooner or later, and turn what is supposed to be a cheap operation into an expensive one. And watching out for them makes hash tables a much harder data structure to use than, for example, balanced binary trees. My hope here is to shed some light on these gotchas.
In addition, this is also a wonderfull demonstation of my point from previous blog posts that computer science theory- and in this case, straight up advanced mathematics- are important to the workaday programmer. What relevance does number theory and abstract algebra have for such a basic data structure as hash tables? Quite a lot, it turns out.
Let me start with that “constant time” insert, remove, and search. The small print that taketh away here, that most people forget (or ignore), is the word “amortized”. Hash table insert, delete, and find are amortized constant time operations- that means, on average they’re constant time. One time in N, however, they are O(N). The expensive operations happen when you have to resize the hash table- either increasing or decreasing the number of buckets in the table. At which point you have to go through the entire hash table and reinspect every single entry to see if it moves to one of the new buckets or not. But since this only happens one time N, the average cost is N/N=1, so you remain in amortized constant time, according to the theoriticians. But this doesn’t help you if you can’t afford the occassional O(N) cost- for example, if you’re doing real-time or pseudo-real-time programming. If you have large hash tables in your program that change size, expect random long pauses as the hash table gets resized.
And this assumes that the hash table is aggressive enough in expanding to keep the probability of a hash table resize down to 1:N. I’ve met more than one hash table implementation that didn’t. The trick is that when you grow (or shrink) the table, to double (or halve) the number of buckets. This way the average (there’s that word again) number of times an element is inspected is constant. Say you’re inserting a million elements into an initially empty table. If you double the size of the table every resize, that means half the elements won’t have been inspected at all (they’ll have been inserted since the last resize), a quarter will have been inspected once, an eight twice, a sixteenth three times, etc. Run the numbers out, and you find that inserting a million elements means that about a million inspections have to be done- with a small number of elements (the ones inserted early) being inspected large numbers of times, and large numbers of elements being inspected a small number of times or not at all. But that the average number of inspections per object is about 1. Depending upon when you do it, it can be as high as 2, but no higher, or as low as 1, but not lower.
A popular implementation technique of hash tables is, rather than doubling (or halving) in size during resize, to just add or remove some constant number of buckets. And then still claim it’s O(1) cost for insert et. al. It’s a hash table, after all. At times like this, I understand why the zen masters tended to simply hit their students upside the head with a stick when they said something stupid. Maybe the student would be enlightened and maybe they wouldn’t, but at least the master would feel better. The problem is that doing this “trick” turns inserting from an O(1) amortized cost to an O(N) amortized cost! The logic goes like this: set k = 1000, and look at inserting a million elements. Now, instead of resizing about 20 times (like we would have if we doubled every resize), we’ve resized a thousand times. We have a thousand elements (the first thousand elements) that have been resized a thousand times, another thousand elements (the second thousand elements) resized 999 times, etc. Run the numbers out, and you’ll find that they sum to about 500 million inspections have taken place for our million objects, or five hundred inspections on average per object. Worse yet, this rate of inspections grows linearly with the number of objects we put into the hash table- put a billion objects in, and now we’re doing about five hundred *thousand* inspections per object on average. This is because you have to do about a million resizings, instead of about 30.
That’s because this implementation has insert (and remove) costs of O(N/K), where K is a constant. In other words, insert and remove are O(N) operations now. This is what I meant about the small print taketh away. You might think that not being quite so profligate with memory would be a good thing- except this “minor” change to the algorithm totally destroys it’s speed advantage.
It’s also possible, if the implementation isn’t carefull, to constantly be resizing the hash table- this is another problem I’ve encountered with hash tables. For example, assume that the hash table implementation shrinks the table if the number of full buckets drops below half, and grows the table if it becomes over full. If the table starts out with 100 buckets, and the number of buckets falls to 49, the size is shrunk to 50. But then two new elements are added, bringing the number of elements to 51, and requiring the table to be grown to 100 again. Remove two elements,
and the table shrinks to 49 elements again. Note that this only happens at “magic values”- if the number of elements is varying between, say, 73 and 75, there would be no resizing, as elements are added and removed. This is a problem- normal testing might not hit the magic number just right, and thus make the hash table look fast- until, generally some time in production, the magic number is hit and you’re constantly resizing the table and experiencing O(N) behavior, instead of the expected O(1) behavior. When I implement hash tables, I generally don’t shrink the table until it’s less than 3/8th full (or even 1/4 full), giving some buffer zone, so I’m not immediately regrowing the table. In the example above, I wouldn’t shrink the table to 50 buckets until the number of full buckets was 36. This increases the average number of unused buckets, but it restores O(1) amortized cost for insertions and deletions.
There is another way that hash tables can complete degrade in performance, and that’s with bad hash functions. The design of good hash functions is really, really tricky. The problem is that hash functions have to serve several very strict masters.
First, hash functions have to be blindingly fast- because comparisons are already really fast. Especially for small sets, constant factors matter. Balanced binary trees work in O(log N) time (worst case- no small print, no footnotes, no gotchas). Which sounds worse, until you look at the constant factors. Integer and floating point compares are small constant times- being not much more than a clock cycle each. Even string compares are fast- string compares only have to find the first difference, meaning that they can often times quit well before reaching the end of the string, and tuned string compares can run at about 1 clock cycle per byte (less than that- I’d bet that a good MMX or SSE based string compare function would be able to do multiple bytes per clock). If the hash function is 10 times more expensive than the compare (on average), then for N less than about 1024, the binary tree search is faster! Say N=64, that means that on average I only have to do about 6 comparisons, each 1/10th the speed of a hash function- so doing 6 compares is still 40% faster than doing a single hash table lookup. Cache effects also come into play, changing the amortized cost of each operation, but I’ve seen this a lot, where for small enough sets, binary trees are faster than hash tables, simply because comparison is so much cheaper than doing a not-blindingly-fast hash function.
But it’s not enough for a hash function to be fast- it also has to be “good”. What does it mean for a hash function to be “good”? It means that for most common sets, there are very few collision- i.e. the hash table is not wanting to put lots of elements in the same bucket. I suppose that it’s possible for a hash table implementation to use a tree based structure for buckets, so in the worst case (all the strings end up in the same bucket) the hash table performance would degrade to that of a binary tree. But I’ve yet to encounter that in the wild- generally the assumption is “that’ll never happen” (famous last words, right up there with “they’ll never hit us at that range!”), so it’s not worthwhile to implement the extra complexity. Instead they use linked lists or arrays or similar ideas to handle collisions. But this means that if, for some set of elements, the hash function doesn’t distribute those elements across enough different hash buckets, performance will degrade to that horrid O(N) performance again. Collisions also ruin memory utilization- if I have 70 elements in 100 buckets, but those 70 elements hash to only 4 buckets (and yes, Virginia, I’ve seen collision rates this bad), then 96 of my buckets aren’t doing me any good at, I’m just wasting that space.
There are two times when this can happen: malice and accident. Malice is the obvious one- someone is out to get your program, and make it run really really slow, by carefully chosing their inputs so that your hash algorithm has large numbers of collisions. There are three things I can think of to protect against malicious attacks- 1) make collisions less expensive (use a tree-based bucket structure), 2) vary your hash funciton some way for different hash functions for each run (for example, adding a salt value) and hope the cracker can’t predict how your vary the hash functions or defeate the hash functions anyways, or 3) use a cryptographically secure hash function. Cryptographically secure hash algorithms are slow, in terms of computation time- and remember, constant factors are important.
A bigger danger is accidental collisions. I’ve never personally been bit by a malicous attack on one of my hash tables, but I have been bit by hash functions which were insufficiently random and data that was regular in a way that the hash function didn’t muddle enough, causing huge numbers of collisions.
I’ll give an example of this problem, using a current, popular hash function- FNV. The core operation of FNV is multiplication by a large prime. Multiplication is a good instruction for hash functions, it’s reasonbly cheap, and it allows a single bit of the input to affect a number of bits in the output. But the flaw of FNV is that a bit of the input of a multiply will only affect bits of equal or higher order of the output. There is no way for a higher order bit to “reach down” and affect lower order bits. This means that the low order bits of the hash value are completely determined by the low order bits of the input. For example, the strings “123456789″, “ABCDEFGHI”, “QRSTUVWXY”, “abcdefghi”, and “qrstuvwxy” all have the same low 4 bits- because all the characters all have the same low four bits. Adding a salt value to change the initial state of the hash value doesn’t help, the algorithm itself is broken.
Making the hash table size a prime helps some, but it doesn’t solve the problem, and it creates new problems. First of all, it basically requires you to use the modulo operation- a fairly expensive operation. Second, most allocation libraries want to allocate powers of two- so you have increased memory wastage. A small amount, if the implementation is intelligent enough to use primes slightly smaller than a power of two, a lot of memory wastage if they use primes slightly larger than a power of two, which then gets rounded up to the next larger power of two. So if an implementation were to use, for example, 65537 buckets (one more than 216), the memory allocator would likely round this up and allocate enough room for 131,072 buckets, wasting almost half (65,535 buckets worth) of the memory.
Modulo a prime number doesn’t solve the problem, it doesn’t save weak hash functions. The problem is that if x = y, then x mod p = y mod p, for all (positive integers) p. What this means is that you can get linear congruences- if you’re generating sequences, once (if) the sequences start colliding, they’ll keep colliding. This is likely to happen if you’re not hashing strings, but hashing ints- for example, object addresses or object identifiers, where if you see X, you’re likely to also see X+i. Which means when you get two values X and Y such that hash(X) = hash(Y), you’re likely to continue to collide, over and over. The modulo won’t save you. It does allow higher order bits to “wrap around” and affect lower order bits. So, for example, making your hash tables prime numbered in size helps FNV. But it’s not the case that prime number sized hash tables always save you. For example, if you use a variant of a linear congruential (pseudo-)random number generators as your hash function, they don’t help at all.
Choosing a hash function is seriously tricky. Simply because it’s popular doesn’t mean it’s good- FNV is used by Microsoft, and if you can’t trust Microsoft, who can you trust? (Yes, this is sarcasm). This is a point where having an understanding of the theory starts having a huge impact- cryptography, number theory, abstract algebra- all of these have a huge impact. For example, to avoid simple linear congruences, you probably want to pull operations from multiple different algebras, most notably integers modulo a power of two and polynomials in the Galois Field GF(2). Don’t know about Galois field polynomials? You should- they’re used in CRC check sums, lots of random number generators, and many crypto systems (including AES, the replacement for DES). A hash function doesn’t need to be cryptographically secure to need a lot of mathematics behind it.
This has gone on longer than normal even for one of my rants. Hopefully I’ve acheived my purpose and shown some of the “sharp edges” hash tables come with. And maybe, just maybe, you’ll be less inclined to beleive the hype about hash tables. And won’t be quite so surprised when “fast” hash tables are slow.
No related posts.
Pingback: Sp3w » Blog Archive » Linkage 2007.02.27
Pingback: Start Using Unordered Map « ASKLDJD
Pingback: What are some problems with hash tables? - Quora