Feb 25 2008
Problems with Hash Tables
So, for long involved reasons, I’m dealing with hash tables again- and discovering (again) that I don’t like them. Hash tables are beloved by programmers who don’t know their sharp corners and failure modes. “They’re fast!” I hear people say, “Constant time insert, find, and remove- what’s not to like?” A lot, as it turns out. To the point where my default is use a balanced binary tree instead of a hash table unless analysis shows that a hash table is inevitablity better, and even then I’m cautious.
The problem with hash tables is those pesky little foot notes that tend to get glossed over when introducing them to fresh faced programmers who have yet to learn the wisdom of Tom Waits, that the large print giveth and the small print taketh away. This is exactly the case with hash tables- those glossed over foot notes will savage you in the tender bits sooner or later, and turn what is supposed to be a cheap operation into an expensive one. And watching out for them makes hash tables a much harder data structure to use than, for example, balanced binary trees. My hope here is to shed some light on these gotchas.
In addition, this is also a wonderfull demonstation of my point from previous blog posts that computer science theory- and in this case, straight up advanced mathematics- are important to the workaday programmer. What relevance does number theory and abstract algebra have for such a basic data structure as hash tables? Quite a lot, it turns out.
Let me start with that “constant time” insert, remove, and search. The small print that taketh away here, that most people forget (or ignore), is the word “amortized”. Hash table insert, delete, and find are amortized constant time operations- that means, on average they’re constant time. One time in N, however, they are O(N). The expensive operations happen when you have to resize the hash table- either increasing or decreasing the number of buckets in the table. At which point you have to go through the entire hash table and reinspect every single entry to see if it moves to one of the new buckets or not. But since this only happens one time N, the average cost is N/N=1, so you remain in amortized constant time, according to the theoriticians. But this doesn’t help you if you can’t afford the occassional O(N) cost- for example, if you’re doing real-time or pseudo-real-time programming. If you have large hash tables in your program that change size, expect random long pauses as the hash table gets resized.
And this assumes that the hash table is aggressive enough in expanding to keep the probability of a hash table resize down to 1:N. I’ve met more than one hash table implementation that didn’t. The trick is that when you grow (or shrink) the table, to double (or halve) the number of buckets. This way the average (there’s that word again) number of times an element is inspected is constant. Say you’re inserting a million elements into an initially empty table. If you double the size of the table every resize, that means half the elements won’t have been inspected at all (they’ll have been inserted since the last resize), a quarter will have been inspected once, an eight twice, a sixteenth three times, etc. Run the numbers out, and you find that inserting a million elements means that about a million inspections have to be done- with a small number of elements (the ones inserted early) being inspected large numbers of times, and large numbers of elements being inspected a small number of times or not at all. But that the average number of inspections per object is about 1. Depending upon when you do it, it can be as high as 2, but no higher, or as low as 1, but not lower.
A popular implementation technique of hash tables is, rather than doubling (or halving) in size during resize, to just add or remove some constant number of buckets. And then still claim it’s O(1) cost for insert et. al. It’s a hash table, after all. At times like this, I understand why the zen masters tended to simply hit their students upside the head with a stick when they said something stupid. Maybe the student would be enlightened and maybe they wouldn’t, but at least the master would feel better. The problem is that doing this “trick” turns inserting from an O(1) amortized cost to an O(N) amortized cost! The logic goes like this: set k = 1000, and look at inserting a million elements. Now, instead of resizing about 20 times (like we would have if we doubled every resize), we’ve resized a thousand times. We have a thousand elements (the first thousand elements) that have been resized a thousand times, another thousand elements (the second thousand elements) resized 999 times, etc. Run the numbers out, and you’ll find that they sum to about 500 million inspections have taken place for our million objects, or five hundred inspections on average per object. Worse yet, this rate of inspections grows linearly with the number of objects we put into the hash table- put a billion objects in, and now we’re doing about five hundred *thousand* inspections per object on average. This is because you have to do about a million resizings, instead of about 30.
That’s because this implementation has insert (and remove) costs of O(N/K), where K is a constant. In other words, insert and remove are O(N) operations now. This is what I meant about the small print taketh away. You might think that not being quite so profligate with memory would be a good thing- except this “minor” change to the algorithm totally destroys it’s speed advantage.
It’s also possible, if the implementation isn’t carefull, to constantly be resizing the hash table- this is another problem I’ve encountered with hash tables. For example, assume that the hash table implementation shrinks the table if the number of full buckets drops below half, and grows the table if it becomes over full. If the table starts out with 100 buckets, and the number of buckets falls to 49, the size is shrunk to 50. But then two new elements are added, bringing the number of elements to 51, and requiring the table to be grown to 100 again. Remove two elements,
and the table shrinks to 49 elements again. Note that this only happens at “magic values”- if the number of elements is varying between, say, 73 and 75, there would be no resizing, as elements are added and removed. This is a problem- normal testing might not hit the magic number just right, and thus make the hash table look fast- until, generally some time in production, the magic number is hit and you’re constantly resizing the table and experiencing O(N) behavior, instead of the expected O(1) behavior. When I implement hash tables, I generally don’t shrink the table until it’s less than 3/8th full (or even 1/4 full), giving some buffer zone, so I’m not immediately regrowing the table. In the example above, I wouldn’t shrink the table to 50 buckets until the number of full buckets was 36. This increases the average number of unused buckets, but it restores O(1) amortized cost for insertions and deletions.
There is another way that hash tables can complete degrade in performance, and that’s with bad hash functions. The design of good hash functions is really, really tricky. The problem is that hash functions have to serve several very strict masters.
First, hash functions have to be blindingly fast- because comparisons are already really fast. Especially for small sets, constant factors matter. Balanced binary trees work in O(log N) time (worst case- no small print, no footnotes, no gotchas). Which sounds worse, until you look at the constant factors. Integer and floating point compares are small constant times- being not much more than a clock cycle each. Even string compares are fast- string compares only have to find the first difference, meaning that they can often times quit well before reaching the end of the string, and tuned string compares can run at about 1 clock cycle per byte (less than that- I’d bet that a good MMX or SSE based string compare function would be able to do multiple bytes per clock). If the hash function is 10 times more expensive than the compare (on average), then for N less than about 1024, the binary tree search is faster! Say N=64, that means that on average I only have to do about 6 comparisons, each 1/10th the speed of a hash function- so doing 6 compares is still 40% faster than doing a single hash table lookup. Cache effects also come into play, changing the amortized cost of each operation, but I’ve seen this a lot, where for small enough sets, binary trees are faster than hash tables, simply because comparison is so much cheaper than doing a not-blindingly-fast hash function.
But it’s not enough for a hash function to be fast- it also has to be “good”. What does it mean for a hash function to be “good”? It means that for most common sets, there are very few collision- i.e. the hash table is not wanting to put lots of elements in the same bucket. I suppose that it’s possible for a hash table implementation to use a tree based structure for buckets, so in the worst case (all the strings end up in the same bucket) the hash table performance would degrade to that of a binary tree. But I’ve yet to encounter that in the wild- generally the assumption is “that’ll never happen” (famous last words, right up there with “they’ll never hit us at that range!”), so it’s not worthwhile to implement the extra complexity. Instead they use linked lists or arrays or similar ideas to handle collisions. But this means that if, for some set of elements, the hash function doesn’t distribute those elements across enough different hash buckets, performance will degrade to that horrid O(N) performance again. Collisions also ruin memory utilization- if I have 70 elements in 100 buckets, but those 70 elements hash to only 4 buckets (and yes, Virginia, I’ve seen collision rates this bad), then 96 of my buckets aren’t doing me any good at, I’m just wasting that space.
There are two times when this can happen: malice and accident. Malice is the obvious one- someone is out to get your program, and make it run really really slow, by carefully chosing their inputs so that your hash algorithm has large numbers of collisions. There are three things I can think of to protect against malicious attacks- 1) make collisions less expensive (use a tree-based bucket structure), 2) vary your hash funciton some way for different hash functions for each run (for example, adding a salt value) and hope the cracker can’t predict how your vary the hash functions or defeate the hash functions anyways, or 3) use a cryptographically secure hash function. Cryptographically secure hash algorithms are slow, in terms of computation time- and remember, constant factors are important.
A bigger danger is accidental collisions. I’ve never personally been bit by a malicous attack on one of my hash tables, but I have been bit by hash functions which were insufficiently random and data that was regular in a way that the hash function didn’t muddle enough, causing huge numbers of collisions.
I’ll give an example of this problem, using a current, popular hash function- FNV. The core operation of FNV is multiplication by a large prime. Multiplication is a good instruction for hash functions, it’s reasonbly cheap, and it allows a single bit of the input to affect a number of bits in the output. But the flaw of FNV is that a bit of the input of a multiply will only affect bits of equal or higher order of the output. There is no way for a higher order bit to “reach down” and affect lower order bits. This means that the low order bits of the hash value are completely determined by the low order bits of the input. For example, the strings “123456789″, “ABCDEFGHI”, “QRSTUVWXY”, “abcdefghi”, and “qrstuvwxy” all have the same low 4 bits- because all the characters all have the same low four bits. Adding a salt value to change the initial state of the hash value doesn’t help, the algorithm itself is broken.
Making the hash table size a prime helps some, but it doesn’t solve the problem, and it creates new problems. First of all, it basically requires you to use the modulo operation- a fairly expensive operation. Second, most allocation libraries want to allocate powers of two- so you have increased memory wastage. A small amount, if the implementation is intelligent enough to use primes slightly smaller than a power of two, a lot of memory wastage if they use primes slightly larger than a power of two, which then gets rounded up to the next larger power of two. So if an implementation were to use, for example, 65537 buckets (one more than 216), the memory allocator would likely round this up and allocate enough room for 131,072 buckets, wasting almost half (65,535 buckets worth) of the memory.
Modulo a prime number doesn’t solve the problem, it doesn’t save weak hash functions. The problem is that if x = y, then x mod p = y mod p, for all (positive integers) p. What this means is that you can get linear congruences- if you’re generating sequences, once (if) the sequences start colliding, they’ll keep colliding. This is likely to happen if you’re not hashing strings, but hashing ints- for example, object addresses or object identifiers, where if you see X, you’re likely to also see X+i. Which means when you get two values X and Y such that hash(X) = hash(Y), you’re likely to continue to collide, over and over. The modulo won’t save you. It does allow higher order bits to “wrap around” and affect lower order bits. So, for example, making your hash tables prime numbered in size helps FNV. But it’s not the case that prime number sized hash tables always save you. For example, if you use a variant of a linear congruential (pseudo-)random number generators as your hash function, they don’t help at all.
Choosing a hash function is seriously tricky. Simply because it’s popular doesn’t mean it’s good- FNV is used by Microsoft, and if you can’t trust Microsoft, who can you trust? (Yes, this is sarcasm). This is a point where having an understanding of the theory starts having a huge impact- cryptography, number theory, abstract algebra- all of these have a huge impact. For example, to avoid simple linear congruences, you probably want to pull operations from multiple different algebras, most notably integers modulo a power of two and polynomials in the Galois Field GF(2). Don’t know about Galois field polynomials? You should- they’re used in CRC check sums, lots of random number generators, and many crypto systems (including AES, the replacement for DES). A hash function doesn’t need to be cryptographically secure to need a lot of mathematics behind it.
This has gone on longer than normal even for one of my rants. Hopefully I’ve acheived my purpose and shown some of the “sharp edges” hash tables come with. And maybe, just maybe, you’ll be less inclined to beleive the hype about hash tables. And won’t be quite so surprised when “fast” hash tables are slow.
Popularity: 42% [?]







I hate to break it to you, but there is no such thing as an “amoritized” operation. Even the spell checker in this comment box barfs on that.
I’m pretty fond of the skiplist datastructure. Its fast in most or all of the ways that matter, and objects are stored in order, which can be very useful in many applications. Its also elegantly simple.
Not sure where you get the figure that “compares are 10 times faster than hash functions” - it depends on the hash function, some of which operate on bytes but others operate on words. Also, memory access times are usually the governing factor, rather than CPU cycles, complicating any such assertion.
A well-crafted Hashtable will create a new hash function (or seed) and rehash in the case of excessive collisions. See cuckoo-hashing and robin-hood hashing for examples of such hashtables.
At any rate, Hashtables are a probabilistic data structure - you are free to argue the corner cases, but those cases are guaranteed to occur with vanishingly low probabilities.
I’m looking at writing a hash table implementation so this is pretty useful info!
You have two good points here: worst-case performance is bad, and growing/shrinking across an arbitrary boundary can cause resize thrashing. Users of hash tables should be aware of these points.
However, most of your complaints concern bad implementations. This is silly, since there are high-quality open source implementations of hashes and hash functions available. Put together the hash table implementation from kazlib and the lookup3 hash function and you no longer have any implementation issues to deal with. Though you may want to adjust kazlib’s lowmark calculations if you want to trade of memory use vs. the possibility of resize thrashing.
Good post. The other problem with hashtables is that it is much tougher to build a non-blocking hashtable that doesn’t make everyone wait when it resizes.
(Cliff Click has a good discussion of it on his blog)
Great post - I couldn’t agree more.
Personally, I’m an avid fan of red-black trees, over say AVL trees. Bounded cost update behaviour for a very slight increase in average tree height.
T-Trees are nice too to increase pointer density (only important if you really have lots of entries), but are based on an AVL tree models in the literature, although, I have a back burner project to rebuild my T-Tree template using a red-black tree model instead. I’ve never got around to it since it’s harder to reason about the space/time cost trade off to justify the implementation effort.
Also on the subject of red-black trees, Sedgewick has just put up a paper on left-leaning red-blank trees (it’s linked on reddit or you can google for it) which reduces the code complexity even further, which is another nail in the hash-table coffin. In short, there are few excuses left not to use a balanced binary tree implementation.
Minor spelling nit - it’s amortized (or amoritised to us NZers), not amoritized.
Good article! Bit of an abrupt end though - can you give us some links about the Galois field, or FNV? Also, I’m left wondering why you prefer b+ trees; what are the advantages and how do they work?
Sheesh! - Trust me to post a spelling nit, and then make two typos myself (red-blank tress and amoritised). Boy do I feel stupid :-)
And just to clarify my point about bounded update cost between AVL trees and red-black trees:
It’s the actual tree balancing operation after the initial insert or delete that is O(log N) for AVL trees since it can require rebalancing rotations all the way to the root, but it is O(1) (i.e. effectively constant) for red-black trees since it requires at most 3 (from memory) balancing rotations.
Now, updates need an O(log N) find operation (usually) to determine the update position anyway so overall, they’re both effectively O(log N) but the constant is lower for red-black trees.
Very insightful post — thanks for sharing.
However, I feel like the problems you describe are not the case with things like the STL map (http://www.cprogramming.com/tutorial/stl/stlmap.html), or any hashed container in any (mature) language — I cannot possibly believe such a standardized data structure would suffer from the weaknesses you describe. Or am I being too naive ?
I guess hash tables do have a lot of gotchas, but not sure whether they are hit that often.
Perhaps it is a good idea to procastrinate this, making a interface to something that maps out strings (or whatever the keys are) to memory locations, is required to take care of itself. Just require a HashLikeElem *HashLike_makeelem(string, value),
HashHandle *Hashlike_getelem(string),
void HashLike_removeelem(HashHandle* elem),
Item HashHandle_item(HashHandle)
That way, one can just change how this behavior is implemented at any time. Using a handle gives more flexibility, for instance, not having to search something again when removing. Of course, the interface to the ‘HashLike’ might restrict something.
Of course, using interfaces of stuff is usually a good idea. (Unless expendience is required.)
Ordinary developers shouldn’t have to know all that about hash tables to use a library implementation. Just do it. Measure or profile your application, for heaven’s sake. If there are performance issues, you’ll see them. If you don’t see them, they aren’t there.
I was particularly amused by your million-element table example. A balanced tree table will require 20 comparisons on average to find, insert or delete an element in a table that size. A hash function has to be pretty bad to perform that poorly.
You mean “amortized”, but you’ve written “amoritized” throughout. Otherwise, nice article.
Good post, but a couple comments:
* Trees also have the benefit of coming with a cheap ordered traversal built in, which hash tables do not.
* A good hash table implementation can exhibit better locality of reference than a tree. Assuming no collisions, it’s possible to hit one cache line per reference rather than log n.
* With an accurate estimated initial size, you can eliminate hash resizing. Also, hash resizing is only an issue for inserts and possibly deletes. If you do most of your inserts ahead of time, which isn’t necessarily an uncommon scenario, it’s not an issue during most of your software’s execution.
* If you’re really concerned about the periodic O(n) resize hit, it’s possible to avoid it by incrementally resizing your hash on each insert.
* if you’re interested in high degrees of concurrency, hash tables seem to make it easier to implement finer grained locking than trees.(ie: if you need to rebalance at the root of your tree, you need to lock the entire tree.) It seems like you could do something with per-node locking, but that seems likely to be expensive.
Honestly, I see stuff like this and the best approach, as almost always, seems to be to abstract away the structures behind a nice API, and then profile/replace when you find performance issues.
Hi,
Excellent post. There really is no substitute for knowing what one is doing, in programming or anything else.
A point I would like to make is that purely from a common-sense point of view, trees & tries beat hash tables hands down for storing a lexicon - if most lookups are expected to succeed, hash tables have to compute the hash and then do a linear search by comparison. Trees only need log(N) comparison which, as you pointed out, is much faster.
Otoh, for cases where more lookups are expected to fail, hash tables should be better provided the hash function is sensible (I like Hsieh’s hash function).
I confirmed these in my app last week - a lexicon of ~100K short strings was searched much faster with an STL map (RB tree).
I think you mean “amortized” O(1) time. Only one “i”.
you forget to include in your analysis of the cost of cache misses. http://judy.sourceforge.net/
I hate to post off topic, but honestly, who cares if he made a spelling mistake?
Grrate poste!
[...] Problems with hash (tables) What relevance does number theory and abstract algebra have for such a basic data structure as hash tables? Quite a lot, it turns out. [...]
Some generalized responses:
As it seems to have bothered people, I’ve fixed the spelling on amortized, at least where I saw it. Editing on the web sucks (’cmon, people- search and replace? Over the whole text box, not just the portion on the screen? Is that so much to ask?).
Dealing with the cost of cache misses greatly increases the complexity of the cost analysis. One thing to remember: it’s not just the cost of the cache misses of the data, it’s also the cost of the cache misses of the code. As a general rule, more code = more complexity = more cache misses, but these are not linear relationships (chaotic, more like). One of the things I like about binary trees is that they’re simple. Which means both that they’re much less likely to be wrong, and the code tends to be smaller as well.
If I said B-trees, I really meant Binary trees- the classic root node, two subtrees, and some balancing information. Weight balanced, height balanced, red-black trees- they all have their pluses and minuses.
The end was rather abrupt- the problem was that I realized that I had just opened a can of worms that would require more verbage than I’d already written. Do I write another Galois-field introduction? Also, the glaring hole was what hash function do I use? Up until recently, I generally used a linear feedback shift register- which, while it has it’s own linear recurrances, these tend to be less common in the wild than with modulo-based hash functions. I have some ideas on a new hash function, but I want to make sure it’s a decent one before publishing it (to limit the amount of crow I’ll be required to consume).
Now to more specific responses:
Ian Clark: I like skip lists too, at least as a mutable data structure. But notice that they have O(log N) bounds as well- and binary trees keep elements in sorted order as well (and weight balanced trees allow you to get the i-th element in O(log N)).
Damien Morton: I said if the hash function is 10x slower than the compare function. Some hash functions (and some compare functions) are going to be more or less expensive. For example, cryptographically secure hash functions tend to be way more expensive. But my experience has been that 10x is about average for hash function costs vr.s compare function costs, and maybe even generous to the hash function. Consider that 1) the compare function doesn’t need to compare the whole structures, it can stop the first time it finds a difference, 2) most compare functions work on words as well, not just bytes, and 3) compare is a very cheap operation (basically a subtraction). Vr.s the multiplications and modulos and table lookups commonly found in hash functions.
Sriram Srinivasan: I was trying desperately to avoid mentioning purely applicative data structures, so I didn’t mention multithreading concerns, but you’re right: the thread behavior of tree based data structures is much nicer than that of hash table based data structures- for lots of reasons.
Leon Mergen: Given that a recent version of Visual Studio C++ used FNV, I wouldn’t gaurentee that simply because it’s a mature environment that hash tables are implemented correctly.
Jasper: I agree with this idea- except I’d go the other way. Start by using binary trees, and switch to hash tables when it’s been shown that binary trees are too slow. The reason for this is because binary trees have a much more predictable performance envelope. You want the failure mode to happen early and where you can see it. For example, operations on binary trees with a billion elements are only 3x slower than operations on binary trees with a thousand elements. It’s generally easy in testing to get situations where you have a thousand elements in the binary tree- if you observe that it’s 3x fast enough, you’re done- even if, from some insane congruence of events, the code gets a billion elements in the tree, it’s still fast enough. Hash tables are real fast- right up until they’re not. Also, the performance of binary trees is not heavily affected by the data in the binary tree- notice how many of my gotchas on hash tables are data-dependent.
Bob Foster: No, you won’t see the failure modes. Because the performance failures of hash tables are data or situation dependent. Which are very easy to miss in testing, and, in my experience, only rear their ugly heads out in the field. At 3AM Sunday morning. At the important customer’s site. In outer Mongolia. OK, I’m exagerating here, but not by much.
mschaef: I need to write a post about the multi-threaded potientials of hash tables vr.s trees. But by far the biggest advantage I see for trees is that I can implement them purely applicatively and have only one word of transactional memory associated with them. But that’s a radically different blog post.
Agreed with most of the points you made, but there do exist cases when I simply use the hash tables :) Moreover hash functions have no substitute for sure!
PS : If you have time, please look at my post on bloom filters
If you want to use hash tables in a real time setting you might be interested in this paper:
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF
It introduced linear hashing which is a great way to avoid having to rehash the whole data in one step. (you still have to resize a dynamic array, but then again, with binary trees you would have to allocate many small chunks of memory instead which can either lead to fragmentation or to the same performance characteristics for arena based allocation…)
Also if you are concerned about collisions, cuckoo hashing (or a generalization thereof) might be worth looking at. It’s very simple to implement and greatly reduces the probability of collisions.
Er… are you serious? Let me summarize your article: hash tables are worse than binary trees, because I’ve seen bad implementations of hash tables and I have a good implementation of balanced binary trees. “I’ve met more than one hash table implementation that didn’t.” No, seriously - I might have met an implementation of balanced binary trees that prints “the quick brown fox jumps over the lazy dog” and exits… can I use that as an argument?
The only thing that made sense was the O(N) resize that will occur once in a while. Someone already pointed out that you can spread that cost if it’s really that big.
Joshua Haberman already pointed out these problems, I just got pissed off badly enough that I thought a restating was necessary. I can see now… wanna be programmer decides not to use hash tables because he read somewhere they have a lot of problems.
The article should have focused either on the theoretical underpinnings of both hash tables and binary trees (as Bob Foster comments, hash tables win hands down), or on actual implementations. (In which case I still doubt binary trees would have won… balancing is *hard*.)
Finally… I *have* to comment on what Amit says. “There really is no substitute for knowing what one is doing, in programming or anything else.” Yeah. Well said. Too bad it’s shortly followed by “…hash tables have to compute the hash and then do a linear search by comparison”, huh?
Balancing trees is hard? Maybe with red-black trees, but maybe not even with them (left-leaning red-black trees are an interesting idea, and greatly simplify red-black tree balancing). But weight balanced and height balanced trees are hard to balance? Seriously?
In my 14 years a professional programmer, and 30-some years programming, I’ve yet to meet a hash table implementation that avoided all of the problems I outlined (including, I comment, the multiple implementations I’ve written). And, in that time, outside of some homework problems, I’ve yet to see a balanced tree implementation that didn’t get balancing right. Primarily because if you don’t get balancing right, this becomes obvious real quick- while if your hash function has a linear congruence causing a high rate collisions on certain common data patterns, you can easily miss that.
As for explaining the theoretical underpinnings of the two data structures, go read Knuth.
I find that usually when I want to use a hash table, it’s a case where the problem size is known in advance and fixed (or at least the maximum number of elements can be known). For example, given an array of known size, containing small data elements (say 16 bytes or less), I want to find groups of duplicates, and speed is of the utmost importance. Hashing happens to be the fastest solution to that problem, and implementing such a hash table (where you never have to worry about resizing) is quite simple (simpler than even the simplest possible binary tree).
In some cases, your input is already a hash value. For example, we use MD5 hashes to uniquely identify the contents of data records (some of which may be stored on disk, not in memory). We use this both for eliminating redundant records, but also as the unique identifier for storing references to specific records. Given that we already have a nice high-quality hash value computed for other reasons, placing information about the records in a hash table is a no-brainer.
Your argument against the FNV hash is a straw man. The specific recommendation from the authors of that hash for using a table with a smaller number of bits is to do XOR folding (shift the higher order bits down and XOR them with the low order bits). That invalidates your claim that FNV won’t affect the low order bits (at least if used as recommended). Now maybe the Microsoft hash implementation doesn’t do that, but you can’t say the authors of FNV didn’t warn them!
I agree that for general purpose use, a binary tree is better (and 99.5% of the time, I just use std::set or std::map for my purposes), but in some cases, hash tables are an excellent solution.
Criticizing hash table wasting space in favor of balanced binary tree is very strange. Consider a naive linear probe hash table of 32-bit integers with 50% load (fancy cuckoo hash table can work well up to 90% load). The empty bucket is 4 bytes, while the overhead of a tree node is way bigger. A half empty hash table with 1 million integers occupies 8 MB (wasting 4MB). A typical STL red-black tree node uses 32 bytes (color + parent + left + right) on 64-bit system *besides* data. So the same amount of data would take 20MB on a 32-bit system and 40 MB on 64-bit system. A compact red-black tree (that uses least significant bit of one of the pointers to encode color, assuming pointer values are aligned) would use at least 12MB on a 32-bit system and 24MB on a 64-bit system.
There are plenty of fast non-linear hash functions you can use. The FNV example you mentioned is lame (doesn’t match actual recommended implementation with xor ops). Both Hsieh’s and Jenkins hashes are strongly resistant to collisions and can take a random initial seed, which makes collision attacks practically impossible.
Regarding speed. Properly implemented hash tables are typically much faster. here are some numbers: http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html
Regarding ease of implementation of left leaning red-black tree. Jason Evans doesn’t seem agree with you after actually implemented one:
http://www.canonware.com/~ttt/2008/04/left-leaning-red-black-trees-are-hard.html
Without seeing any of your code, I’d believe Jason Evans more, given his contribution to FreeBSD libc, specifically a high performance malloc implementation.
My recommendation is simple: if you want a dynamic ordered set/map, use a good red-black tree implementation. If you want an unordered set/map, use a linear probe hash table (for example, Google’s dense_hash_set/map for speed and sparse_hash_set/map for space advantages) with Hsieh and Jenkin’s hash with a random seed.
Your arguments would be a lot more interesting if you posted working code and benchmark numbers.
steven mentioned linear hashing, and it bears repeating. Linear hashing means you only have to rehash one element on each insert/delete, so you get *bounded* constant (not just expected constant) time for each operation. If you use a segmented array with geometrically sized segments, then growing or shrinking the backing store requires no copying between the old and new stores, so it too has bounded constant time. The first array slot points to a segment with 16 slots, the next slot points to a segment of 32 slots, the next segment is 64, etc. Then growing the table by a factor of 2 means allocating a new segment and dropping a pointer in the backing store. No realloc()/memcpy() is needed, so the grow operation is O(1) (assuming malloc() is pretty fast), which is obviously way better than the O(N) you claimed. Only the “fresh faced” programmers you mocked in your opening have O(N) worst case hash tables. This scheme is no harder than balanced binary trees.
Nice post Brian!
Just one comment: setting an upper threshhold of 3/4 capacity (which is typical) and a lower threshold of 3/8 capacity would seem to give you the thrashing you are complaining about.
For example, if my table capacity is 64 and the current number of entries reaches 49 (now greater than 3/4 * 64), my table will double to a capacity of 128. Then if I remove two entries the current number will drop to 47 (now less than 3/8 * 128) and the table will resize back to 64.
Using 1/4 for the lower threshold would seem to give you the hysteresis you need to avoid the thrashing. Does this make sense?