<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Problems with Hash Tables</title>
	<atom:link href="http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/feed/" rel="self" type="application/rss+xml" />
	<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/</link>
	<description>Robert Fischer and Brian Hurt on Punditry, Programming Languages, and Other Religious Issues</description>
	<pubDate>Thu, 28 Aug 2008 19:27:55 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.1</generator>
		<item>
		<title>By: Kenny</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-33153</link>
		<dc:creator>Kenny</dc:creator>
		<pubDate>Wed, 07 May 2008 04:49:57 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-33153</guid>
		<description>steven mentioned linear hashing, and it bears repeating.  Linear hashing means you only have to rehash one element on each insert/delete, so you get *bounded* constant (not just expected constant) time for each operation.  If you use a segmented array with geometrically sized segments, then growing or shrinking the backing store requires no copying between the old and new stores, so it too has bounded constant time.  The first array slot points to a segment with 16 slots, the next slot points to a segment of 32 slots, the next segment is 64, etc.  Then growing the table by a factor of 2 means allocating a new segment and dropping a pointer in the backing store.  No realloc()/memcpy() is needed, so the grow operation is O(1) (assuming malloc() is pretty fast), which is obviously way better than the O(N) you claimed.  Only the "fresh faced" programmers you mocked in your opening have O(N) worst case hash tables.  This scheme is no harder than balanced binary trees.</description>
		<content:encoded><![CDATA[<p>steven mentioned linear hashing, and it bears repeating.  Linear hashing means you only have to rehash one element on each insert/delete, so you get *bounded* constant (not just expected constant) time for each operation.  If you use a segmented array with geometrically sized segments, then growing or shrinking the backing store requires no copying between the old and new stores, so it too has bounded constant time.  The first array slot points to a segment with 16 slots, the next slot points to a segment of 32 slots, the next segment is 64, etc.  Then growing the table by a factor of 2 means allocating a new segment and dropping a pointer in the backing store.  No realloc()/memcpy() is needed, so the grow operation is O(1) (assuming malloc() is pretty fast), which is obviously way better than the O(N) you claimed.  Only the &#8220;fresh faced&#8221; programmers you mocked in your opening have O(N) worst case hash tables.  This scheme is no harder than balanced binary trees.</p>
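<p>A minimal sketch of the segmented backing store described above, assuming a first segment of 16 slots and a hypothetical class name, <code>SegmentedArray</code>. It only shows the index arithmetic and that growing never copies existing segments; it is not a full hash table.</p>
<pre>
```python
class SegmentedArray:
    """Growable array made of geometrically sized segments (16, 32, 64, ...)."""

    FIRST = 16  # size of the first segment, as in the comment above

    def __init__(self):
        self.segments = []  # directory: one entry per allocated segment

    def capacity(self):
        # Sizes 16 + 32 + ... + 16 * 2**(k - 1) sum to 16 * (2**k - 1).
        return self.FIRST * (2 ** len(self.segments) - 1)

    def grow(self):
        # Allocate one new segment; existing segments are never copied.
        size = self.FIRST * 2 ** len(self.segments)
        self.segments.append([None] * size)

    def locate(self, index):
        # Map a flat index to (segment number, offset inside that segment).
        seg = (index // self.FIRST + 1).bit_length() - 1
        offset = index - self.FIRST * (2 ** seg - 1)
        return seg, offset

    def __getitem__(self, index):
        seg, offset = self.locate(index)
        return self.segments[seg][offset]

    def __setitem__(self, index, value):
        seg, offset = self.locate(index)
        self.segments[seg][offset] = value
```
</pre>
<p>In this Python sketch the new segment is still initialized slot by slot, so the O(1) claim only carries over to a setting where allocation does not touch the memory, as the malloc() assumption above implies.</p>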
]]></content:encoded>
	</item>
	<item>
		<title>By: vicaya</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-33116</link>
		<dc:creator>vicaya</dc:creator>
		<pubDate>Fri, 02 May 2008 07:02:08 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-33116</guid>
		<description>Criticizing hash tables for wasting space, in favor of balanced binary trees, is very strange. Consider a naive linear-probe hash table of 32-bit integers at 50% load (a fancy cuckoo hash table can work well up to 90% load). An empty bucket is 4 bytes, while the overhead of a tree node is way bigger. A half-empty hash table with 1 million integers occupies 8 MB (wasting 4 MB).  A typical STL red-black tree node uses 32 bytes (color + parent + left + right) on a 64-bit system *besides* data. So the same amount of data would take 20 MB on a 32-bit system and 40 MB on a 64-bit system. A compact red-black tree (that uses the least significant bit of one of the pointers to encode color, assuming pointer values are aligned) would use at least 12 MB on a 32-bit system and 24 MB on a 64-bit system.

There are plenty of fast non-linear hash functions you can use. The FNV example you mentioned is lame (it doesn't match the actual recommended implementation, which uses XOR ops). Both Hsieh's and Jenkins' hashes are strongly resistant to collisions and can take a random initial seed, which makes collision attacks practically impossible.

Regarding speed: properly implemented hash tables are typically much faster. Here are some numbers: http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html

Regarding the ease of implementing a left-leaning red-black tree: Jason Evans doesn't seem to agree with you after actually implementing one:
http://www.canonware.com/~ttt/2008/04/left-leaning-red-black-trees-are-hard.html

Without seeing any of your code, I'd believe Jason Evans more, given his contributions to FreeBSD libc, specifically a high-performance malloc implementation.

My recommendation is simple: if you want a dynamic ordered set/map, use a good red-black tree implementation. If you want an unordered set/map, use a linear probe hash table (for example, Google's dense_hash_set/map for speed and sparse_hash_set/map for space advantages) with Hsieh's or Jenkins' hash and a random seed.

Your arguments would be a lot more interesting if you posted working code and benchmark numbers.</description>
		<content:encoded><![CDATA[<p>Criticizing hash tables for wasting space, in favor of balanced binary trees, is very strange. Consider a naive linear-probe hash table of 32-bit integers at 50% load (a fancy cuckoo hash table can work well up to 90% load). An empty bucket is 4 bytes, while the overhead of a tree node is way bigger. A half-empty hash table with 1 million integers occupies 8 MB (wasting 4 MB).  A typical STL red-black tree node uses 32 bytes (color + parent + left + right) on a 64-bit system *besides* data. So the same amount of data would take 20 MB on a 32-bit system and 40 MB on a 64-bit system. A compact red-black tree (that uses the least significant bit of one of the pointers to encode color, assuming pointer values are aligned) would use at least 12 MB on a 32-bit system and 24 MB on a 64-bit system.</p>
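<p>The arithmetic behind those figures can be reproduced with a rough model. This is a sketch under stated assumptions, not a measurement: the helper names are hypothetical, the color field is assumed to occupy a full pointer-sized word in the STL-style node, the compact node is assumed to keep only left and right pointers, and every node is padded to pointer alignment. Allocator overhead is ignored.</p>
<pre>
```python
def padded(size, align):
    """Round size up to a multiple of align (typical struct padding)."""
    return -(-size // align) * align


def table_bytes(n, key_bytes=4, load=0.5):
    # Open addressing stores keys inline; slot count is n / load.
    return int(n / load) * key_bytes


def tree_bytes(n, pointer_bytes, pointers, color_word, key_bytes=4):
    # Per node: `pointers` links, an optional separate color word, the key.
    color = pointer_bytes if color_word else 0
    raw = pointers * pointer_bytes + color + key_bytes
    return n * padded(raw, pointer_bytes)


N = 1_000_000
sizes = {
    "linear-probe table, 50% load": table_bytes(N),
    "STL-style node, 32-bit": tree_bytes(N, 4, 3, True),
    "STL-style node, 64-bit": tree_bytes(N, 8, 3, True),
    "compact node, 32-bit": tree_bytes(N, 4, 2, False),
    "compact node, 64-bit": tree_bytes(N, 8, 2, False),
}
for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.0f} MB")
```
</pre>
<p>Under these assumptions the model gives 8 MB for the table and 20, 40, 12 and 24 MB for the four tree layouts, matching the numbers quoted above.</p>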
<p>There are plenty of fast non-linear hash functions you can use. The FNV example you mentioned is lame (it doesn&#8217;t match the actual recommended implementation, which uses XOR ops). Both Hsieh&#8217;s and Jenkins&#8217; hashes are strongly resistant to collisions and can take a random initial seed, which makes collision attacks practically impossible.</p>
<p>Regarding speed: properly implemented hash tables are typically much faster. Here are some numbers: <a href="http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html" rel="nofollow">http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html</a></p>
<p>Regarding the ease of implementing a left-leaning red-black tree: Jason Evans doesn&#8217;t seem to agree with you after actually implementing one:<br />
<a href="http://www.canonware.com/~ttt/2008/04/left-leaning-red-black-trees-are-hard.html" rel="nofollow">http://www.canonware.com/~ttt/2008/04/left-leaning-red-black-trees-are-hard.html</a></p>
<p>Without seeing any of your code, I&#8217;d believe Jason Evans more, given his contributions to FreeBSD libc, specifically a high-performance malloc implementation.</p>
<p>My recommendation is simple: if you want a dynamic ordered set/map, use a good red-black tree implementation. If you want an unordered set/map, use a linear probe hash table (for example, Google&#8217;s dense_hash_set/map for speed and sparse_hash_set/map for space advantages) with Hsieh&#8217;s or Jenkins&#8217; hash and a random seed.</p>
<p>Your arguments would be a lot more interesting if you posted working code and benchmark numbers.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jason Hoerner</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-33099</link>
		<dc:creator>Jason Hoerner</dc:creator>
		<pubDate>Wed, 30 Apr 2008 21:09:24 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-33099</guid>
		<description>I find that usually when I want to use a hash table, it's a case where the problem size is known in advance and fixed (or at least the maximum number of elements can be known).  For example, given an array of known size, containing small data elements (say 16 bytes or less), I want to find groups of duplicates, and speed is of the utmost importance.  Hashing happens to be the fastest solution to that problem, and implementing such a hash table (where you never have to worry about resizing) is quite simple (simpler than even the simplest possible binary tree).

In some cases, your input is already a hash value.  For example, we use MD5 hashes to uniquely identify the contents of data records (some of which may be stored on disk, not in memory).  We use this both for eliminating redundant records and as the unique identifier for storing references to specific records.  Given that we already have a nice high-quality hash value computed for other reasons, placing information about the records in a hash table is a no-brainer.

Your argument against the FNV hash is a straw man.  The specific recommendation from the authors of that hash for using a table with a smaller number of bits is to do XOR folding (shift the higher order bits down and XOR them with the low order bits).  That invalidates your claim that FNV won't affect the low order bits (at least if used as recommended).  Now maybe the Microsoft hash implementation doesn't do that, but you can't say the authors of FNV didn't warn them!

I agree that for general purpose use, a binary tree is better (and 99.5% of the time, I just use std::set or std::map for my purposes), but in some cases, hash tables are an excellent solution.</description>
		<content:encoded><![CDATA[<p>I find that usually when I want to use a hash table, it&#8217;s a case where the problem size is known in advance and fixed (or at least the maximum number of elements can be known).  For example, given an array of known size, containing small data elements (say 16 bytes or less), I want to find groups of duplicates, and speed is of the utmost importance.  Hashing happens to be the fastest solution to that problem, and implementing such a hash table (where you never have to worry about resizing) is quite simple (simpler than even the simplest possible binary tree).</p>
<p>In some cases, your input is already a hash value.  For example, we use MD5 hashes to uniquely identify the contents of data records (some of which may be stored on disk, not in memory).  We use this both for eliminating redundant records and as the unique identifier for storing references to specific records.  Given that we already have a nice high-quality hash value computed for other reasons, placing information about the records in a hash table is a no-brainer.</p>
<p>Your argument against the FNV hash is a straw man.  The specific recommendation from the authors of that hash for using a table with a smaller number of bits is to do XOR folding (shift the higher order bits down and XOR them with the low order bits).  That invalidates your claim that FNV won&#8217;t affect the low order bits (at least if used as recommended).  Now maybe the Microsoft hash implementation doesn&#8217;t do that, but you can&#8217;t say the authors of FNV didn&#8217;t warn them!</p>
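<p>The XOR folding described above can be sketched as follows. This assumes the 32-bit FNV-1a variant (XOR the byte in, then multiply) with the published 32-bit offset basis and prime; the function names are hypothetical and the table is assumed to have a power-of-two number of slots.</p>
<pre>
```python
FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619


def fnv1a_32(data):
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) % 2 ** 32
    return h


def xor_fold(h, bits):
    """Reduce a 32-bit hash to `bits` bits by folding the high bits down."""
    return ((h >> bits) ^ h) % 2 ** bits


def bucket_index(key, bits):
    # Index into a table with 2**bits slots.
    return xor_fold(fnv1a_32(key), bits)
```
</pre>
<p>Taking only the low bits of a hash such as 0xFFFF0000 for a 16-bit table would give 0, while <code>xor_fold</code> gives 0xFFFF, so the high-order bits do influence the chosen bucket.</p>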
<p>I agree that for general purpose use, a binary tree is better (and 99.5% of the time, I just use std::set or std::map for my purposes), but in some cases, hash tables are an excellent solution.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Hurt</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32994</link>
		<dc:creator>Brian Hurt</dc:creator>
		<pubDate>Sun, 13 Apr 2008 21:59:57 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32994</guid>
		<description>Balancing trees is &lt;EM&gt;hard&lt;/EM&gt;?  Maybe with red-black trees, but maybe not even with them (left-leaning red-black trees are an interesting idea, and greatly simplify red-black tree balancing).  But weight balanced and height balanced trees are hard to balance?  Seriously?

In my 14 years as a professional programmer, and 30-some years programming, I've yet to meet a hash table implementation that avoided all of the problems I outlined (including, I might add, the multiple implementations I've written).  And, in that time, outside of some homework problems, I've yet to see a balanced tree implementation that didn't get balancing right.  Primarily because if you don't get balancing right, this becomes obvious real quick, while if your hash function has a linear congruence causing a high rate of collisions on certain common data patterns, you can easily miss that.

As for explaining the theoretical underpinnings of the two data structures, go read Knuth.</description>
		<content:encoded><![CDATA[<p>Balancing trees is <em>hard</em>?  Maybe with red-black trees, but maybe not even with them (left-leaning red-black trees are an interesting idea, and greatly simplify red-black tree balancing).  But weight balanced and height balanced trees are hard to balance?  Seriously?</p>
<p>In my 14 years as a professional programmer, and 30-some years programming, I&#8217;ve yet to meet a hash table implementation that avoided all of the problems I outlined (including, I might add, the multiple implementations I&#8217;ve written).  And, in that time, outside of some homework problems, I&#8217;ve yet to see a balanced tree implementation that didn&#8217;t get balancing right.  Primarily because if you don&#8217;t get balancing right, this becomes obvious real quick, while if your hash function has a linear congruence causing a high rate of collisions on certain common data patterns, you can easily miss that.</p>
<p>As for explaining the theoretical underpinnings of the two data structures, go read Knuth.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Marcel Popescu</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32981</link>
		<dc:creator>Marcel Popescu</dc:creator>
		<pubDate>Thu, 10 Apr 2008 19:27:26 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32981</guid>
		<description>Er... are you serious? Let me summarize your article: hash tables are worse than binary trees, because I've seen bad implementations of hash tables and I have a good implementation of balanced binary trees. "I’ve met more than one hash table implementation that didn’t." No, seriously - I might have met an implementation of balanced binary trees that prints "the quick brown fox jumps over the lazy dog" and exits... can I use that as an argument?

The only thing that made sense was the O(N) resize that will occur once in a while. Someone already pointed out that you can spread that cost if it's really that big.

Joshua Haberman already pointed out these problems; I just got pissed off badly enough that I thought a restatement was necessary. I can see it now... a wannabe programmer decides not to use hash tables because he read somewhere that they have a lot of problems.

The article should have focused either on the theoretical underpinnings of both hash tables and binary trees (as Bob Foster comments, hash tables win hands down), or on actual implementations. (In which case I still doubt binary trees would have won... balancing is *hard*.)

Finally... I *have* to comment on what Amit says. "There really is no substitute for knowing what one is doing, in programming or anything else." Yeah. Well said. Too bad it's shortly followed by "...hash tables have to compute the hash and then do a linear search by comparison", huh?</description>
		<content:encoded><![CDATA[<p>Er&#8230; are you serious? Let me summarize your article: hash tables are worse than binary trees, because I&#8217;ve seen bad implementations of hash tables and I have a good implementation of balanced binary trees. &#8220;I’ve met more than one hash table implementation that didn’t.&#8221; No, seriously - I might have met an implementation of balanced binary trees that prints &#8220;the quick brown fox jumps over the lazy dog&#8221; and exits&#8230; can I use that as an argument?</p>
<p>The only thing that made sense was the O(N) resize that will occur once in a while. Someone already pointed out that you can spread that cost if it&#8217;s really that big.</p>
<p>Joshua Haberman already pointed out these problems; I just got pissed off badly enough that I thought a restatement was necessary. I can see it now&#8230; a wannabe programmer decides not to use hash tables because he read somewhere that they have a lot of problems.</p>
<p>The article should have focused either on the theoretical underpinnings of both hash tables and binary trees (as Bob Foster comments, hash tables win hands down), or on actual implementations. (In which case I still doubt binary trees would have won&#8230; balancing is *hard*.)</p>
<p>Finally&#8230; I *have* to comment on what Amit says. &#8220;There really is no substitute for knowing what one is doing, in programming or anything else.&#8221; Yeah. Well said. Too bad it&#8217;s shortly followed by &#8220;&#8230;hash tables have to compute the hash and then do a linear search by comparison&#8221;, huh?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: steven</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32944</link>
		<dc:creator>steven</dc:creator>
		<pubDate>Tue, 01 Apr 2008 15:00:02 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32944</guid>
		<description>If you want to use hash tables in a real time setting you might be interested in this paper:
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF
It introduced linear hashing which is a great way to avoid having to rehash the whole data in one step. (you still have to resize a dynamic array, but then again, with binary trees you would have to allocate many small chunks of memory instead which can either lead to fragmentation or to the same performance characteristics for arena based allocation...)

Also if you are concerned about collisions, cuckoo hashing (or a generalization thereof) might be worth looking at. It's very simple to implement and greatly reduces the probability of collisions.</description>
		<content:encoded><![CDATA[<p>If you want to use hash tables in a real time setting you might be interested in this paper:<br />
<a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF" rel="nofollow">http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF</a><br />
It introduced linear hashing which is a great way to avoid having to rehash the whole data in one step. (you still have to resize a dynamic array, but then again, with binary trees you would have to allocate many small chunks of memory instead which can either lead to fragmentation or to the same performance characteristics for arena based allocation&#8230;)</p>
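<p>A minimal sketch of the idea, assuming chained buckets, Python's built-in <code>hash()</code>, and a hypothetical class name and load threshold that do not come from the paper. Each insert splits at most one bucket, so the rehashing work is spread over many operations.</p>
<pre>
```python
class LinearHashSet:
    """Set of hashable keys using linear hashing: one bucket split at a time."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets   # bucket count at level 0
        self.level = 0              # number of completed doublings
        self.split = 0              # index of the next bucket to split
        self.max_load = max_load    # assumed average keys per bucket (1 or more)
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, key):
        h = hash(key)
        b = h % (self.n0 * 2 ** self.level)
        if self.split > b:
            # Bucket b was already split in this round: use the next level.
            b = h % (self.n0 * 2 ** (self.level + 1))
        return b

    def __contains__(self, key):
        return key in self.buckets[self._address(key)]

    def add(self, key):
        bucket = self.buckets[self._address(key)]
        if key in bucket:
            return
        bucket.append(key)
        self.count += 1
        if self.count > self.max_load * len(self.buckets):
            self._split_one()

    def _split_one(self):
        # Rehash only the keys of one bucket; every other bucket is untouched.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1
            self.split = 0
        for key in old:
            self.buckets[self._address(key)].append(key)
```
</pre>
<p>The cost of one split depends on the length of the bucket being split, so the per-operation bound holds only as long as the hash function keeps buckets short.</p>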
<p>Also if you are concerned about collisions, cuckoo hashing (or a generalization thereof) might be worth looking at. It&#8217;s very simple to implement and greatly reduces the probability of collisions.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sameer Agarwal</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32730</link>
		<dc:creator>Sameer Agarwal</dc:creator>
		<pubDate>Sun, 02 Mar 2008 06:13:46 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32730</guid>
		<description>Agreed with most of the points you made, but there do exist cases when I simply use the hash tables :) Moreover hash functions have no substitute for sure!
PS : If you have time, please look at my post on bloom filters</description>
		<content:encoded><![CDATA[<p>Agreed with most of the points you made, but there do exist cases when I simply use the hash tables :) Moreover hash functions have no substitute for sure!<br />
PS : If you have time, please look at my post on bloom filters</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32692</link>
		<dc:creator>Brian</dc:creator>
		<pubDate>Wed, 27 Feb 2008 23:33:05 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32692</guid>
		<description>Some generalized responses:

As it seems to have bothered people, I've fixed the spelling on amortized, at least where I saw it.  Editing on the web sucks ('cmon, people- search and replace?  Over the whole text box, not just the portion on the screen?  Is that so much to ask?).

Dealing with the cost of cache misses greatly increases the complexity of the cost analysis.  One thing to remember: it's not just the cost of the cache misses of the data, it's also the cost of the cache misses of the code.  As a general rule, more code = more complexity = more cache misses, but these are not linear relationships (chaotic, more like).    One of the things I like about binary trees is that they're &lt;EM&gt;simple&lt;/EM&gt;.  Which means both that they're much less likely to be wrong, and the code tends to be smaller as well.

If I said B-trees, I really meant Binary trees- the classic root node, two subtrees, and some balancing information.  Weight balanced, height balanced, red-black trees- they all have their pluses and minuses.

The end was rather abrupt- the problem was that I realized that I had just opened a can of worms that would require more verbiage than I'd already written.  Do I write another Galois-field introduction?  Also, the glaring hole was what hash function do I use?  Up until recently, I generally used a linear feedback shift register- which has its own linear recurrences, though these tend to be less common in the wild than those of modulo-based hash functions.  I have some ideas on a new hash function, but I want to make sure it's a decent one before publishing it (to limit the amount of crow I'll be required to consume).

Now to more specific responses:
Ian Clark: I like skip lists too, at least as a mutable data structure.  But notice that they have O(log N) bounds as well- and binary trees keep elements in sorted order as well (and weight balanced trees allow you to get the i-th element in O(log N)).

Damien Morton: I said &lt;EM&gt;if&lt;/EM&gt; the hash function is 10x slower than the compare function.  Some hash functions (and some compare functions) are going to be more or less expensive.  For example, cryptographically secure hash functions tend to be way more expensive.  But my experience has been that 10x is about average for hash function costs vs. compare function costs, and maybe even generous to the hash function.  Consider that 1) the compare function doesn't need to compare the whole structures, it can stop the first time it finds a difference, 2) most compare functions work on words as well, not just bytes, and 3) compare is a very cheap operation (basically a subtraction).  Versus the multiplications and modulos and table lookups commonly found in hash functions.

Sriram Srinivasan: I was trying desperately to avoid mentioning purely applicative data structures, so I didn't mention multithreading concerns, but you're right: the thread behavior of tree based data structures is much nicer than that of hash table based data structures- for lots of reasons.

Leon Mergen: Given that a recent version of Visual Studio C++ used FNV, I wouldn't guarantee that hash tables are implemented correctly simply because it's a mature environment.

Jasper: I agree with this idea- except I'd go the other way.  Start by using binary trees, and switch to hash tables when it's been shown that binary trees are too slow.  The reason is that binary trees have a much more predictable performance envelope.  You want the failure mode to happen early and where you can see it.  For example, operations on binary trees with a billion elements are only 3x slower than operations on binary trees with a thousand elements.    It's generally easy in testing to get situations where you have a thousand elements in the binary tree- if you observe that it's fast enough with a 3x margin to spare, you're done- even if, from some insane congruence of events, the code gets a billion elements in the tree, it's still fast enough.  Hash tables are real fast- right up until they're not.  Also, the performance of binary trees is not heavily affected by the data in the binary tree- notice how many of my gotchas on hash tables are data-dependent.

Bob Foster: No, you won't see the failure modes.  Because the performance failures of hash tables are data or situation dependent.  Which are very easy to miss in testing, and, in my experience, only rear their ugly heads out in the field.  At 3AM Sunday morning.  At the important customer's site.  In Outer Mongolia.  OK, I'm exaggerating here, but not by much.

mschaef: I need to write a post about the multi-threaded potential of hash tables vs. trees.  But by far the biggest advantage I see for trees is that I can implement them purely applicatively and have only one word of transactional memory associated with them.  But that's a radically different blog post.</description>
		<content:encoded><![CDATA[<p>Some generalized responses:</p>
<p>As it seems to have bothered people, I&#8217;ve fixed the spelling on amortized, at least where I saw it.  Editing on the web sucks (&#8217;cmon, people- search and replace?  Over the whole text box, not just the portion on the screen?  Is that so much to ask?).</p>
<p>Dealing with the cost of cache misses greatly increases the complexity of the cost analysis.  One thing to remember: it&#8217;s not just the cost of the cache misses of the data, it&#8217;s also the cost of the cache misses of the code.  As a general rule, more code = more complexity = more cache misses, but these are not linear relationships (chaotic, more like).    One of the things I like about binary trees is that they&#8217;re <em>simple</em>.  Which means both that they&#8217;re much less likely to be wrong, and the code tends to be smaller as well.</p>
<p>If I said B-trees, I really meant Binary trees- the classic root node, two subtrees, and some balancing information.  Weight balanced, height balanced, red-black trees- they all have their pluses and minuses.</p>
<p>The end was rather abrupt- the problem was that I realized that I had just opened a can of worms that would require more verbiage than I&#8217;d already written.  Do I write another Galois-field introduction?  Also, the glaring hole was what hash function do I use?  Up until recently, I generally used a linear feedback shift register- which has its own linear recurrences, though these tend to be less common in the wild than those of modulo-based hash functions.  I have some ideas on a new hash function, but I want to make sure it&#8217;s a decent one before publishing it (to limit the amount of crow I&#8217;ll be required to consume).</p>
<p>Now to more specific responses:<br />
Ian Clark: I like skip lists too, at least as a mutable data structure.  But notice that they have O(log N) bounds as well- and binary trees keep elements in sorted order as well (and weight balanced trees allow you to get the i-th element in O(log N)).</p>
<p>Damien Morton: I said <em>if</em> the hash function is 10x slower than the compare function.  Some hash functions (and some compare functions) are going to be more or less expensive.  For example, cryptographically secure hash functions tend to be way more expensive.  But my experience has been that 10x is about average for hash function costs vs. compare function costs, and maybe even generous to the hash function.  Consider that 1) the compare function doesn&#8217;t need to compare the whole structures, it can stop the first time it finds a difference, 2) most compare functions work on words as well, not just bytes, and 3) compare is a very cheap operation (basically a subtraction).  Versus the multiplications and modulos and table lookups commonly found in hash functions.</p>
<p>Sriram Srinivasan: I was trying desperately to avoid mentioning purely applicative data structures, so I didn&#8217;t mention multithreading concerns, but you&#8217;re right: the thread behavior of tree based data structures is much nicer than that of hash table based data structures- for lots of reasons.</p>
<p>Leon Mergen: Given that a recent version of Visual Studio C++ used FNV, I wouldn&#8217;t guarantee that hash tables are implemented correctly simply because it&#8217;s a mature environment.</p>
<p>Jasper: I agree with this idea- except I&#8217;d go the other way.  Start by using binary trees, and switch to hash tables when it&#8217;s been shown that binary trees are too slow.  The reason is that binary trees have a much more predictable performance envelope.  You want the failure mode to happen early and where you can see it.  For example, operations on binary trees with a billion elements are only 3x slower than operations on binary trees with a thousand elements.    It&#8217;s generally easy in testing to get situations where you have a thousand elements in the binary tree- if you observe that it&#8217;s fast enough with a 3x margin to spare, you&#8217;re done- even if, from some insane congruence of events, the code gets a billion elements in the tree, it&#8217;s still fast enough.  Hash tables are real fast- right up until they&#8217;re not.  Also, the performance of binary trees is not heavily affected by the data in the binary tree- notice how many of my gotchas on hash tables are data-dependent.</p>
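<p>The 3x figure follows from the logarithmic depth of a balanced tree. A rough check, assuming cost is proportional to the depth of a perfectly balanced tree and ignoring cache effects and the extra slack real balancing schemes allow (the function name is hypothetical):</p>
<pre>
```python
import math


def balanced_height(n):
    """Approximate depth of a perfectly balanced binary tree with n elements."""
    return math.log2(n)


small = balanced_height(1_000)           # about 10 levels
large = balanced_height(1_000_000_000)   # about 30 levels
ratio = large / small
print(f"{small:.1f} levels vs {large:.1f} levels, ratio {ratio:.2f}")
```
</pre>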
<p>Bob Foster: No, you won&#8217;t see the failure modes.  Because the performance failures of hash tables are data or situation dependent.  Which are very easy to miss in testing, and, in my experience, only rear their ugly heads out in the field.  At 3AM Sunday morning.  At the important customer&#8217;s site.  In Outer Mongolia.  OK, I&#8217;m exaggerating here, but not by much.</p>
<p>mschaef: I need to write a post about the multi-threaded potential of hash tables vs. trees.  But by far the biggest advantage I see for trees is that I can implement them purely applicatively and have only one word of transactional memory associated with them.  But that&#8217;s a radically different blog post.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sp3w &#187; Blog Archive &#187; Linkage 2007.02.27</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32685</link>
		<dc:creator>Sp3w &#187; Blog Archive &#187; Linkage 2007.02.27</dc:creator>
		<pubDate>Wed, 27 Feb 2008 16:35:03 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32685</guid>
		<description>[...] Problems with hash (tables) What relevance does number theory and abstract algebra have for such a basic data structure as hash tables? Quite a lot, it turns out. [...]</description>
		<content:encoded><![CDATA[<p>[...] Problems with hash (tables) What relevance does number theory and abstract algebra have for such a basic data structure as hash tables? Quite a lot, it turns out. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael</title>
		<link>http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32678</link>
		<dc:creator>Michael</dc:creator>
		<pubDate>Tue, 26 Feb 2008 19:07:20 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-32678</guid>
		<description>I hate to post off topic, but honestly, who cares if he made a spelling mistake?

Grrate poste!</description>
		<content:encoded><![CDATA[<p>I hate to post off topic, but honestly, who cares if he made a spelling mistake?</p>
<p>Grrate poste!</p>
]]></content:encoded>
	</item>
</channel>
</rss>
