<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
	>
<channel>
	<title>Comments on: Problems with Hash Tables</title>
	<atom:link href="http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/feed/" rel="self" type="application/rss+xml" />
	<link>http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/</link>
	<description>programming, politics, &#38; other religious issues</description>
	<lastBuildDate>Mon, 15 Mar 2010 00:31:40 +0000</lastBuildDate>
	
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Brian</title>
		<link>http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/#comment-37154</link>
		<dc:creator>Brian</dc:creator>
		<pubDate>Mon, 11 Jan 2010 18:34:46 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-37154</guid>
		<description>Anon: the problem with including caching behavior is that it depends heavily on subtle implementation details: the effects of the GC (or lack thereof), how things are laid out in memory, what sort of cache organization you have, what else is going on, whether you are measuring a warm cache or a cold cache (i.e. whether the data is already loaded into the cache or not), and so on.  Reasoning about cache behavior quickly leads you into a thicket of complicated details, and changing any one of them changes the conclusions.  For example, I remember a Linux patch several years ago that measurably and significantly increased the performance of Linux simply by rearranging the members of the process structure.

But if you&#039;re looking at cache effects, I think you&#039;re missing the main point.  On average, in general, hash tables are going to be faster than trees (although the disparity generally isn&#039;t as large as many people imagine).  The point I was trying to make was about the cost of &quot;black swan events&quot;: those rare, one-in-a-million events.

Consider the following scenario: you have a map that generally has, say, 32 elements in it.  If you use a hash table, some set of operations takes, say, 10 milliseconds, but if you use a tree, the same set of operations takes, say, 100 milliseconds.  And with such a small number of elements, even if the hash table occasionally takes O(N) time, this likely isn&#039;t a noticeable cost: O(32) isn&#039;t that big of a deal.  So you might conclude that the hash table implementation is significantly faster.

The problem is that when you roll this code out into production, for some weird, unforeseen reason, what is normally 32 elements balloons out to a million elements.  Those operations on the tree now take 4x as long, maybe even 8x as long depending upon the vagaries of cache, comparison costs, etc.  So what used to take 100ms now takes 400ms, maybe 800ms.  Annoying, but not a serious problem.  The hash-table-based implementation, if you&#039;re lucky, doesn&#039;t take any longer: the joys of O(1) cost.  But if you hit an O(N) corner case, the hash table now takes 30,000 times as long, and what was 10ms is now 5 minutes.</description>
		<content:encoded><![CDATA[<p>Anon: the problem with including caching behavior is that it depends heavily on subtle implementation details: the effects of the GC (or lack thereof), how things are laid out in memory, what sort of cache organization you have, what else is going on, whether you are measuring a warm cache or a cold cache (i.e. whether the data is already loaded into the cache or not), and so on.  Reasoning about cache behavior quickly leads you into a thicket of complicated details, and changing any one of them changes the conclusions.  For example, I remember a Linux patch several years ago that measurably and significantly increased the performance of Linux simply by rearranging the members of the process structure.</p>
<p>But if you&#8217;re looking at cache effects, I think you&#8217;re missing the main point.  On average, in general, hash tables are going to be faster than trees (although the disparity generally isn&#8217;t as large as many people imagine).  The point I was trying to make was about the cost of &#8220;black swan events&#8221;: those rare, one-in-a-million events.</p>
<p>Consider the following scenario: you have a map that generally has, say, 32 elements in it.  If you use a hash table, some set of operations takes, say, 10 milliseconds, but if you use a tree, the same set of operations takes, say, 100 milliseconds.  And with such a small number of elements, even if the hash table occasionally takes O(N) time, this likely isn&#8217;t a noticeable cost: O(32) isn&#8217;t that big of a deal.  So you might conclude that the hash table implementation is significantly faster.</p>
<p>The problem is that when you roll this code out into production, for some weird, unforeseen reason, what is normally 32 elements balloons out to a million elements.  Those operations on the tree now take 4x as long, maybe even 8x as long depending upon the vagaries of cache, comparison costs, etc.  So what used to take 100ms now takes 400ms, maybe 800ms.  Annoying, but not a serious problem.  The hash-table-based implementation, if you&#8217;re lucky, doesn&#8217;t take any longer: the joys of O(1) cost.  But if you hit an O(N) corner case, the hash table now takes 30,000 times as long, and what was 10ms is now 5 minutes.</p>
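<p>For anyone who wants to check the arithmetic in that scenario, here is a rough back-of-the-envelope sketch in Python.  The sizes and baseline timings are the ones from the scenario; the cost model is only an assumption for illustration (tree cost scales with log2 of the element count, and the unlucky hash table has every key colliding so its cost scales linearly), and the function and variable names are made up for this example.  It ignores cache and comparison costs entirely.</p>

```python
import math

# Figures from the scenario; the cost model below is an illustrative assumption.
NORMAL_SIZE = 32
BLOWN_UP_SIZE = 1_000_000
HASH_BASELINE_MS = 10.0
TREE_BASELINE_MS = 100.0


def tree_slowdown(n_before, n_after):
    """Balanced tree: assume per-operation cost grows with log2 of the size."""
    return math.log2(n_after) / math.log2(n_before)


def degenerate_hash_slowdown(n_before, n_after):
    """Hash table where every key collides: assume cost grows linearly."""
    return n_after / n_before


tree_factor = tree_slowdown(NORMAL_SIZE, BLOWN_UP_SIZE)
hash_factor = degenerate_hash_slowdown(NORMAL_SIZE, BLOWN_UP_SIZE)

tree_ms = TREE_BASELINE_MS * tree_factor
hash_ms = HASH_BASELINE_MS * hash_factor

print(f"tree: {tree_factor:.1f}x slower, about {tree_ms:.0f} ms")
print(f"hash table (all keys collide): {hash_factor:,.0f}x slower, "
      f"about {hash_ms / 60_000:.1f} minutes")
```

<p>Under those assumptions the tree comes out at roughly 4x (about 400 ms) and the degenerate hash table at 31,250x (a bit over 5 minutes), which is where the rounded 30,000x and 5 minute figures come from.</p>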
]]></content:encoded>
	</item>
	<item>
		<title>By: jrh</title>
		<link>http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/#comment-37148</link>
		<dc:creator>jrh</dc:creator>
		<pubDate>Sat, 09 Jan 2010 02:41:49 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-37148</guid>
		<description>You are absolutely right! In the real world, I almost always end up with red-black trees over hash maps. There are a few cases where you can tune hash tables because you know exactly what the data will be. But normally, red-black trees are superior in real applications.</description>
		<content:encoded><![CDATA[<p>You are absolutely right! In the real world, I almost always end up with red-black trees over hash maps. There are a few cases where you can tune hash tables because you know exactly what the data will be. But normally, red-black trees are superior in real applications.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anon</title>
		<link>http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/#comment-37147</link>
		<dc:creator>Anon</dc:creator>
		<pubDate>Fri, 08 Jan 2010 23:44:45 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-37147</guid>
		<description>Very informative for an average programmer like myself, and I would be convinced, but I&#039;m still not sold on the cache miss problem. Obviously you are, and with a lot more experience, so could you please explain further?

You&#039;ve already argued well that code is simpler and better cached with trees, but the problem still remains that the tree itself is not as well localised as hash tables - what is your response to that?</description>
		<content:encoded><![CDATA[<p>Very informative for an average programmer like myself, and I would be convinced, but I&#8217;m still not sold on the cache miss problem. Obviously you are, and with a lot more experience, so could you please explain further?</p>
<p>You&#8217;ve already argued well that code is simpler and better cached with trees, but the problem still remains that the tree itself is not as well localised as hash tables &#8211; what is your response to that?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mebigfatguy</title>
		<link>http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/#comment-37146</link>
		<dc:creator>mebigfatguy</dc:creator>
		<pubDate>Fri, 08 Jan 2010 23:04:36 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-37146</guid>
		<description>I suggest we call an amoritorium on the spelling complaints.</description>
		<content:encoded><![CDATA[<p>I suggest we call an amoritorium on the spelling complaints.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Start Using Unordered Map &#171; ASKLDJD</title>
		<link>http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/#comment-36926</link>
		<dc:creator>Start Using Unordered Map &#171; ASKLDJD</dc:creator>
		<pubDate>Sat, 10 Oct 2009 16:12:44 +0000</pubDate>
		<guid isPermaLink="false">http://enfranchisedmind.com/blog/2008/02/25/problems-with-hash-tables/#comment-36926</guid>
		<description>[...] Unlike a balanced tree structure, hash tables are somewhat unpredictable in nature. If the hash function is poorly suited for the given data set, it would backfire and result in linear runtime. Because of this nasty behavior, some programmers would even avoid hash tables altogether. [...]</description>
		<content:encoded><![CDATA[<p>[...] Unlike a balanced tree structure, hash tables are somewhat unpredictable in nature. If the hash function is poorly suited for the given data set, it would backfire and result in linear runtime. Because of this nasty behavior, some programmers would even avoid hash tables altogether. [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
