So, my post on The Problem with STM: your languages still suck got listed on reddit, where a number of comments were generated. A number of the comments are worthy of a response, which is what I offer here.
Mutable data means never having to say you’re synchronized
grauenwolf said:
If you are working with immutable data you don’t need STM.
This is quite right- nor, do I comment, do you need any sort of other synchronization. Since no one can change it, it doesn’t matter how many threads are accessing the data, and in what way, at the same time. Also note that this has nothing to do with the language at all- the larger the percentage of data that is immutable, the smaller the number of synchronization bugs, and this is as true in Java as it is Haskell (and yes, Virginia, you can have immutable data in Java- final is your friend). Learning a functional language is an advantage even if you keep on programming in Java or C#, if for no other reason than it’ll teach you functional data structures.
However, you can’t have a useful programming language which only deals in immutable data structures- trust me, the Haskell people tried that, and tried really hard, and failed. At the end of the day, writing a file or reading from pipe is a side effect. And a lot of code is just easier to write using small amounts of mutable data. So it’s definitely worthwhile to encourage programmers to limit the number and locations of side effects in their code, but simply outlawing side effects altogether isn’t a viable option.
Mutating Immutable Data
There is, I think, some confusion among some of the people out there about how do you change an immutable data structure- and this is important, because it goes to the heart of one of the points I was trying to make. There is a pattern in functional languages which might be called “return a new structure which simply reuses large parts of the old structure”, which is used where an imperative implementation would do a (destructive) update (feel free to skip this section if you already know this). An example makes this more clear (code in Haskell, but it should hopefully be more or less self-explanatory, or explained). Say we have a tree data structure, like:
data Tree a = Leaf | Branch (Tree a) a (Tree a)
This just says that our data type, named Tree, holds elements of some type a, and is either a Leaf node, or a Branch node- and that Branch nodes have three elements, the left sub-tree (which is just another Tree), the element at that branch (elements are only held within a branch), and the right subtree. For example’s sake, I’m leaving out balancing. Now, this is an immutable data structure. So how do we add an element to our tree?
The solution is that we can’t- but what we can do is create a whole new tree with the new element in it. So our insert function would have the type:
insert :: Ord a => a -> Tree a -> Tree a
insert is just the name of the function the :: means we’re just giving the type of the function the Ord a => means that whatever type a the tree is holding needs to be ordered- i.e. it needs to have the various comparison operators defined for it, and the rest of the type means that insert is a function that takes two arguments- some value of type a and a tree that holds type a’s, and returns a new tree (that also holds type a’s). Don’t worry if the type looks weird- the important point here is that the function doesn’t modify the old tree, it creates a whole new tree to insert a new element.
This sounds horribly inefficient- what happens if your tree holds millions of elements? Do you copy all the millions of nodes? No- something magical happens in the insert function. Let’s take a look at it:
insert x Leaf = Branch Leaf x Leaf insert x (Branch l y r) = if x < y then Branch (insert x l) y r else if x > y then Branch l y (insert x r) else Branch l x r
The first line just says that if we’re trying to insert an element into a Leaf (i.e. an empty tree), we just create a one-node tree with empty subtrees, and the element as the root element. The rest of the code is what to do if the tree isn’t empty- in that case, if the element we’re inserting (x) is less than the root node of the tree, we recursively insert the node into the left subtree, creating a whole new left subtree. We then create a new tree with the same root element and right subtree as the old tree, but with the new left subtree. We pull the same stunt if the element we’re inserting is greater than the root node, except we recursively insert into the right subtree instead. And, in the unlikely event that the element we’re inserting is equal to the original root node, we just replace the old root element with the new root element.
Notice that at each level, we’re only changing one sub-tree, however! For example, when x < y, we update the left sub-tree, but we don’t change the right sub-tree. Indeed, we share it between the old tree and the new tree. This is perfectly safe, if you think about it, because the right subtree is immutable- no one can change it. Not the new tree, not the old tree, no one. Of course, the exact same logic holds on the subtree we “changed” as well- at each level, we change one subtree but reuse the other. The vast majority of the nodes- the ones not changed- are therefor shared between the old and new trees. Only the specific nodes we changed differ- those nodes between the ultimate root and where ever we decided to insert the new element. Assuming that the tree is more or less balanced, this means that for a tree with N elements (aka N nodes), we’ll only have to allocate log(N) or so nodes- so if the tree contains millions of elements, we only need to allocate a few dozen new nodes to insert a new element into the tree. And functional programming languages have very very fast allocators, so allocating all those nodes isn’t a big deal, especially compared to the overhead of synchronization.
But this raises the question: what happens to the old tree? The answer is: nothing. It’s still as valid as it ever was. The code that inserted the new element can choose to drop it’s pointers to the old tree in preference to using the new tree, in which case sooner or later the garbage collector will be along to collect the few dozen nodes that aren’t shared between the old and new trees, and thus have become garbage. Or it might choose to drop it’s pointers to the new tree, and go back to using the old tree, thus perfectly and easily undoing the insert. The next time you find yourself spending a lot of time and effort writing and debugging an undo system for some data structure, remember that immutable data structures have one built in. Or the calling code made decide to go on using both- at which point it will be glad that, since the data is immutable, it doesn’t have to worry about which nodes are which trees, and who is responsible for freeing what (freeing an object is a mutation- I don’t think you can do proper immutable data in C++. You can get close, with great pain and effort, but no cigar), or what locks you need to be holding to access which parts of a given tree. There are advantages to immutability that go beyond parallelization.
Implicit Serialization
Now, you still need to be able to, in code, go “what is the current, most up-to-date copy of this data structure” and “make this the current copy of this data structure” in many problems. The data structure itself may be (probably should be) immutable, the point here is that you want to share it with other parts of the code. The normal way you would do this would be to have a single piece of mutable data shared between all the various pieces of code that need to have access to it (either as a global variable, possibly with limited visibility, or as an allocated variable whose lifespan is long enough to encompass all of it’s uses). This is what I was talking about in my original post- you have a single reference cell, which holds a mutable pointer to the current root node of the tree, say. Code can then read this pointer when they want the current tree, and write it when they want to change what the current tree is.
But there is an alternate solution to this which I’d like to explore, because it demonstrates one of the five great errors of parallel programming (the five horsemen of the parallel apocalypse, I call them)- not being parallel enough. The idea here is that we simply pass around what the current data structure is. So if a function needs to get the current data structure, it gets it passed in, and if it needs to change the tree, it returns, in addition to whatever it’d normally return, it’d also return it’s updated data structure. So let’s say our function doit would normally take a value of type Foo and return a value of type Bar, so it’d have a type signature of:
doit :: Foo -> Bar
Here, we see doit is a function that takes one argument, of type Foo, and returns a value of type Bar. Using this trick, the new type signature for doit would be:
doit :: Tree Quux -> Foo -> (Bar, Tree Quux)
So now doit takes two arguments- the current tree (holding elements of type Quux), and the original Foo. It returns a tuple of the Bar it’d original return, and the updated tree. A tuple is just an unnamed structure whose elements are only referred to by position, so we can talk about the first or second elements of a tuple. By the way, this is how purely functional code returns multiple values, it returns a tuple. This trick is just an extension to how we updated immutable data structures in the first place, so if it looks familiar, it’s because it is.
Haskell has the concept of a “State Monad”, which I won’t talk about (much). Except to say that it’s basically just a tricky way to transform code to pass around a piece of state just like I’m doing above, without having to deal with all the tedious book keeping and boilerplate. And it looks much more like mutable data- you call the function get to read the current state, and the function put to write a new state. But it’s a monad (and worse yet, a monad transformer), which involves a whole bunch of explanation and concepts that are really not germane to the point at hand. So, suffice to say, it has exactly the same problem as the above code.
The problem comes in when we want to have multiple pieces of code both wanting to update the same current data structure. Say we also had a function doit2 that also wants to access the tree. We could write code like:
let myBar, tree2 = doit1 myFoo originalTree in let myBaz, tree3 = doit2 someArg tree2 in ...
The problem with this bit of code is that we’re enforcing a serialization of the two functions doit1 and doit2 based upon the updated tree. doit2 can’t start executing until doit1 is finished, because it needs to know what state doit1 leaves the tree in. Laziness, in the Haskell sense, saves you a little bit here, but not much. The first get in doit2 has to happen after the last put in doit1, that’s what the data dependency means. Consider the case where doit1 does a put very late in it’s processing, and doit2 does a get very early in it’s processing. In a single-threaded program, this lack of laziness isn’t that big of a deal (and can even be a help)- but in a parallel program, this is death. We’ve “solved” the problems of parallel access to the data by eliminating the parallelism. Not good.
The Five Horsemen of the Parallel Apocalypse
There are a lot of partial solutions to parallelism out there, so I thought I’d actually list the five big new errors that parallel programming introduces. Originally this was intended to be a blog post on it’s own, but it’s hard to put enough meat on it’s bones to make it worthwhile, so maybe making it part of a bigger blog post is a good idea. All of these errors are things you only encounter in writing parallel code, and they’re all things are can pretend to work, or can work in some cases but not in others. Worse yet, many of them are true Heisenbugs- they are dependent upon subtle timing errors between parallel code executions, so seemingly inconsequential changes in seemingly unrelated parts of the code, such as adding or removing debugging print statements, or pausing executions in a debugger, can radically change, even mask, the error.
So, take it away, David Letterman:
Number Five: Deadlocks. Thread A holds lock 1 and is trying to get lock2, thread B holds lock 2 and is trying to get lock 1, Mexican standoff. If this was the only problem multi-threaded code had, I’d declare the problem solved. One trick is to number your locks so trying to lock a lower-numbered lock when already holding a higher numbered lock (like thread B is trying to do) is an error. Another way is to do occasional checks for deadlocks- create the bipartite graph of threads and locks, where there is a directed edge from a thread to a lock if the thread is waiting on that lock, and a directed edge from a lock to a thread if a thread owns the lock. If there is a cycle in this graph (which can be detected in O(N) worst-case), you have a deadlock, and somebody attempt to get a lock fails. Another way is to use timeouts. Like I said, there are solutions- maybe not the best, but usable- for deadlocks.
Number Four: Race conditions. Two or more threads changing the same piece of data without synchronization. The classic case is updating bank accounts. Consider the nightly batch job that goes through and updates everyone’s accounts for the checks cashed that day. The process reads the account balance, and if it’s greater than the amount of the check, subtracts the amount of the check from the account balance and writes the updated account balance back. So, I have $1000 in my account, and cashed two checks today, one for $300 and one for $400. Thread A is handling the first check, reads my account balance (of $1000). But meanwhile, at the exact same time, thread B is handling the second check. Thread B now reads my account balance- which is still $1000. Thread A has now calculated that I do, indeed have enough funds in my account, and that my updated balance is now $700. Thread A now writes what it thinks my new balance is back. Thread B, continuing to thing my balance is $1000, then calculates that I do have enough funds in my account for the second check, and that my updated balance is now $600, which it writes back. So, when both threads are done, my account balance is $600, the last value written. The change made by thread A has vanished.
Note that race conditions can be incredibly hard to detect. The amount of time it takes to read a value from memory, test it, do a subtraction, and write it back, is very very small. We’re talking maybe a few clock cycles, or a few nanoseconds, here. The time window for thread B to come in and mess everything up then is also very very small- delay thread B even by a few nanoseconds, and thread B doesn’t read my account balance until after thread A has written it, and there is no problem. I once had to track down a race condition that, doing several thousands of tests per second, only happened about once every four days or so. Which, unfortunately, meant that one of our customers was being taken down by this bug every other week or so, and thus needed it fixed- ignoring it wasn’t an option. But consider the impracticality of units tests in this case- you’ve been running the same unit test for three days running, and haven’t observed a problem- how likely are you to go “I guess there isn’t a problem, I’ll stop now”?
Number Three: Live locks. Say you have three threads all competing for the same resource (getting access to the CPU, getting the same lock, etc.). A and B are higher priority threads, and get the resource in preference to low priority thread C. Worse yet, whenever one of threads A or B release the resource, the other thread of A or B is already waiting for it, so poor old low-priority thread C never gets it’s turn. Thread C is live locked- a situation not unlike the dead locked situation described above. But the problem is that ti’s less obvious that there’s a problem. In the dead lock situation, no thread involved in the deadlock is making any progress at all. In the live lock situation, threads A and B are happily making all sorts of progress, and are both regularly giving up the resource to other threads (well, an other thread).
I introduced live locks here using priorities, and in some sense they are a problem with priorities. But not having priorities also causes problems. Consider the program that wants to perform some hideously expensive computation, so it’ll take all the CPUs running flat out many hours to solve the problem. But the program is nice and throws up a dialog box with a cancel button, in case the user gets bored and tired of waiting. Well, handling the mouse click event on the cancel button needs to be higher priority than the big expensive computation. And now we have priorities, and the possibility of live locks. But the same problem can show up even without explicit priorities. For example, one problem a lot of STM implementations have is with large, long, transactions that update a lot of state competing with small, short transactions updating a small amount of state. Say threads A and B are popping in doing short transactions that are just enough to make thread C, doing it’s long transaction, have to abort and retry. Poor thread C can never finish it’s transaction if A and B keep screwing it up. The solution to this is that at some point C needs to be able to say to A and B no, you guys get to stop and wait until I’m done, so thread C can (eventually) complete it’s transaction.
Number Two: Priority inversions. With priorities, we also have the possibility of a high priority thread being blocked by a lower priority thread. Say we have three threads, high priority thread A, medium priority thread B, and low priority thread C. A is waiting on C to release some resource (a lock, say)- but C can’t get in to release the resource because B is hogging all the CPU. The priorities of A and B are now inverted. Although this bug often doesn’t lead to “and then all hell breaks loose” sorts of bugs, it can still leads to serious problems- for example, if A is the handler for the cancel button on the popup window, and it’s trying to get in to cancel the big, expensive computation being performed in thread B, then the user will get annoyed that he clicked the cancel button minutes ago and the damned thing still hasn’t stopped. Also, if there are any real-time guarantees that thread A needs to meet (no matter how soft real-time they are), priority inversions can cause A to blow it’s schedule.
And the number one horseman of the parallel apocalypse is: not being parallel enough. This is especially true when the number of cores available is growing at an exponential rate.
Faking It
So, getting back to the direct responses, grauenwolf writes again:
Now lets say you have some code like such:
1. Read values A and B
2. Perform some I/O operation using A and B
3. If the I/O operation succeeded, update value A else update BYou can’t do that in a normal clojure transaction. So what do you recommend?
You can’t do that in a normal Haskell transaction either (unless you use unsafePerformIO, which we don’t like to talk about, and is a really bad idea). The first thing I’d recommend is not doing that: rethink your architecture, or rethink the problem. Because you have a deep problem: what happens if someone else wants to come along and change A and/or B while you’re off doing the I/O?
So, that said, you can still do the above, with a little trickery. It’s possible to write a classic-style lock using transactional memory- the act of locking and unlocking are what takes place in the transaction. A simple implementation might be:
type lock = TVar Bool makeLock :: IO lock makeLock = newTVar False getLock :: lock -> IO () getLock lck = atomically do isLocked <- lck; if isLocked then retry else writeTVar lck True putLock :: lock -> IO () putLock lck = writeTVar lck False
Note that retry means the transaction blocks until one of the variables it has read changes- in this case, it’s only read one variable (the TVar that holds the boolean that tells us if it’s locked or not), so it blocks and waits until someone else unlocks the lock. A language sufficiently powerful enough to be useful can not prevent you from being stupid- all it can do is encourage you not to be.
Ostrich Programming
RalfN says (in part):
STM are not real options in C++ or Java.
That’s exactly my point. The question is, is this the fault of STM, or the fault of Java/C++?
On the other hand, I don’t think it’s really that big of a deal, if certain guarantees can’t be enforced. I mean, thousands of lines of code have been written without any guarantee concerning the pointers. Yet the programmers were disciplined and the bugs were few.
And here we come to the crux of the issue. Every time I hear an opinion like this, I think of that child in Africa. You’ve seen the pictures- some kid in some pest-ridden hell hole, hungry, diseased, but smiling away, as happy as can be. Why? Because he doesn’t realize how badly his conditions suck. Everyone he’s ever seen or heard of is living in the exact same hell hole, so that’s just how life is. The point here is this: debugging a program is not free. But it’s easy to miss how expensive debugging is, if you have nothing else to compare it too. I’ve spent the bulk of my career so far in C/C++/assembler land, so trust me, I know all about wild pointer bugs, and what a pain they are to track down. Been there, done that, got the scars to prove it.
I estimate that 2/3rds of the effort that goes into the average C/C++ program is debugging cost- both the obvious cost of tracking down bugs that shouldn’t have occurred in the first place, and in less obvious cost of unnecessary engineering effort to try and prevent the bugs in the first place. But that’s a hard statement to prove, because demonstrating it requires you to learn a functional programming language like Haskell or OCaml well enough to be comfortable with it, then implement a non-trivial program in it, and then honestly compare your experiences with previous experiences in C/C++.
But let’s ignore that for now. OK, C/C++ were good enough for the 70’s and 80’s, and even into the 90’s. So what? The complexity of programs keeps getting larger. Do this test: dig out one of the old copies of WordStar, say, or WordPerfect (for Dos), or Turbo C. Boot it up in an emulate and work in it for a little bit. Notice how limited they are. How few features. I’m not even talking about fancy whiz-bang graphics here, I’m talking about core functionality. And as programs get more complicated, the harder they are to debug and get working right.
Also, in this day and age of ubiquitous network connectivity, bugs aren’t just annoyances that cause you to have to restart (400 times a day) and maybe lose some data, bugs are increasingly security holes that cause you to get your credit card numbers stolen and your PC turned into a bot. Increasingly, bugs cost people real money- and not necessarily the people affected by the bug. It doesn’t matter if you run OpenBSD as your desktop behind three firewalls in a bombproof bunker in an undisclosed location, if your bank or supermarket or whomever gets hacked. And you still have to deal with the tsunamis of spam being sent by other people’s insecure PCs being turned into bots. The cost of insecure, bug-ridden code is paid for by all. It doesn’t matter if there is only one place in your million lines of code where you forgot to bounds check your array accesses, leaving open a buffer overflow exploit. Or only one place where you forgot to sanitize your inputs before shipping them off to the SQL server, leaving only a single SQL injection attack left. That’s the one the bad guys are going to zero in on, and say goodbye to your credit card numbers.
We as an industry have, ever so slowly, started fixing these bugs. For example, I can’t think of a single modern (designed post-1990) language that has had any widescale adoption at all that has the wild pointer and buffer overflow bugs of C/C++. But we’re about to walk off a cliff with respect to bugs. The five horsemen I talked about above have a lot in common with wild pointer bugs from C/C++, in that where the bug is and where the bug is detected can be arbitrarily far apart.
Message Passing Revisited
My actual opinion on message passing is a lot more nuanced than the original post let on to, mainly in a vain attempt to keep on the original subject. I actually think message passing is a great idea, and widely and highly useful- and that it’s not so much competing with STM, as complementing it. But message passing by itself is not sufficient! First off, while it does work well in some cases (and this was known long before Erlang came around- look up MPI some time, or even IPv4), this doesn’t mean it works well in all cases.
The big advantage of message passing, especially share-nothing message passing, is that it’s easy to retrofit into languages with little more than a library (for example, see MPI, or IPv4). The big disadvantage I see with message passing it that it doesn’t scale- if you design your program to use, say, 100 threads, and you run it on a machine (or cluster) with 200 cores, you’re not going to be using half of your cores. And fixing this problem requires not just a tweak of your program, but often times a major restructuring, to the point of rewriting. But architect a program that uses 100,000 threads and try to run on a single-core box and watch it drag.
The solution I think will work will be to use both message passing and STM. Use message passing to break up the program into big parts, say to scale across the machines in the cluster, and then use lots of threads communicating with STM within a single “message server” to scale across the cores.
6 Comments
Your markup is screwy — just after “Faking It” you need to change </BLOCK> into </blockquote>
Great article, but for the love of God start using a spelling and grammar checker! I lost count of how many “it’s” there were in place of the possessive “its”. It may seem like nitpicking but every mistake stuffs up my English parser and disrupts me from enjoying your otherwise fine writing.
Fixed. Thanks.
I would like to hear your thoughts on Objective-C 2.0 with regards to this article.
In particular I see three things worth commenting on:
(1) The immutable objects in Objective-C are much more robust in terms of accidental mutation. This is because you operate on them with messages and the collections are class clusters which restricts the safe assumptions you can make about their internal design. Do you feel this makes any significant progress away from C\C++ and towards Haskell?
(2) Objective-C uses messaging and so hypothetically could work with your scenario of a program using message passing and STM at the same time. You say “share-nothing message passing” and while afaik Objective-C cannot enforce this you could certainly write a program that did not share much data between threads.
(3) Objective-C 2.0 introduces @properties which go a long way towards simplifying programmer usage of locks.
I am not very interested in commentary on the new Cocoa classes for parallel programming or OpenCL, but I will certainly read your thoughts on those topics if you choose to share them. I am more curious how you would evaluate Objective-C for future parallel programming work. You have made your thoughts on C\C++ and Haskell very clear, but I see Objective-C as something of a grey area. My suspicion is that you will find it an inadequate language, but I am unsure enough that I want to ask.
thanks for your continuing thoughts on these issues, since they are obviously important, and i wanna learn more.
i’d like to note that haskell not only has stm, it also has lightweight processes with message passing. it surprised me when i found that out, since i figured erlang was really the only thing with really good lwp + messaging features. man, haskell seems cool.
of course, i haven’t myself gotten fully into the grok of monads yet, so i’m still just a haskell wannabe l00zer. grn.
Nice article. Especially since I already loved the first one.
Just a quick note, in the haskell code for locks makeLock should have type STM Lock
or should be wrapped in atomically. Same with putLock. I also believe concrete types have to start with uppercase letters so it should be ‘type Lock = TVar Bool’.
Of course this only adds noise and is totally besides the point of the blog post, so feel
free to ignore me.
@ Raoul: Would you mind giving a pointer at where to find a message passing
implementation for haskell? Note that I really don’t have clue about message passing so
if it’s just the stuff in Control.Parallel then I’m happy to go back and do my homework
so I can recognize it as such.
One Trackback
[...] the high cost of STM is much less of an issue. Brian then further flushes out his ideas in a follow up blog in which he responds to responses to his [...]