1024 by 2012!

This will be a short blog post, but I just wanted to point out this article from El Reg on the future roadmap of the Sparc CPU. Especially check out the graphic associated with that article.

Here we have Sun/Oracle’s roadmap for the Sparc processor- and we see that by 2012, Cascade Falls will have 8 sockets, each socket having 16 cores, each core running 8 threads in parallel. Kittens, cats, sacks, wives, that’s 1024 threads going to St. Ives. Of course, this is less impressive when you consider that the Niagra Falls line already has 256 threads, and the changes are just 2x the number of cores per chip (or one generation of Moore’s law, if you care to look at it that way), and 2x the number of sockets supported.

And it’s not just Sun that’s in this market- Intel is there as well, with the Larrabee Processor. They’re using the P54C as the basis for the CPU, which is superscalar (pipelined) but doesn’t do out of order execution (sound familiar yet?). Each chip will house 32 of these P54C cores, and each core will have 4 threads executing on it. That’s the same number of threads per chip (128) as Sun’s Cascade Falls chip. Put 8 of them in a server, and you’ve got yourself 1024 threads again.

Massive parallelism is coming.

But what I find really interesting is the internal design of these chips, which comes up in this video. Basically, the chips are non-superscalar, in-order chips. Pipelined, yes, but that’s about it. These are Pentiums, not P-Pros, let alone Core-2 Duos. The assumption is that each thread will be blocking regularly on memory accesses. But rather than trying to work around the blocking by all sorts of tricks like out of order execution, is to just have more threads. If one thread blocks, move on to another thread.

This is important, as it overcomes the next wall we’re likely to hit: memory latency. So, just out curiosity, I dug out my Hennessy & Patterson (2nd ed, I need to update), and looked up typical cache miss rates (on page 384 in my book, I’m using the Data cache rates for those playing along at home).

Assume a 8K L1 cache (64K divided up among 8 threads), and a 128K L2 cache (1M likewise divided amongst 8 threads). That means that 89.8% of loads hit L1, and another 7.3% hits L2, and 2.9% go out to memory. And assume 25% of instructions issue a read (which is about average), and L1 cache takes 2 clocks, L2 takes 20, and Memory takes 100 clock cycles (these are fairly average for modern cpus, I think). So every 100 instructions issues 25 loads, of which 22.45 hit L1 cache, 1.83 hit L2 cache, and 0.72 hit memory. Assuming that excepting the loads and stores, every other instruction takes 1 clock cycle to execute, the total time to execute that 100 instructions is 70 (non-load instructions) + 22.45*2 (L1) + 183*20 (L2) + 0.72*100 = 228.46. The CPU is spending a little over half of it’s time blocked waiting for memory.

Upping the memory latency of main memory from 100 clocks to, say, 500 clocks ups the time to execute those 100 instructions to 516.46 clock cycles. If you have 8 threads executing, on average you can execute 1.55 instructions per clock, from some thread or another.

Up until now, memory design has been a trade-off between throughput (bytes/sec moved) and latency (time it takes to satisfy a read). For those of you who remember the Rambus fiasco, the main technological difference (i.e. ignoring the fact that they were patent trolls) between Rambus and DDR was that Rambus optimized more for throughput, while DDR optimized more for latency. For situations where memory latency wasn’t a problem (like video encoding or 3D graphics rendering), Rambus won handily. But for everything else, DDR won handily. In a situation where a core is executing only a single thread, if that thread blocks waiting for a read to complete, then the core sits idle until that read completes. This is why memory latency was, and still is, so important in memory design. Having the core sit idle for 100 clocks is a heck of a lot better than having the core sit idle for 500 clocks.

What SMT (Simultaneous Multi-Threading, the trick of running more than one thread on a core, called Hyperthreading by Intel) does it overcomes the memory latency bottleneck, allowing memory subsystems to optimize on throughput. As we have seen, upping the memory latency from 100 to 500 clocks dropped our instructions per clock per core from 3.5 to 1.5. While we’re not happy about it, it’s survivable. Upping the memory latency on a single-threaded core the same amount, and performance would die. I don’t have solid numbers to play with, but the worst numbers I saw in the Rambus days (in the early days of the P4) was 350 clock latency on memory accesses, which was a significant performance hit (like 30-40%) on most common workloads.

And if memory latencies get worse, the multithreaded approach can offset this by going to more threads per core. If memory latencies hit, say, 1000 clock cycles, then you can simply up the number of threads each core is executing by a factor of 2. Even dropping the cache sizes by a factor of 2 as well (same amount of cache, just more threads sharing it, upping the number of cache misses per load), with 16 threads, you’ll still see an average of 1.43 instructions per clock to be executed.

You can’t play the same trick with instruction level parallelism- the parallelism simply isn’t there. I’m reminded of this seminal paper from (good lord) almost 20 years years ago, on the limits of instruction level parallelism. Even with basically unlimited look ahead, there was only a limited amount of parallelism available. Having the CPU scan further ahead in the instruction stream simply means that the CPU is examining more instructions it can’t execute yet either- at great cost in complexity. Modern x86 CPUs have a hard time sustaining execution rates above 1 instruction per clock (see, for example, this paper). The simple multithreaded approach is executing significantly more instructions per thread (over all threads) than the complex out of order approach (1.55 instructions/clock/core vr.s 1 instruction/clock/core, and that’s assuming the complex approach has the “low latency” 100 clock cycle cost to access memory, while the multitheaded approach has 500 clock cycle cost to access memory).

Of course, this requires the application to be exceedingly multithreaded to take advantage of this. Currently, Sun is marketing this chip at web servers. The idea is that it’s easy to get 1024-way parallelism, if you’re serving 1024 different http connections at the same time. And, like Sun, Intel is positioning this chip to a specialized market, this time GPUs. What these two markets have in common is that they are easy to parallelize, and they’re both “core markets” to their respective customer bases. This makes sense- go for the low hanging fruit first. But as this approach becomes more mainstream, the pressure of the performance advantages it offers will require ever more programs to go heavily-multithreaded.

Related posts:

  1. Digg Away
  2. The problem with STM: your languages still suck
This entry was posted in Programming Language Punditry. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.
  • http://barrkel.blogspot.com/ Barry Kelly

    You can have this more or less today, running on Azul – 864 cores vs 128, not threads, but cores. This Sun approach, which pretty much amounts to 8x hyperthreading, isn’t going to get you 8x performance – it’s all about getting maximum performance out of the 128 cores you already have. In the extreme, with tightly optimized code with good branch prediction and good cache locality, and strides that can be predicted, you’re best off with one thread per core.

    This Sun approach is more about maximizing throughput by leaving a lower level of scheduling up to CPU cores. To take advantage of it, you need to write your code so that it has up to 8x more parallelism than you would if you didn’t suffer latency. That extra parallelism has its own costs, of course…

  • http://tech.puredanger.com Alex Miller

    If you haven’t seen it yet, check out Brian Goetz and Cliff Click’s talk from JavaOne “Not Your Father’s Von Neumann Machine”. It lays this all out with great impact and much more detail and why it affects us now. Excellent talk.

    http://www.azulsystems.com/events/javaone_2009/session/2009_J1_JVMLang.pdf

  • Brian

    Sorry for the delay in responding.

    Barry Kelly: I’m not sure I agree with you. There is a finite amount of engineering effort that can be spent optimizing any given program, and that effort should be allocated in a way that increases the total performance of the program the most. As a simplification, you might split optimization up into two different parts: optimizing each thread for optimal performance, and making the program more multi-threaded. A given allocation of resources would then be a fraction x of the resources (programmer time) spent on optimizing the performance of individual threads, and 1-x spent on making things more multi-threaded. The question then becomes: what value of x gives the greatest speed-up for the program?

    Azul systems is, in fact, making the same argument that Sun’s Niagra and Intel’s Larrabee are- that x should be relatively small. That it is more efficient, overall, to have lots of inefficient threads, than to have a few highly optimized threads. For a classic “large x” system (which wants a few, highly optimized, threads) would be your “normal” CPUs- the Cores and Opterons and Sparc-VII/VIII/Juptier chips. Chips with only a few cores (and a few threads per core, or even 1).

    With serious optimization, really good cache hit rates, proper instruction scheduling, etc., a modern “fat” CPU like the Intel Core can execute 3 instructions per clock. Put eight of these monsters on a chip, and you’re getting, at best, 24 instructions per clock. But on the Azul system, with it’s 864 cores, even if only 1 core in 20 is doing anything on at any point in time on average, it’s still executing 43.2 instructions per clock. I think you’ll agree that having an IPC of 0.05 qualifies code as not being highly tuned. But even with very poorly tuned threads with terrible IPCs, the Azul system is doing 80% more work per clock cycle than the Intel Core chip could do at best. And the reality is that it’s rare for even highly optimized code to get much past 1.0 IPC, and it’s rare for even unoptimized code to have an IPC as low as 0.05, so the advantage of the Azul system over the Intel Core system is probably even larger than this.

    Whether you want a smaller number of cores and have each core executing more threads, or more cores executing fewer threads, is debatable. There are several non-linear trade-offs being made. But from a programmer’s perspective, it just doesn’t matter. The point I’m trying to make is that the future of performance is lots and lots of inefficient threads, not a few super-efficient threads.

  • Brian

    One other note I’d like to make: with 864 cores on a chip. those chips don’t have fancy features. They’re no going to be doing the super-scalar execution, nor are they going to be doing branch prediction. They’re going to be 486-style chips, executing one instruction per clock (at best), in order. Branch prediction is for when the CPU is trying to speculatively execute past the branch. If you’re not speculatively executing instructions (which takes a hell of a lot of complexity and transistors to do), then branch prediction is pointless.

    The interesting thing here is that optimizing for the 486 is much simpler than optimizing for, say, the Core or the Opteron. It doesn’t matter what order you issue the instructions in, for example, they still take the same amount of time. Fewer instructions are better, and cheaper instructions are better than expensive ones, but other than that, optimization is mainly about eliminating cache misses and big-O notation improvements.

  • http://barrkel.blogspot.com/ Barry Kelly

    Branch prediction is also important for avoiding cache misses, speculative execution aside.

    WRT saying optimizing for 486 vs Core, I think you meant the reverse; 486 and pentium in particular were quite sensitive to instruction ordering for maximum performance, and getting good FPU performance meant treating the FP stack as if it were a register machine, constantly queuing operations and fxch new operands to the top, rinse and repeat. To a fair degree, that’s less important now.

    I don’t think the future will lie in much explicit threading by programmer applications; rather, it’ll be “blobs” of execution, tasks, which will be handed off to an OS-level task scheduler (as opposed to process or thread scheduler) which will process tasks as appropriate across all the cores in the system, taking account of how busy everything is. With this in mind, I think a CPU making scheduling decisions is of doubtful use; yes, it makes sense for small-time opportunistic concurrency for stalls and the like, but you don’t want the CPU deciding to ignore a task for a (relatively) long time when there’s idle resources elsewhere in the system.

    The OS or other system-level process needs to do the task coordination, though, as it’s too easy for individual processes to end up with broken assumptions about how busy the system is, and end up oversubscribing the hardware threads, and wasting too much time in context switching.

  • Brian

    Instruction scheduling was very important for the Pentium, because the Pentium was super-scalar (capable of executing multiple instructions in a single clock cycle), and to access that capability you needed careful instruction ordering. The 486 was much less sensitive to instruction ordering, as it was an in-order scalar design (one instruction completed before the next instruction started). How you optimized for the 486 was to execute fewer instructions, and cheaper instructions.

    On the other hand, both of these CPUs were not that much faster than the memory they were attached to- you weren’t seeing 100 clock cycle delays to read main memory. I can’t find good numbers on them, but if I recall correctly the P5 only took a few tens of clock cycles to access memory. At that point, it was all about how many instructions you could push through the CPU per second.

    These days, it’s all about memory. You’re right in that instruction selection is a lot less important these days, because the actual overhead of executing the instructions is damned near free, what needs to get optimized is memory accesses. This is why modern CPUs are such monsters, with huge caches and all sorts of out of order and speculative execution to work around the huge memory delays. Branch prediction is only necessary for speculative execution, by the way- if you’re not try to speculatively execute instructions past the branch, you can just wait until the branch resolves. Why do you need to speculatively execute past the branch? Because code before the branch is blocked waiting for memory, so you might as well try to do something useful with the time. Or the CPU could switch to another thread and start doing work there.

    Going to the OS is certainly going to be part of the repertoire of the multi-threaded toolbox. The advantage, as you point out, is that it allows the OS to schedule things with knowledge of the entire system. The problem with it is that it enforces a certain level of granularity in the chucks you schedule- making a system call, even the cheapest one, is expensive. We’re talking microseconds of time here. So trying to schedule tasks that take less than, say, 10 microseconds will spend a very large chunk of their time just calling the OS and scheduling overhead.

  • http://barrkel.blogspot.com/ Barry Kelly

    When I said the OS needs to supply task coordination, I didn’t mean that context switching or entering the kernel is needed; simply that the OS needs to coordinate all the task dispatchers across all processes on the machine, so that the CPU is not oversubscribed. Enqueuing task should not be much more than an interlocked operation on a memory operand.

    Think Microsoft TPL, Intel TBB, Apple GCD – this is what I’m talking about.

  • Categories