Archive for February, 2007

Feb 17 2007

Implementation Exposure Through Inheritance

Published by Robert Fischer under Uncategorized

There’s a bit of ugliness with object-oriented development which I keep encountering, so I’m wondering if any of the locals out there have a better solution.

Specifically, my problem is that the inheritance relationship seems to be conflating two purposes in at least Java and CSharp. On the one hand, it is defining a one-way interoperability relationship: that is, a class X that inherits from Y means that wherever you could use a Y before, you can now use X, as well. In Java, this kind of relationship can be seen with BigInteger inheriting from Number: wherever a Number is called for, a BigInteger could be used. On the other hand, there is also the implementation work: a class X that inherits from Y means that class X uses the underlying implementatin of Y. This kind of relationship can be seen with Properties inheriting from Hashtable: Properties is simply choosing to inherit from Hashtable because that’s the underlying implementation.

And the conflation of these two very different purposes is complete: there is no way for X to say “I am meeting all the same contract as Y, but I am using my own implementation.” You can perform the inheritance and override all the methods, which works great until the inherited code contract changes, and now (with no compiler warning) you’ve exposed an underlying method which is built off a completely detached implementation. And that’s completely ignoring the fact that you’re still performing a full initialization on Y, even if you’re not ever going to use its faculties! For code that you control, you can always create IY, the Interface implementation of Y, but what do you do about code you don’t control. For a real work example which I have encountered, what if you wanted to implement your own BigInteger implementation (say, based on GMP?).

There’s also no way to say “I want to use all the methods of X in my implementation, but I don’t want to expose its methods to the outside world.” For a Java implementation interpretation, consider the ability to code a class as if you inherited it, but the underlying class methods all behave as private. The reason one would want to do this is because of encapsulation: what if Properties, for instance, wanted to change to get the free sorting provided by a TreeMap? Unfortunately, making the change to now inherit from TreeMap instead of Hashmap could break existing code: it’s been legal for years to work with Properties as though they were a Hashmap, and now you’re making that an illegal cast. This is likely another run-time bug that will be exposed without any help from the compiler, which means that it is a very dangerous change to be making. Yet this change can be made without any adjustment to the the conceptual entity that is “Properties”: what a properties represents is the same, and even the set of method signatures would be the same, but this new implementation could break code. The inheritance from Hashmap is just an implementation detail, yet it has become a key part of the API.

So how do you manage to keep the benefits of inheritance without exposing your underlying implementation? One solution that’s occurred to me is to ALWAYS have an Interface and a Factory for every class, but that’s a lot of noise to sidestep this programming language issue. There’s got to be a better solution — anyone got one?

Popularity: 4% [?]

7 responses so far

Feb 11 2007

Usage

Published by Robert Fischer under Uncategorized

A hint for those of you posting to this blog: if you sign in, only your first comment will ever be moderated. Once you’ve got the green light for your first post, the blog takes that as proof you’re not a spambot and allows you to post freely.

So, if you don’t like your comments hanging in limbo for possibly up to a day or two, register with the blog and you’ll be good to go.

Popularity: 2% [?]

No responses yet

Feb 08 2007

Thoughts on Parallelism

Published by Brian under Uncategorized

I was going to write a blog post flaming this blog post, but I’ve change my mind. The only real sin he’s committed is making up his mind before all the facts are in my opinion.

But I don’t think he’s right. Not because message passing is bad or is going to go away- it’s not. It’s that it’s not a case of either or. The more I think about it, the more I think the future is going to be a combination of STM and message passing. Both have their advantages and disadvantages.

The advantage of message passing is that it’s distributable. You don’t care if the other process you’re sending a message to is on the same CPU or the other side of the planet, it doesn’t matter. The disadvantage of message passing is that it requires, at minimum, at least one copy of the data, and often multiple copies of the data. Plus task switching and cache flushing. It is heavy weight, at least compared to STM and shared memory multithreading.

This is a non-trivial problem. Parallelism comes in a gradiant, based on how much data needs to be transmitted between the processing elements per time period. One on extreme end of the gradiant you have very little data that needs to be distributed. The gold standard here is stuff done by distributed.net takes on- brute force crypto key searches and the like. We’re talking about a few hundred bytes every couple of days. This makes it trivial to distribute the computational load over widely dispersed machines.

Note that this also explains why supercomputers died in the 90’s, but big iron database servers didn’t. Supercomputers were, at heart, about solving either large numbers of small linear algebra problems, or a small number of large linear algebra problems. In either case, the problems were fairly easily distributed with small communication demands- a few megabytes per second- that supercomputers could be replaced by racks of PCs communicating via ethernet. Databases, however, require more communication between nodes (especially if you want to parallelize a single query, or allow multiple queries to access the same table simultaneously), and as such didn’t distribute via ethernet. So Sun and IBM are still profitable, Cray and SGI are dead.

Distributed databases are possible, for at least some workloads, I comment (before half a dozen people do). The problem is that only certain types of workloads distribute- write a query that doesn’t distribute, and all your work has to be done on a single box. Which is why they haven’t caught on- and Sun and IBM are still in business.

The problem is that everything needs to be parallelized. But being distributed isn’t quite as big an issue. What’s driving the sudden obsession with parallelization is the sudden appearance of multicore chips. But notice something about the multicore chips- how close the CPUs are to each other. Often they even share cache. If two processes can share a data structure by just sharing a pointer to the datastructure, communication speeds are nearly infinite. This allows for multithreaded programs that have incredibly huge communication requirements. Tightly coupled programs are possible in this environment.

Tightly coupled parallelism is most likely to arise with some form of automatic parallelism. I think the fullest power of STM will be found in a Haskell-like purely functional language (with STM built into the monads automatically, meaning you can’t accidentally forget the STM) combined with cilk-style microthreading. Lazy evaluation and parallelism are very similiar concepts- both are dealing with asynchronous computation. Lazy evaluation moves the computation to the first time the value is needed, while parallel evaluation the evaluation happens (conceptually, at least) in parallel. But this level of sharing requires very tight binding, very high communication rates. But I’m not doing a lot of computation per byte of communication. This is even more the case if the different CPUs are, in fact, running on an SMT CPU (aka Hyperthreading by Intel).

I actually think that a lot of problems will be ameniable to both levels approaches. Do message passing, ala MPI or Erlang, at the top level, breaking the problem up into big chunks which can then be distributed over a local network or a highly non-uniform ccNUMA architecture. But then each chunk is solved in a microthread+STM approach, allowing for good utilization of local resources and tightly coupled parallel processes.

As a side note, the “emergent complexity” he worries about having with STM is also a problem with message passing. Emergent complexity is always a problem- and the fundamental simplicity of the individual components doesn’t help (much). And, unlike STM, message passing does admit the possibility of a deadlock- thread A waiting for a message from thread B which is waiting for a message from thread A. It’s much less likely than in the classic (and unarguably awful) threads+locks paradigm, but it’s still possible. The only advantage message passing has over STM in the emergent complexity department is that it encourages bigger chunks per thread (due to the relatively high cost of communication)- fewer interacting components means less emergent complexity.

But fewer interacting components also means less parallelism, and less performance on future hardware. The new Moore’s law is that the number of cores will doubly every couple of years now. However parallel you program is today, it’ll need to be twice as parallel in a couple of years- and dozens to thousands of times more parallel in a decade or so! Actually, the situation is even worse than this. Once multithreaded programming becomes common, the new performance metric will be performance per transistor. If I can trade off 10% performance and fit the CPU in 1/5th the die size on the same process, I can fit 5 cores where 1 used to fit and the CPU is 4.5x faster for programs that can take advantage of 5 cores.

Of coruse, 5 cores doesn’t mean I get 5x the memory bandwidth. It actually becomes more important in this case to be tightly bound, so gaggles of threads can share cache efficiently. Five threads all talking to memory and not to each other would die on performance. On the other hand, this also means that moderately size machines will start being more non-uniform in the memory architecture. Talking to the thread on the next core over is very cheap- but talking to that thread microseconds away is painfull and slow. The former is a job for STM, the latter for message passing.

That said, I could be wrong. I know what damned well doesn’t work, but I don’t know what does work yet. One good thing does arise out of this debate, I comment. Wether it’s Erlang and message passing, Haskell and STM, or some hybrid, it looks like the solution is going to be functional programming plus something else. What really drove the adoption of OO programming was GUIs. Note that Doug Englebart was one of the inventors of OO programming at the same time he was inventing the modern GUI interface. Heck, the first Smalltalk dates from 1972. But it wasn’t until GUIs hit the mainstream in the mid/late 80’s that OO programming also took off. In a similiar way, I think parallelism is going to drive the adoption of functional programming. Which will be for the betterment of programming in general.

Popularity: 17% [?]

23 responses so far

Feb 08 2007

Two Whitepapers Every Developer Should Read

Published by Robert Fischer under Uncategorized

Just forwarding these around work, so I thought I’d share with the blog, too.

Why Functional Programming Matters — It is fundamentally different and fundamentally better than imperative programming. Note that “imperative programming” includes “object-oriented” languages like Java. This article argues why with illustration.
http://www.math.chalmers.se/~rjmh/Papers/whyfp.pdf

Skip Lists — This is my go-to priority queue implementation these days.
ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf

Popularity: 3% [?]

2 responses so far

Feb 06 2007

Java and Lazy Lists: I Stand Corrected (Sort Of)

Published by Robert Fischer under Uncategorized

Jakarta Commons-Collections actually does provide a lazy list kind of beast. They just didn’t realize they did it.

There is a class called ObjectGraphIterator, which could be use to simulate iterating over a lazy list (were one to be kinda tricky about it). Combined with the TransformIterator, we can accomplish the get-time retrieval manging that I was looking for.

So Java lazy list functionality can be reasonably achieved (in a somewhat limited fashion) using business-standard open source libraries. You still don’t end up with a list at the end of it all, but the main gist of the trick is there.

If you’re not sure what I’m babbling about, check out this post for an introduction to lazy lists in their natural environment, and this post for my railing about Java’s lack of lazy lists.

Popularity: 3% [?]

No responses yet

« Prev - Next »

Green Web Hosting! This site hosted by DreamHost.