OCaml the scripting language

Most companies develop a “default language”- this is a syndrome wherein if you use the default language for any job, no explanation needs to be given. Use any other language, and at least some explanation is needed. At Jane Street (my new company), that language is OCaml. It’s not that other languages aren’t used- I’m aware of Jane Street code written in C, C++, C#, Java, Perl, Python, and Visual Basic. But nevertheless, the default language is OCaml, and the vast majority of the code- several hundred thousand lines’ worth of it- is written in it.

So this week was spent writing a bunch of programs, in OCaml, which read files in various formats (comma separated values, pipe delimited fields, fixed length fields, etc.) and throw them into a database. This is a job that a normal scripting language, like Perl or Python, would generally excel at. However, they got written in OCaml for three reasons. First, it was the assumed default language at Jane Street. Second, it was my first opportunity to write OCaml at work. Third, and most importantly, the main purpose of these programs was to start building up an infrastructure for later, much more complicated, programs, and to give me experience with the build environment. Using the right tool for the job means the right tool for the entire job- trade small inefficiencies up front to avoid large inefficiencies later. It’s reasonable to use your gigawatt laser beam to kill a fly- if the main purpose of the exercise is to test out your gigawatt laser beam.

So I now have a week’s worth of experience using OCaml as a scripting language. And the early results say that OCaml isn’t a half-bad scripting language.

OCaml has a number of features that help make it a better scripting language and that are not that uncommon- for example, garbage collection, decent regular expressions, full access to the POSIX API, etc. What I want to focus on are the things I didn’t expect to be much use in a “scripting” environment. To start with, fold and map rock.

Consider the (simplest) case of fixed length fields. The first thing I do is create a list of tuples- the tuple elements being the field name, the start column (1-based, like it is in the documentation), and the end column:

let column_list = [
    "column1", 1, 3;
    "column2", 4, 9;
    ...
];;

I can then use List.map and String.sub to “split apart” a line into the component fields with a simple:

let split line =
    List.map (fun (cname, start, stop) ->
        cname, (String.sub line (start-1) (stop-start+1)))
        column_list
;;
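As a minimal runnable sketch of the same split step: the field names, column ranges, and sample line below are made up for illustration, and the layout is passed as a parameter (the hypothetical split_with) rather than read from a global.

```ocaml
(* Hypothetical layout: names and 1-based column ranges are invented. *)
let column_list = [
  "id",    1, 3;
  "name",  4, 9;
  "price", 10, 15;
]

(* Same idea as split, but the layout is an argument. *)
let split_with columns line =
  List.map
    (fun (cname, start, stop) ->
      (* Convert the 1-based inclusive range to offset + length. *)
      cname, String.sub line (start - 1) (stop - start + 1))
    columns

let () =
  let line = "042widget001.50" in
  List.iter
    (fun (name, value) -> Printf.printf "%s=[%s]\n" name value)
    (split_with column_list line)
```

One thing to keep in mind: String.sub raises Invalid_argument if the line is shorter than the layout expects, so truncated input lines fail loudly rather than silently.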

Note that the map I’m using here is non-destructive: it doesn’t change column_list. So constantly remapping column_list, sort of using it as a template, works just fine.

But it gets more fun than that. Many of the columns then need special conversions done to them. For example, many columns need to be wrapped in SQL quotes, while other columns are datetimes that need the correct time zone added, etc. So what I do is define a mapping of column names to the proper conversion functions:


(* A converter *)
let quote s = "'" ^ s ^ "'";;

(* Another converter *)
let add_timezone tz s = s ^ " " ^ tz;;

(* This I put in a common library *)
module StringMap = Map.Make(String);;

let converters = List.fold_left
    (fun m (k, d) -> StringMap.add k d m) StringMap.empty
[
    "column1", quote;
    "some_column", (add_timezone "EST");
    ...
];;

Note the partial function application there. I have a number of cases where the converter has some sort of parameter, such as the time zone to add above, or the number of digits of precision a NUMERIC has, which, while constant for a given column, may differ between columns. Partial function application allows me to use common parameterized conversion functions. Also, converters is constant data, not unlike column_list. So I start with it as a list, throw it into a map, and then stop mucking with it. Then I can take the list of (column name, column value) pairs that I got out of split for a line and call the correct conversion function on every column by just going:

let convert lst =
    let f (k, d) = try
        k, ((StringMap.find k converters) d)
    with
        | Not_found -> (k, d)
    in
    List.map f lst
;;
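Putting the pieces together, here is a self-contained sketch of the converter map and the convert step. The column names and sample values are hypothetical, and it swaps the try/Not_found lookup for StringMap.find_opt, which assumes OCaml 4.05 or later.

```ocaml
module StringMap = Map.Make (String)

let quote s = "'" ^ s ^ "'"
let add_timezone tz s = s ^ " " ^ tz

(* Hypothetical column names. Note add_timezone "EST" is a partial
   application: it yields a string -> string function like quote. *)
let converters =
  List.fold_left
    (fun m (k, f) -> StringMap.add k f m)
    StringMap.empty
    [ "name", quote;
      "created", add_timezone "EST" ]

(* Apply the matching converter; pass unknown columns through as-is. *)
let convert fields =
  List.map
    (fun (k, v) ->
      match StringMap.find_opt k converters with
      | Some f -> k, f v
      | None -> k, v)
    fields

let () =
  convert [ "id", "42"; "name", "widget"; "created", "2007-03-01 12:00:00" ]
  |> List.iter (fun (k, v) -> Printf.printf "%s: %s\n" k v)
```

Because every converter has the type string -> string once its parameters are applied, they can all live in the same map regardless of how many parameters each one started with.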

As a comment to the speed freaks out there, for small numbers of elements, using an O(log N) balanced binary tree, like I’m doing here, is actually faster than using an O(1) hash table. It’s simply not that expensive to compare strings, compared to the cost of computing even a middling-decent hash value. I’ve backed this up with actual timings of real code. Not that it matters at all in this case- the database is the biggest time sink by far. A few clock cycles here or there aren’t going to be noticed.
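For anyone who wants to check this on their own machine, a rough measurement sketch follows. The keys, the iteration count, and the time helper are all assumptions made for illustration; this is not the timing code referred to above, and the numbers will vary with the OCaml version, key lengths, and hardware.

```ocaml
module StringMap = Map.Make (String)

(* Hypothetical keys: a handful of short column names. *)
let keys = [ "column1"; "column2"; "column3"; "column4"; "column5" ]

let map =
  List.fold_left
    (fun m k -> StringMap.add k (String.length k) m)
    StringMap.empty keys

let table =
  let t = Hashtbl.create 16 in
  List.iter (fun k -> Hashtbl.replace t k (String.length k)) keys;
  t

(* Look up every key n times with f; return elapsed CPU seconds. *)
let time n f =
  let start = Sys.time () in
  for _i = 1 to n do
    List.iter (fun k -> ignore (f k)) keys
  done;
  Sys.time () -. start

let () =
  let n = 200_000 in
  Printf.printf "Map:     %.3fs\n" (time n (fun k -> StringMap.find k map));
  Printf.printf "Hashtbl: %.3fs\n" (time n (fun k -> Hashtbl.find table k))
```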

A similar trick, except using folds instead of maps, works to convert my list of now-converted column names and associated column data into a SQL statement ready for submission to the database. But I’ve made my point about the utility of folds and maps. And partial function application has also been more useful than I would have thought last week.
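The fold-based step is not shown in the post. A minimal sketch of what it could look like, assuming the goal is an INSERT statement and using a hypothetical insert_statement function and table name:

```ocaml
(* Assumes the values were already converted (quoted, etc.) upstream. *)
let insert_statement table fields =
  (* One pass over the pairs, accumulating names and values separately.
     Both lists come out reversed, so they are flipped back below. *)
  let names, values =
    List.fold_left
      (fun (ns, vs) (k, v) -> k :: ns, v :: vs)
      ([], [])
      fields
  in
  Printf.sprintf "INSERT INTO %s (%s) VALUES (%s);"
    table
    (String.concat ", " (List.rev names))
    (String.concat ", " (List.rev values))

let () =
  print_endline
    (insert_statement "trades" [ "id", "42"; "name", "'widget'" ])
```

This builds the statement by string concatenation, which suits trusted input files like the ones described here; with untrusted input, a parameterized query would be the safer route.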

Another advantage the above code displays that I would like to point out: not everything needs to be an object. This is a failure of a lot of the “object oriented” languages- they have their golden hammer (objects) and dammit, everything is going to be a nail. There is little (if any) advantage to writing tree.add key data over add tree key data. This is a case where inheritance isn’t going to be used- I know exactly what type the tree is, and it’s unlikely to change. And the extra verbiage I’d need to wrap the convert functions into objects is simply wasted code. It’s the price paid for trying to make everything into a nail, er, object.

Another thing that surprised me in its usefulness was OCaml’s strong static typing. This has to be the most heretical thing I’ve said in quite a while, which is saying something. But it’s true- OCaml’s strong static typing is useful, because it reduces the amount of debugging I need to do. I admit it- I’ve never liked debugging my code, and I’ve often had cause to hate it with a fiery passion. Design is fun, implementation is fun, testing is automatable, but debugging is just a drag. I’d almost rather be in meetings than debugging (almost). Strong static typing allows the computer to find my mistakes for me. The amount of time I’ve spent debugging my code has dropped- from something over 50% to something less than 20%. It’s not quite true that in OCaml, when it compiles, it works- but it’s much closer to being true (measured in the amount of time it takes on average to figure out why it isn’t working and fix it) than in any other language I know of.

Nor have I missed dynamic typing (either in the pure form of a Ruby or a Python, or in a half-way form like Java’s). Even in purely run-time type-checked languages, the dynamic capabilities are rarely used. Generally, the vast majority of variables have specific types, which are known to the programmer even if they’re not known to the compiler. This is especially true of short, simple programs. Oddly enough, the times I’ve seen dynamic typing used to greatest advantage tend to be in infrastructure roles. The gold standard here that I know of is the Java Hibernate library. But SQL isn’t that frimping complicated, folks, and these programs aren’t complicated enough to make Hibernate worth it. It’s actually more complicated to set up Hibernate and the required mappings than it is to simply form up the insert query and ship it off to the ODBC connection. Which gets you into a paradox- by the time the program gets complicated enough to benefit from infrastructure that relies on dynamic typing, it’s become complicated enough that run-time type checking is a hazard to the rest of the program.

The same goes for eval (run time code generation and evaluation). OK, Lisp hackers can talk here, but I’ve rarely seen Perl or Python or Ruby code do much of anything with eval- and most uses of it are dangerous. How can you guarantee that the code generated is correct? This is another way to hide errors from the compiler- errors that can lurk in code for years, undetected and hidden, until they spring forth- inevitably at the customer’s site. In Katmandu. Or Poughkeepsie if Katmandu sounds interesting. At 3AM. On Sunday. During your vacation. In Fiji.

Other things I don’t miss are built-in regular expressions and hash tables. Using bolt-on libraries for both of these adds a little bit of code overhead, but not that much (much, much less than the code overhead of making everything an object). We’re talking about adding a line or three per program. And, especially in the case of hash tables or maps, having them built in limits them. In the above code, I’ve only been using strings as keys to my maps- but I could just as easily use a much more complicated data structure as a key, and have in other circumstances.
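To illustrate the point about richer keys, here is a small sketch that keys a map on a tuple. The Key module, the (symbol, year, month) shape, and the sample data are all hypothetical; the only real requirement Map.Make places on a key is a type t and a compare function.

```ocaml
(* Hypothetical key: a (symbol, year, month) triple. *)
module Key = struct
  type t = string * int * int
  (* The polymorphic compare orders tuples lexicographically. *)
  let compare : t -> t -> int = compare
end

module KeyMap = Map.Make (Key)

let totals =
  KeyMap.empty
  |> KeyMap.add ("ABC", 2007, 3) 150
  |> KeyMap.add ("ABC", 2007, 4) 275
  |> KeyMap.add ("XYZ", 2007, 3) 90

let () =
  match KeyMap.find_opt ("ABC", 2007, 4) totals with
  | Some n -> Printf.printf "found %d\n" n
  | None -> print_endline "missing"
```

That is the "line or three" of overhead: one small module for the key and one functor application, after which the map behaves just like the string-keyed one.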

For this particular job, I think OCaml was not badly suited. It’s likely that Ruby or Python would have worked better. But I have to say that, as gigawatt lasers go, this one works remarkably well for killing flies.
