How does anyone actually do XML parsing in Ruby? I’m talking about real XML parsing with namespaces, entities, XML-encoded text nodes: the whole 9 yards. I don’t need XPath: I’m willing to walk the tree à la Groovy’s XmlSlurper. But I need something that doesn’t suck.
I’ve been trying to do some simple XML parsing: read in an XML document, extract some data elements (including elements in declared namespaces), and then take one particular node and store all of its child nodes in a CLOB (a.k.a. “text”) database field.
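For concreteness, here is a minimal sketch of that task using REXML from a current Ruby install. The sample document, both namespace URIs, and the `child_named` helper are made up for illustration, and this is not the code from the application described in the post. It walks the tree by namespace URI plus local name rather than using XPath, then serializes the children of one node into a single string.

```ruby
require "rexml/document"

# Placeholder namespace URIs and sample document, invented for this sketch.
order_ns = "http://example.com/order"
meta_ns  = "http://example.com/meta"

xml = <<~XML
  <order xmlns="#{order_ns}" xmlns:m="#{meta_ns}">
    <m:id>42</m:id>
    <payload><item>Fish &amp; chips</item><item>Tea</item></payload>
  </order>
XML

doc = REXML::Document.new(xml)

# Hypothetical helper: find the first child element by namespace URI and
# local name, so the lookup does not depend on the prefix the document uses.
def child_named(parent, ns_uri, local_name)
  parent.elements.find { |e| e.namespace == ns_uri && e.name == local_name }
end

order_id = child_named(doc.root, meta_ns, "id").text
payload  = child_named(doc.root, order_ns, "payload")

# Serialize every child node of <payload> back to markup; this string is
# what would be written to the text/CLOB column.
fragment = payload.children.map(&:to_s).join

puts order_id
puts fragment
```

Matching on the namespace URI instead of the literal prefix keeps the lookup working when a producer picks a different prefix for the same namespace.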
REXML has been a nightmare. It just barely works for parsing. Its XPath, while nominally supported, is getting me all kinds of weird results. It’s never clear if it’s resolving entities or XML encoding or not. The pretty printer will wrap in the middle of an XML encoded entity. The deprecated #write method completely fails. And, best of all, the formatters ship with ruby 1.8.6 patchlevel 111 for darwin, but not ruby 1.8.6 patchlevel 36 for x86_64-linux. And the website is down, so I can’t pull down the code manually.
The alternatives I’ve looked at haven’t been awesome. LibXML-ruby is experimental, requires nonportable library installations, and has the libxml interface that we know and hate. Hpricot half-supports XML, and doesn’t support namespaces.
Quite frankly, XML parsing is one of those things that needs to be a solved problem before any language can be considered ready for the real world. As far as I can tell, Ruby fails at this point.
I know that it’s the open source world, and I should quit my bitchin’ and write up my own XML parser which doesn’t suck. There are two problems with this — 1) I’m not an XML expert, I don’t want to be an XML expert, and so if I were to write something, it’d take me a very long, very unhappy time and probably be wrong in nonobvious ways; 2) in the same amount of time, I could probably rewrite my application in Groovy/Grails, where XML parsing is easy. This amount of time, though, is a lot more than I want to spend. So if I flushed the time I’ve spent building my application down the toilet because something as common as XML parsing is broke in Ruby, I’m going to be really unhappy.
So someone please, please, please correct me by pointing out a great and glorious and easy-to-use XML library in Ruby.
20 Comments
Seaside (seaside.st) already has that problem licked, being based on Smalltalk which has been around for far longer. In fact, the Universes browser in Squeak uses XML as the payload.
From reading Sam Ruby’s blog, REXML seems to be a pain in the ass… That said, you could ask there for advice, or check whether expat would fit your needs (I’ve never used it, but I’ve seen it mentioned @intertwingly).
My experience with Ruby is exclusively in the realm of Rails. In that world, YAML and JSON are the sensible choices, so there’s no reason to use XML. If you’re integrating with other worlds that do rely on XML, it might be worth using Java libraries through JRuby.
I’ve been looking at FastXML, but I have no idea how well it handles namespaces. It’s just a wrapper for libxml2 and libxslt, but it provides an API similar to Hpricot.
http://github.com/segfault/fastxml/tree/master
Maybe we can start a “Save our XML” fund?
If you moved to JRuby you would have more options.
At a pinch you could port XmlSlurper to JRuby. You could replace the Groovy MetaClass connections with JRuby equivalents. The bulk of the code is not Groovy specific.
Give me a shout if you need some help (I wrote XmlSlurper).
> Seaside (seaside.st) already has that problem licked, being based on Smalltalk which has been around for far longer. In fact, the Universes browser in Squeak uses XML as the payload.
I’m not sure that helps him a lot as he seems to be using Ruby. Otherwise he could just have switched to Python which has several good XML libs.
There is zero guarantee that anything will work when using open source. After all, you get exactly what you pay for. Port everything to C# for the peace of mind and a real guarantee that you won’t get bitten in the ass down the road by an incomplete and/or amateurish framework. And the new LINQ-based XML API in C# leaves everything else in the dust. See some examples to see what I mean. It is functional programming at its best, which means your code will actually resemble the final XML output. How cool is that?
@Wheelwright
Just because I’m paying MSFT a load of money doesn’t mean that I’m not going to get an incomplete and/or amateurish framework. The .Net standard library is demonstration enough of that. So, no, thank you, I’ll skip the vendor lock-in.
@John Wilson
There are at least a few differences between Groovy and Ruby which will lead to differences between XmlSlurper and its port. For one thing, the dot-string stunt in Groovy for funny method names doesn’t work.
Robert,
does your application need to deal with elements whose names are not valid Ruby names? It does make accessing attributes a bit of a problem, though.
Of course, I think that your best bet is to abandon Ruby and embrace a more modern programming language :)
Whilst I do (obviously) prefer open source solutions, I would not dismiss LINQ. MS have some very good people and it’s not a good idea to ignore what they are doing.
In my experience it’s surprisingly easy to re-implement an existing app in a new language.
@John Wilson
I’m not saying anything bad or dismissive about LINQ or MSFT’s labs (F#, for instance, is nifty/interesting). However, to adopt LINQ requires me to shift to the .Net framework, and I’ll pass on that. I was also responding to the assertion that open source software is somehow inherently inferior to proprietary software.
The Ruby/Rails solution seemed like the right one, but this XML thing has been frustrating. There are also threading problems: I’d like to throw off a thread to do background processing, and that’s nontrivial with Rails. So, instead, I’m polling for changes.
I really am more efficient on the view and controller development in Rails, and the plugins rock, but you hit these nontrivial things and life sucks.
@John Wilson
And, yes — I need to navigate nodes prefixed with namespaces. So those aren’t valid Ruby identifiers.
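One way a Ruby port could sidestep the identifier problem is to offer bracket lookup for qualified names and keep dot navigation for plain ones. The sketch below is hypothetical: `Slurp` is an invented class layered over REXML, not an existing library and not a port of XmlSlurper. It also matches on the literal prefix used in the document rather than on the namespace URI, which is a simplification a real port would want to revisit.

```ruby
require "rexml/document"

# Hypothetical XmlSlurper-style wrapper over REXML elements.
class Slurp
  def initialize(nodes)
    # Kernel#Array is avoided on purpose: REXML elements respond to to_a
    # with their children, which would silently unwrap a single element.
    @nodes = nodes.is_a?(::Array) ? nodes : [nodes]
  end

  # Child lookup by qualified name, e.g. slurp["m:id"].
  # Simplification: compares the prefix as written, not the namespace URI.
  def [](qname)
    Slurp.new(@nodes.flat_map { |n| n.elements.select { |c| c.expanded_name == qname } })
  end

  # Concatenated text of all matched elements.
  def text
    @nodes.map(&:text).join
  end

  def size
    @nodes.size
  end

  # Dot navigation for names that happen to be valid Ruby identifiers.
  def method_missing(name, *args)
    return super if !args.empty? || name.to_s.start_with?("to_")
    self[name.to_s]
  end

  def respond_to_missing?(name, include_private = false)
    !name.to_s.start_with?("to_") || super
  end
end

# Made-up sample document for the usage lines below.
xml = '<order xmlns:m="http://example.com/meta">' \
      "<m:id>42</m:id><payload><item>Tea</item><item>Milk</item></payload></order>"
order = Slurp.new(REXML::Document.new(xml).root)

puts order["m:id"].text       # prefixed name via brackets
puts order.payload.item.size  # plain names via dot navigation
```

The `to_` guard keeps implicit conversions such as `to_ary` from being mistaken for element names.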
“There is zero guarantee that anything will work when using open source.”
lol…you drank a lot of Microsoft Kool-aid. You need to get your head out of your ass. Microsoft solutions are shit unless you’re building apps that fit tightly within their definition of how life should be. If you need to step outside the boundaries for any reason whatsoever, you’re fucked. The company writes software for people who don’t really know how to code. It’s a drag&drop world of stupidity where the average .NET developer doesn’t know the simplest of CS concepts. Go back to your safe little crib and let the big boys write some real code.
“And the new LINQ-based XML API in C# leaves everything else in the dust. ”
I’d rather use Groovy’s XML processing features over LINQ/C# any day.
Finally, Msft can’t even write a good IDE. Visual Studio is a piece of shit that doesn’t have half the features that a modern Java IDE has. Even Eclipse (which is free) kicks its ass. Throw in IntelliJ and it’s just flat out a complete ass kicking.
@Marc
I happen to agree with you (although I know nothing of LINQ), but if you’re going to take that kind of tone around here, I’m going to have to moderate your comments.
I can definitely relate. XML is a very powerful tool that (when supported) makes many tasks much easier. While it seems most folks are pushing towards Java/C#/JVM Language, I’d suggest one of the following:
1. Give things a go with libxml-ruby. This honestly feels like the best option for creating a decent XML library for Ruby. REXML uses regex for parsing the tree instead of a real parser that creates its own internal tree. Using libxml gets you around this sort of issue and back into working on creatively using Ruby. I’m not saying this is trivial, but it seems like the best bet if you don’t want to leave Ruby/Rails.
2. Python is a great option. There are a multitude of frameworks and libraries that make becoming productive easy for a wide variety of developers and applications. Python also has excellent XML support in Amara/4Suite. There are also other tools such as lxml and ElementTree. If you are not stuck in Ruby, Python really is a great choice in terms of language expressiveness and a huge community of stable libraries.
Good luck!
@Robert
My apologies to you and Wheelwright. Microsoft development is a hot button issue for me since I had to deal with that world at one time, and once you take it up the rear-end from those guys a few times, it makes you a tad angry. It makes me even more angry that people still buy into their lies. It’s because of Microsoft and their desire to push VB as a viable OO platform that there are still people who think composition > inheritance when they don’t even understand “is a” and “has a” relationships. That company has done everything they can possibly do to make software development a cesspool. Gates himself thinks that US developers are untalented/uneducated and so you need his simple toys to make stuff happen.
Finally, if you want to moderate my comments…that’s cool. I’m an ass at least half the time. :)
@Marc: “I’d rather use Groovy’s XML processing features over LINQ/C# any day.”
Being the curious guy that I am, I looked at a Groovy example of processing XML here: http://today.java.net/pub/a/today/2004/08/12/groovyxml.html
After much gymnastics the author got the solution down to this in Groovy:
    public interface GroovyNodeIterator
    {
        public Object process( DOMNodeGroovyObject doc );
    }

    import groovy.lang.GroovyClassLoader;
    …
    public static Map calculateAmounts( Document doc ) throws Exception
    {
        GroovyClassLoader groovyLoader = new GroovyClassLoader();
        GroovyShell shell = new GroovyShell( groovyLoader, new Binding() );
        Class adderClass = groovyLoader.loadClass( "AmountAdder" );
        GroovyNodeIterator obj = (GroovyNodeIterator)adderClass.newInstance();
        return (Map)obj.process( new DOMNodeGroovyObject( doc ) );
    }

    class AmountAdder implements GroovyNodeIterator
    {
        public Object process( DOMNodeGroovyObject doc )
        {
            accountValues = [:];
            for( accountNode in doc.xpath( "/transactions/account" ) )
            {
                accountValues[ accountNode.id ] = 0;
                for( transNode in accountNode.xpath( "transaction" ) )
                {
                    accountValues[ accountNode.id ] +=
                        Integer.valueOf( transNode.amount );
                }
            }
            return accountValues;
        }
    }
Which is the equivalent of ONE line of code in C#/XLINQ (line breaks added for clarity):
    from account in xDoc.Descendants("account")
    group account by account.Attribute("id") into accountGroup
    select new
    {
        id = accountGroup.Key.Value,
        amount = accountGroup.Elements("transaction").Sum(t => (int)t.Attribute("amount"))
    };
As I said LINQ-based XML is light years beyond anything open-source or Java can throw at it.
That’s not Groovy. That’s just some guy using Groovy to call Java code to parse XML.
Try this: http://groovy.codehaus.org/Reading+XML+using+Groovy%27s+XmlSlurper
or this: http://groovy.codehaus.org/Reading+XML+using+Groovy%27s+XmlParser
and then come back and talk about light years and OSS.
@Marc
Solve the problem I have just solved in ONE line of code, with the output being a STRONGLY TYPED IntelliJ aware object (no lame weakly-typed hashtables etc. :) and we will talk.
@Wheelwright
Or how about you stop bothering me with all this typing BS that I don’t care about…being that I’m using a type-optional language and all. If your best argument for LINQ’s greatness is to change the rules to help it look better, then you’re not doing so hot, friend.
I’m not going to sit here and pretend that LINQ sucks. I don’t think that’s the case at all. However, referring to it as light years ahead of anything else is just flat out wrong. That may be true in the .NET world, but it’s clear you don’t even know what else is out there beyond .NET, so I don’t see how you can make such a ridiculous assertion.
For help with namespaces and libxml-ruby, check out http://thebogles.com/blog/an-hpricot-style-interface-to-libxml/
Cheers, Peter.
2 Trackbacks
[...] like I was a bit hasty in my condemnation of Ruby XML parsing: Hpricot apparently handles namespaces okay through the magical .%() syntax. This tip is shared by [...]
[...] I was struggling with XML parsing in Ruby, the consensus was to try out libxml. I got on the devel mailing list in preparation for giving it [...]