Wow … I am trying to use REXML to parse through an 8.8 MB XML file …
it currently eats almost 800 MB of RAM before it seems to do anything …
Does anybody have any tips on getting REXML to run faster and/or
smaller?
I know it’s slow just because it’s pure Ruby … and there’s a lot going
on … but I can sit here for many minutes just waiting for ANY
console output showing that it’s actually gotten to the first
root.elements.each( xpath_expr ) iteration …
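For reference, the tree-parsing pattern being described looks roughly like this; the element names and data below are invented for illustration, not taken from the original script:

```ruby
require 'rexml/document'

# REXML builds the entire document tree in memory before any iteration
# can start -- which is why a large file stalls long before the first
# root.elements.each loop runs.
xml = "<log><entry id='1'/><entry id='2'/></log>"
doc = REXML::Document.new(xml)
ids = []
doc.root.elements.each("entry") { |e| ids << e.attributes["id"] }
ids  # => ["1", "2"]
```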
It uses libxml2 for the parsing, and as such is quite speedy.
Tom
I had to make two fixes to the source to get things to compile:
ruby_xml_parser.c and ruby_xml_document.c both needed to have #include
“stdarg.h” added … the compiler wasn’t happy about trying to deal
with the va_list data type without it.
But, it’s compiling now … just thought I’d pass the information along
for ya.
After modifying my script to use the libxml binding … it’s sitting at
about 220 MB used instead of 800+ MB … ( better ) … and only takes
10-20 seconds to start iterating over data …
> Wow … I am trying to use REXML to parse through an 8.8 MB XML file …
> it currently eats almost 800 MB of RAM before it seems to do anything …
At that file size, I’d also start thinking about biting into the
bitter pill and using stream / pull parsing instead of tree parsing.
Even with a C parser, a DOM buildup is not going to do much good
for performance if you need to do processing at that scale more than
seldom. But then again, there’s the premature-optimization quote that
says to wait with that just yet.
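For comparison, here is a minimal stream-parsing sketch using REXML’s own StreamListener (so it works without any new dependency, just without the C-level speed); the element names are made up for illustration, not taken from the poster’s data:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects the text of every <title> element without ever building a
# full document tree -- memory use stays flat regardless of file size.
class TitleCollector
  include REXML::StreamListener
  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def tag_start(name, attrs)
    @in_title = (name == "title")
  end

  def text(data)
    @titles << data if @in_title
  end

  def tag_end(name)
    @in_title = false if name == "title"
  end
end

xml = "<pages><page><title>One</title></page><page><title>Two</title></page></pages>"
listener = TitleCollector.new
REXML::Document.parse_stream(xml, listener)
listener.titles  # => ["One", "Two"]
```

The same listener can be handed a File object instead of a string, which is where the memory savings actually show up.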
I recently ported the (freeware) Chilkat XML parser to Ruby, but it
only runs on Windows. I’m curious to see how it performs in
comparison. Do you have a simple example w/ data that I can use to
convert to Chilkat XML? I’ll be happy to write the code…
> After modifying my script to use the libxml binding … it’s sitting at
> about 220 MB used instead of 800+ MB … ( better ) … and only takes
> 10-20 seconds to start iterating over data …
WOW.
You might try optimizing your XPath query. I’m no expert at this (or
even knowledgeable), but I did find in the past that changing the XPath
sometimes made a drastic difference in performance.
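One common tweak along those lines (general XPath advice, not specific to the poster’s actual query, which we don’t see) is replacing a descendant-axis search like `//item` with an explicit absolute path, so the engine doesn’t have to scan every node in the document:

```ruby
require 'rexml/document'

doc = REXML::Document.new("<root><items><item/><item/></items></root>")

# "//item" examines every node in the document looking for matches ...
descendant = REXML::XPath.match(doc, "//item")

# ... while an absolute path only walks the steps it names.
direct = REXML::XPath.match(doc, "/root/items/item")

descendant.size  # => 2
direct.size      # => 2
```

Both queries return the same nodes here; on a large document, the second form gives the XPath engine far less work to do.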
It’s a good job I try to keep up with happenings on ruby-talk! Thanks
for posting about this - it’s fixed in CVS now.
Also, given your input data, you might be interested to know that I’m
currently working on a developmental branch for libxml-ruby 0.4, which
includes a new, faster SAX callback interface (among many other
changes). The branch name is DEV_0_4, and it’s getting to be quite
stable now.
> Wow … I am trying to use REXML to parse through an 8.8 MB XML file …
> it currently eats almost 800 MB of RAM before it seems to do anything …

> At that file size, I’d also start thinking about biting into the
> bitter pill and using stream / pull parsing instead of tree parsing.
> Even with a C parser, a DOM buildup is not going to do much good
> for performance if you need to do processing at that scale more than
> seldom. But then again, there’s the premature-optimization quote that
> says to wait with that just yet.
I would not necessarily call that premature optimization. If these
kinds of files are to be parsed frequently, and if only a portion of
them needs extracting, then I would also go down the stream-parser road.
In my experience, stream parsers are also appropriate if you have to
transform the XML tree of a document into some other object structure.
IMHO the coding effort for transforming a DOM into another object tree
vs. doing the same with the stream approach is roughly equivalent. And
runtime-wise you save yourself one whole tree traversal by going stream.
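A sketch of that stream-to-object-tree idea, using REXML’s StreamListener and a simple stack; the Hash-based node shape here is just an illustration, not any particular library’s API:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Builds a plain Hash tree directly from stream events, skipping the
# intermediate DOM entirely: one pass over the input, no second
# traversal to convert a document tree into our own structure.
class TreeBuilder
  include REXML::StreamListener
  attr_reader :root

  def initialize
    @stack = []
  end

  def tag_start(name, attrs)
    node = { name: name, children: [] }
    @stack.last[:children] << node unless @stack.empty?
    @stack << node
  end

  def tag_end(name)
    # The last element popped is the document root.
    @root = @stack.pop
  end
end

builder = TreeBuilder.new
REXML::Document.parse_stream("<a><b/><c/></a>", builder)
builder.root[:name]                            # => "a"
builder.root[:children].map { |c| c[:name] }   # => ["b", "c"]
```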
magic/xml has an extremely convenient stream-parsing interface.
It’s based on REXML, so it’s pretty slow, but it handles XMLs
hundreds of MBs big using just a few MBs of memory.
The idea is simple - you give it a block, and the block
keeps getting incomplete subtrees. It can either decide
to complete the current subtree (all children read to memory),
or to get inside it.
It’s something like:

    XML.parse_as_twigs(STDIN) {|node|
      next unless node.name == :page
      node.complete!     # Read all children of … node
      t = node[:@title]  # :@title is a child
      i = node[:@id]     # :@id is another child
      print "#{i}: #{t}\n"
    }
I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl’s XML::Twig).
> I think subtree-based parsers are a great tradeoff between
> convenience of read-everything parsers and low memory use
> of stream-based parsers. Deciding inside a block seems
> much more natural than predefining matched tags (like
> in Perl’s XML::Twig).
Back in the world of J… there are these libs (nux and dom4j and
probably more). They let you stream-parse and register callbacks for
XPath expressions. Whenever a registered XPath is encountered, it
invokes the callback for that XPath with a DOM object (not W3C
DOM…) for the complete subtree. This is very convenient and raises
the abstraction a bit (the XPath part) from what seems to be your
approach. They don’t allow full XPath, but only those parts that make
sense in this context.
Anyway, look into it, it’s very nice.
/Marcus
ps. I think XML processing tools suck quite a bit in Ruby (I love
Ruby…). You cannot do high-performance processing in a cross-platform
way (as far as I know). libxml on *nix or MSXML on win (since
REXML sucks performance-wise). It’s kind of sad. Is it impossible to
make libxml/libxslt work on Windows?
Right, can ANYONE explain this braindead fad to me?
Hint: No matter what some of the more loudmouthed bloggers would like to
insinuate in the massive ongoing circlejerk of FUD (from both the Ruby
and the Java side of things):
A) There is no conspiracy of panicking Java (yes, that IS the word)
developers desperately trying to eradicate Ruby in fear for their jobs
B) Having more advanced development tools doesn’t increase your penis
size nor girth
C) Being able to code without advanced development tools doesn’t
increase your penis size nor girth
D) Blog commenters that swoon over keypress-count comparisons aren’t
visionaries that have Seen The Truth; they’re hapless muppets without
much attention span and too much time on their hands. People that get
actual work done can tell what’s completely irrelevant to actual
practice and just so much waste of webspace and bandwidth
E) No matter how long, or with what fervency, you compare apples to
oranges, they won’t taste equally good to all people
Now, is there any chance the general audience of this mailing list will
ever be able to mention other programming languages for the sake of
comparison without in some way indicating revilement of such or
reluctance to do so?
David V.
PS: I wonder how many people will see this considering points B and C
are likely to send spam filters into a hissy fit.
> It uses libxml2 for the parsing, and as such is quite speedy.
I can vouch for that. I changed a bit of slow code from using REXML to
libxml, with fairly minor alterations. The work didn’t take long, and
it made a huge difference: