Wow … I am trying to use REXML to parse through an 8.8 MB XML file …
it currently eats almost 800 MB of RAM before it seems to do anything …
Does anybody have any tips on getting REXML to run faster and/or
smaller?
I know it’s slow just because it’s pure Ruby … and there’s a lot going
on … but I can sit here for many minutes just waiting for ANY
console output showing that it’s actually gotten to the first
root.elements.each( xpath_expr ) iteration …
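For reference, the tree-parsing pattern being described looks roughly like this; the element names and data below are invented for illustration, not taken from the original script:

```ruby
require 'rexml/document'

# REXML builds the entire document tree in memory before any iteration
# can start -- which is why a large file stalls long before the first
# root.elements.each loop runs.
xml = "<log><entry id='1'/><entry id='2'/></log>"
doc = REXML::Document.new(xml)
ids = []
doc.root.elements.each("entry") { |e| ids << e.attributes["id"] }
ids  # => ["1", "2"]
```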
It uses libxml2 for the parsing, and as such is quite speedy.
Tom
I had to make two fixes to the source to get things to compile:
ruby_xml_parser.c and ruby_xml_document.c both needed to have #include
“stdarg.h” added … the compiler wasn’t happy about trying to deal
with the va_list data type without it.
But, it’s compiling now … just thought I’d pass the information along
for ya.
After modifying my script to use the libxml binding … it’s sitting at
about 220 MB used instead of 800+ MB … ( better ) … and only takes
10-20 seconds to start iterating over data …
> Wow … I am trying to use REXML to parse through an 8.8 MB XML file …
> it currently eats almost 800 MB of RAM before it seems to do anything …
At that file size, I’d also start thinking about biting into the
bitter pill and using stream / pull parsing instead of tree parsing.
Even with a C parser, a DOM buildup is not going to do much good
for performance if you need to do processing at that scale more than
seldom. But then again, there’s the premature-optimization quote that
says to wait with that just yet.
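For comparison, here is a minimal stream-parsing sketch using REXML’s own StreamListener (so it works without any new dependency, just without the C-level speed); the element names are made up for illustration, not taken from the poster’s data:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects the text of every <title> element without ever building a
# full document tree -- memory use stays flat regardless of file size.
class TitleCollector
  include REXML::StreamListener
  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def tag_start(name, attrs)
    @in_title = (name == "title")
  end

  def text(data)
    @titles << data if @in_title
  end

  def tag_end(name)
    @in_title = false if name == "title"
  end
end

xml = "<pages><page><title>One</title></page><page><title>Two</title></page></pages>"
listener = TitleCollector.new
REXML::Document.parse_stream(xml, listener)
listener.titles  # => ["One", "Two"]
```

The same listener can be handed a File object instead of a string, which is where the memory savings actually show up.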
I recently ported the (freeware) Chilkat XML parser to Ruby, but it
only runs on Windows. I’m curious to see how it performs in
comparison. Do you have a simple example w/ data that I can use to
convert to Chilkat XML? I’ll be happy to write the code…
> After modifying my script to use the libxml binding … it’s sitting at
> about 220 MB used instead of 800+ MB … ( better ) … and only takes
> 10-20 seconds to start iterating over data …
WOW.
You might try optimizing your XPath query. I’m no expert at this (or
even knowledgeable), but I did find in the past that changing the XPath
sometimes made a drastic difference in performance.
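One common tweak along those lines (general XPath advice, not specific to the poster’s actual query, which we don’t see) is replacing a descendant-axis search like `//item` with an explicit absolute path, so the engine doesn’t have to scan every node in the document:

```ruby
require 'rexml/document'

doc = REXML::Document.new("<root><items><item/><item/></items></root>")

# "//item" examines every node in the document looking for matches ...
descendant = REXML::XPath.match(doc, "//item")

# ... while an absolute path only walks the steps it names.
direct = REXML::XPath.match(doc, "/root/items/item")

descendant.size  # => 2
direct.size      # => 2
```

Both queries return the same nodes here; on a large document, the second form gives the XPath engine far less work to do.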
It’s a good job I try to keep up with happenings on ruby-talk! Thanks
for posting about this - it’s fixed in CVS now.
Also, given your input data, you might be interested to know that I’m
currently working on a developmental branch for libxml-ruby 0.4, which
includes a new, faster SAX callback interface (among many other
changes). The branch name is DEV_0_4, and it’s getting to be quite
stable now.
> Wow … I am trying to use REXML to parse through an 8.8 MB XML file …
> it currently eats almost 800 MB of RAM before it seems to do anything …

> At that file size, I’d also start thinking about biting into the
> bitter pill and using stream / pull parsing instead of tree parsing.
> Even with a C parser, a DOM buildup is not going to do much good
> for performance if you need to do processing at that scale more than
> seldom. But then again, there’s the premature-optimization quote that
> says to wait with that just yet.
I would not necessarily call that premature optimization. If these
kinds of files are to be parsed frequently, and if only a portion of
them needs extracting, then I would also go down the stream-parser road.
In my experience, stream parsers are also appropriate if you have to
transform the XML tree of a document into some other object structure.
IMHO the coding effort for transforming a DOM into another object tree
vs. doing the same with the stream approach is roughly equivalent. And
runtime-wise you save yourself one whole tree traversal by going stream.
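A sketch of that stream-to-object-tree idea, using REXML’s StreamListener and a simple stack; the Hash-based node shape here is just an illustration, not any particular library’s API:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Builds a plain Hash tree directly from stream events, skipping the
# intermediate DOM entirely: one pass over the input, no second
# traversal to convert a document tree into our own structure.
class TreeBuilder
  include REXML::StreamListener
  attr_reader :root

  def initialize
    @stack = []
  end

  def tag_start(name, attrs)
    node = { name: name, children: [] }
    @stack.last[:children] << node unless @stack.empty?
    @stack << node
  end

  def tag_end(name)
    # The last element popped is the document root.
    @root = @stack.pop
  end
end

builder = TreeBuilder.new
REXML::Document.parse_stream("<a><b/><c/></a>", builder)
builder.root[:name]                            # => "a"
builder.root[:children].map { |c| c[:name] }   # => ["b", "c"]
```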
magic/xml has an extremely convenient stream-parsing interface.
It’s based on REXML, so it’s pretty slow, but it handles XMLs
hundreds of MBs big using just a few MBs of memory.
The idea is simple - you give it a block, and the block
keeps getting incomplete subtrees. It can either decide
to complete the current subtree (all children read to memory),
or to get inside it.
It’s something like:

    XML.parse_as_twigs(STDIN) {|node|
      next unless node.name == :page
      node.complete!     # Read all children of … node
      t = node[:@title]  # :@title is a child
      i = node[:@id]     # :@id is another child
      print "#{i}: #{t}\n"
    }
I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl’s XML::Twig).
> I think subtree-based parsers are a great tradeoff between
> convenience of read-everything parsers and low memory use
> of stream-based parsers. Deciding inside a block seems
> much more natural than predefining matched tags (like
> in Perl’s XML::Twig).
Back in the world of J… there are these libs (nux and dom4j and
probably more). They let you stream-parse and register callbacks for
XPath expressions. Whenever a registered XPath is encountered, it
invokes the callback for that XPath with a DOM object (not W3C
DOM…) for the complete subtree. This is very convenient and raises
the abstraction a bit (the XPath part) from what seems to be your
approach. They don’t allow full XPath, but only those parts that make
sense in this context.
Anyway, look into it, it’s very nice.
/Marcus
ps. I think XML processing tools suck quite a bit in Ruby (I love
Ruby…). You cannot do high-performance processing in a cross-platform
way (as far as I know). libxml on *nix or MSXML on win (since
REXML sucks performance-wise). It’s kind of sad. Is it impossible to
make libxml/libxslt work on Windows?
Right, can ANYONE explain this braindead fad to me?
Hint: No matter what some of the more loudmouthed bloggers would like to
insinuate in the massive ongoing circlejerk of FUD (from both the Ruby
and the Java side of things):
A) There is no conspiracy of panicking Java (yes, that IS the word)
developers desperately trying to eradicate Ruby in fear for their jobs
B) Having more advanced development tools doesn’t increase your penis
size nor girth
C) Being able to code without advanced development tools doesn’t
increase your penis size nor girth
D) Blog commenters that swoon over keypress-count comparisons aren’t
visionaries that have Seen The Truth; they’re hapless muppets without
much attention span and too much time on their hands. People that get
actual work done can tell what’s completely irrelevant to actual
practice and just so much waste of webspace and bandwidth
E) No matter how long, or with what fervency, you compare apples to
oranges, they won’t taste equally good to all people
Now, is there any chance the general audience of this mailing list will
ever be able to mention other programming languages for the sake of
comparison without in some way indicating revilement of such or
reluctance to do so?
David V.
PS: I wonder how many people will see this considering points B and C
are likely to send spam filters into a hissy fit.
> It uses libxml2 for the parsing, and as such is quite speedy.
I can vouch for that. I changed a bit of slow code from using REXML to
libxml, with fairly minor alterations. The work didn’t take long, and
it made a huge difference: