Fast XML parser, other than libxml

Hello all,

I am looking for a fast XML parser, other than libxml (REXML is not fast
enough, and Hpricot won’t do this time - I need ‘real’ XPaths etc).

Some time ago I read about xaggly, but now the site seems to be dead.

Any other suggestions?

Cheers,
Peter
_
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby

On 4/3/07, Peter S. [email protected] wrote:

I am looking for a fast XML parser, other than libxml (REXML is not fast
enough, and Hpricot won’t do this time - I need ‘real’ XPaths etc).

libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML). If you’re really looking for speed, you’ll go
with a streaming approach (SAX or otherwise, potentially from libxml).
What sort of “real” XPaths do you need? XPath 1.0? 2.0?
Deep-lookahead/behind? Do you have huge source documents?
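To make the DOM vs. streaming distinction concrete, here is a minimal sketch using REXML's own stream API (the hypothetical ElementCounter class is just for illustration): events are handled as they arrive and no tree is ever built.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Counts elements as parse events arrive, never building a tree.
class ElementCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, attrs)
    @count += 1
  end
end

listener = ElementCounter.new
REXML::Document.parse_stream("<root><a/><b><c/></b></root>", listener)
puts listener.count  # => 4
```

Memory stays flat regardless of document size, which is where the speed advantage over a DOM build comes from.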

Keith

Keith F. wrote:

libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML).

Sorry, I did not express myself clearly. I definitely need a DOM-based
approach, but REXML is a lot slower than libxml, and libxml can be a
PITA to install on some platforms/distros (e.g. it took quite some time
on my ubuntu box, because neither gem install nor apt-get wanted to
install the newest version which I needed).

The catch is that I would like to use this in my web scraping framework,
scRUBYt! - and of course a dependency on libxml would mean that everybody
who wants to install scRUBYt! would have to install libxml too. I
got tons of support requests from Ubuntu users who had problems
installing mechanize on Ubuntu (it depends on libssl-ruby there),
so I guess this number would be much higher in the case of libxml, which
has far funkier dependencies.

If there is no better possibility, I will go with libxml despite this
(this is my only concern; otherwise libxml is fine) - but it would be
better to have something install-friendly…

What sort of “real” XPaths do you need? XPath 1.0? 2.0?
Real in the sense that it is not Hpricot XPath, which ATM cannot even
do

/my/stuff/is/@cool

let alone more complex expressions.

I guess XPath 1.0 would be completely enough (maybe even Hpricot’s, with
a few additions) - I really don’t need anything complicated.
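For reference, that kind of attribute selection does work in REXML (the document here is hypothetical, just to illustrate the expression):

```ruby
require 'rexml/document'
require 'rexml/xpath'

# An attribute-axis XPath like the one above, evaluated with REXML.
doc  = REXML::Document.new("<my><stuff><is cool='yes'/></stuff></my>")
attr = REXML::XPath.first(doc, "/my/stuff/is/@cool")
puts attr.value  # => "yes"
```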

Deep-lookahead/behind? Do you have huge source documents?
Well, I am actually first building this document from what I have
scraped, so I have pretty much control over it (if is too big, I just
say stop and put the other records to a new doc etc.) so this is not
really the problem.

I really just need a fast XML parser which is easy to install, that’s
all. scRUBYt! is a high-level framework, aimed also at non-programmers,
so I cannot expect all my potential users to be handy with Debian’s
package policy and the joys of installing libxml on win32 :)

Cheers,
Peter
_
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby

On 04.04.2007 10:53, Peter S. wrote:

I really just need a fast XML parser which is easy to install, that’s
all. scRUBYt! is a high-level framework, aimed also at non-programmers,
so I can not expect that all my potential users are handy with debian’s
package policy and the joys of libxml installing on win32 :)

Maybe then you’ll simply have to decide whether ease of use or
performance is more important to you.

Kind regards

robert

Robert K. wrote:

On 04.04.2007 10:53, Peter S. wrote:

I really just need a fast XML parser which is easy to install, that’s
all. scRUBYt! is a high-level framework, aimed also at
non-programmers, so I can not expect that all my potential users are
handy with debian’s package policy and the joys of libxml installing
on win32 :)

Maybe then you’ll simply have to decide whether ease of use or
performance is more important to you.

Should I interpret this as ‘decide between REXML and libxml’?
There are really no other alternatives?

Cheers,
Peter
_
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby

On Apr 4, 2007, at 6:15 AM, Robert K. wrote:

performance is more important to you.
Should I interpret this as ‘decide between REXML and libxml’?
There are really no other alternatives?

AFAIK REXML is the only pure Ruby XML parser - and it comes with
the standard distribution.

Sounds like it is time for FasterXML. :)

James Edward G. II

On 4/4/07, Peter S. [email protected] wrote:

libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML).

Sorry, I did not express myself clearly. I definitely need a DOM-based
approach, but REXML is a lot slower than libxml, and libxml can be a
PITA to install on some platforms/distros (e.g. it took quite some time
on my ubuntu box, because neither gem install nor apt-get wanted to
install the newest version which I needed).

Yeah, you’re right about libxml being a pain to install. If you hadn’t
cared about installability, I was going to suggest JRuby + (some Java
parser)…

I guess XPath 1.0 would be completely enough (maybe even Hpricot’s, with
a few additions) - I really don’t need anything complicated.

Yeah, sorry that I don’t know of any others.

JEG II wrote:

Sounds like it is time for FasterXML. :)

Know of any good starting points? All the XPath 1.0 work I do is off
of libxml and all of the XPath 2.0 is off of Saxon (Java), so I’m not
sure what should be copied.

Keith

On 04.04.2007 12:00, Peter S. wrote:

Should I interpret this as ‘decide between REXML and libxml’?
There are really no other alternatives?

AFAIK REXML is the only pure Ruby XML parser - and it comes with the
standard distribution. All others will likely have issues similar to
libxml’s, I guess.

robert

James Edward G. II wrote:

Maybe then you’ll simply have to decide whether ease of use or
performance is more important to you.
Should I interpret this as ‘decide between REXML and libxml’?
There are really no other alternatives?

AFAIK REXML is the only pure Ruby XML parser - and it comes with the
standard distribution.

Sounds like it is time for FasterXML. :)

One pointer: REXML comes with quite a fast pull parser, and it should be
possible to base a lightweight XML document lib on that. (The
documentation says that the API should not be considered stable, but I’m
sure that could be resolved with the REXML author.)

As a proof of concept, see the attached code. We use it in our company
to load and process XML files generated by our tools and OpenOffice
Calc.
I just tested it on a 1MB XML from an .ods file, which it loaded
successfully in < 2 seconds.
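For readers of the archive, where the attachment is lost: a minimal sketch of the pull-parser style Dennis describes (not his original code, just an illustration with REXML::Parsers::PullParser):

```ruby
require 'rexml/parsers/pullparser'

# Walk events one at a time, collecting the text of each <item>.
parser = REXML::Parsers::PullParser.new(
  "<doc><item>one</item><item>two</item></doc>"
)
items = []
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == 'item'
    text = parser.pull
    items << text[0] if text.text?
  end
end
puts items.inspect  # => ["one", "two"]
```

Unlike the stream API, the caller drives the loop here, which makes it easier to build small document structures on top.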

Writing a fast XPath implementation to match this might be quite a
challenge, though. ;)

Dennis

On Apr 4, 2007, at 8:34 AM, Keith F. wrote:

JEG II wrote:

Sounds like it is time for FasterXML. :)

Know of any good starting points? All the XPath 1.0 work I do is off
of libxml and all of the XPath 2.0 is off of Saxon (Java), so I’m not
sure what should be copied.

Not really. I was mostly just making a joke about FasterCSV’s name
and how it was born.

I do think it’s possible to get better performance than REXML offers
without resorting to C, though C would be faster still, naturally. I
do have some ideas about this, but I haven’t actually spent the time
to see if I could get a prototype running to prove them.

James Edward G. II

On Apr 4, 12:00 pm, Peter S. [email protected] wrote:

Should I interpret this as ‘decide between REXML and libxml’?
There are really no other alternatives?

You may find Tim B.'s recent in-depth experiments in attempting to
write a fast pure-Ruby XML parser instructive and informative:

http://www.tbray.org/ongoing/When/200x/2006/11/09/Optimizing-Ruby
http://www.tbray.org/ongoing/When/200x/2006/11/15/RS-Redux

On Apr 4, 2007, at 5:05 PM, Arto Bendiken wrote:

On Apr 4, 12:00 pm, Peter S. [email protected] wrote:

Should I interpret this as ‘decide between REXML and libxml’?
There are really no other alternatives?

You may find Tim B.'s recent in-depth experiments in attempting to
write a fast pure-Ruby XML parser instructive and informative:

http://www.tbray.org/ongoing/When/200x/2006/11/09/Optimizing-Ruby
http://www.tbray.org/ongoing/When/200x/2006/11/15/RS-Redux

And:

http://www.tbray.org/ongoing/When/200x/2006/11/23/RX-plus-YARV

The series is an interesting read. Tim’s pretty focused on the
character based parsing and in my experience that’s always death in
Ruby. It’s the primary reason the standard CSV library is so slow,
for example.

He says it’s because Ruby’s regex engine isn’t really up to the task
of handling non-UTF-8 input. I’m pretty sure I understand why that
is, but he also basically admits that at least the lexing stage of
XML reading is just looking for < and &. I guess the problem becomes
that a UTF-16 document actually encodes that as two bytes? Well,
surely the regex could be adapted to handle that. In fact, the key
expressions could be swapped out for encoding-aware replacements.
Then we can keep playing to Ruby’s strengths, I hope. Or is it true
that there are some encodings we can’t effectively build expressions
for?
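The point about the lexer, made concrete (using modern Ruby's String#encode for illustration; 1.8 strings at the time were plain byte arrays, which is the root of the trouble):

```ruby
# '<' is one byte in UTF-8 but two in UTF-16, which is what trips up
# a byte-oriented regex lexer fed UTF-16 input.
puts "<".encode("UTF-8").bytes.inspect     # => [60]
puts "<".encode("UTF-16BE").bytes.inspect  # => [0, 60]

# The lexing step described above: find markup delimiters with one regex.
xml = "text <a>more &amp; stuff</a>"
puts xml.scan(/[<&]/).inspect  # => ["<", "&", "<"]
```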

Sorry for thinking out loud here. I’m just trying to better
understand Tim’s logic. It’s interesting stuff.

I’ll go try to read his code now and see what else I can learn…

James Edward G. II

On 4/4/07, James Edward G. II [email protected] wrote:

The series is an interesting read. Tim’s pretty focused on the
character based parsing and in my experience that’s always death in
Ruby. It’s the primary reason the standard CSV library is so slow,
for example.

Is the inverse the reason that FasterCSV is so fast (because it uses
regular expressions)?

Thanks,
Keith

On Apr 4, 2007, at 7:53 PM, Keith F. wrote:

On 4/4/07, James Edward G. II [email protected] wrote:

The series is an interesting read. Tim’s pretty focused on the
character based parsing and in my experience that’s always death in
Ruby. It’s the primary reason the standard CSV library is so slow,
for example.

Is the inverse the reason that FasterCSV is so fast (because it uses
regular expressions)?

That is one of the two key reasons, yes:

  1. This first one is summarized by this comment from Aristotle
    Pagaltzis in Tim’s RX article series, “The fastest way to do
    something in Perl is frequently the one that implements the most
    costly step in the fewest ops. You can substitute Ruby, Python or
    the like for Perl; the basic statement holds in any case. For string
    processing, it generally means doing as much work as possible with
    pattern matching. The more time you spend inside the VM’s
    implementation of its opcodes rather than inside the opcode loader/
    dispatcher, the faster the code will go.” For a comparison, have a
    peek at CSV::parse_body(). It’s CSV’s primary parser and it has a
    lot of steps.

  2. Method calls are expensive in Ruby. You can see that CSV is
    calling things all over the place. For example, if you call
    CSV::parse() the primary call chain is something like:

CSV::parse()
CSV::Reader::create()
CSV::IOReader::new() # or StringReader
CSV::Reader#each()
CSV::IOReader#get_row()
CSV::parse_row()
CSV::parse_body()

The same call chain for FasterCSV is:

FasterCSV::parse()
FasterCSV::new()
FasterCSV::each()
FasterCSV::shift()

The object construction doesn’t much matter, because it’s one-time
cost stuff. But look at each() down in both examples. CSV is
iterating through a three-method call chain; FasterCSV is just
iterating through one. That adds up.
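The cost of those extra hops is easy to see with a toy benchmark (hypothetical methods, not the actual CSV internals; the work done is identical, only the number of method calls differs):

```ruby
require 'benchmark'

# A three-deep call chain vs. a single call doing the same work.
def chain_a(n); chain_b(n); end
def chain_b(n); chain_c(n); end
def chain_c(n); n + 1; end
def direct(n);  n + 1; end

n = 500_000
Benchmark.bm(8) do |bm|
  bm.report("chained") { n.times { |i| chain_a(i) } }
  bm.report("direct")  { n.times { |i| direct(i) } }
end
```

Per call the difference is tiny, but inside a per-row parsing loop it compounds quickly.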

There are many other little tricks to speed up FasterCSV. But those
two easily bring us 90% of the distance.

Just to be clear, I’m not trying to attack the standard CSV library.
It’s pretty proven and has more users than FasterCSV does. ;) All
of those calls make its interface more flexible and some prefer its
design.

I’m just trying to share what I learned in my process of speeding it up.

James Edward G. II

James Edward G. II wrote:

The series is an interesting read. Tim’s pretty focused on the
character based parsing and in my experience that’s always death in
Ruby. It’s the primary reason the standard CSV library is so slow, for
example.

Pardon my naïveté, but why isn’t the answer to make libxml, wrapped in a
Rubyonic high-level API, part of stdlib? Stable, fast, and it wouldn’t
be the first Ruby extension in stdlib.

Devin
(Not that I’m criticizing anybody for not doing this. I’m certainly not
stepping up.)

On 4/4/07, James Edward G. II [email protected] wrote:

I’ll go try to read his code now and see what else I can learn…

For the lazy: http://www.tbray.org/code/rx-yarv.tgz

On Apr 4, 2007, at 8:50 PM, Devin M. wrote:

James Edward G. II wrote:

The series is an interesting read. Tim’s pretty focused on the
character based parsing and in my experience that’s always death
in Ruby. It’s the primary reason the standard CSV library is so
slow, for example.

Pardon my naïveté, but why isn’t the answer to make libxml, wrapped
in a Rubyonic high-level API, part of stdlib? Stable, fast, and it
wouldn’t be the first Ruby extension in stdlib.

Apple is taking this road. They needed XML for some of their
projects and REXML didn’t meet their needs. They plan to bundle
libxml, with Ruby bindings, in Leopard to get around this.

I think Matz tries to weigh changes to the standard library very
carefully, as they can break a lot of code. It’s a tough balance to
strike, for sure.

James Edward G. II

Peter S. [email protected] wrote:

Any other suggestions?

http://code.google.com/p/roxi/

Don’t know how fast it is compared to libxml though

On 5-Apr-07, at 10:40 AM, James Edward G. II wrote:

Is it legal for a well behaved XML processor to expose character
data to an application in an encoding other than the actual
document encoding? I didn’t see anything in the specification to
suggest it wasn’t.

This is routine; character encoding shouldn’t be the application’s
problem.



Bob H. – tumblelog at <http://www.recursive.ca/so/>
Recursive Design Inc. – http://www.recursive.ca/
xampl for Ruby – http://rubyforge.org/projects/xampl/

On Apr 4, 2007, at 7:45 PM, James Edward G. II wrote:

http://www.tbray.org/ongoing/When/200x/2006/11/09/Optimizing-Ruby

He says it’s because Ruby’s regex engine isn’t really up to the
task of handling non-UTF-8 input. I’m pretty sure I understand why
that is, but he also basically admits that at least the lexing
stage of XML reading is just looking for < and &. I guess the
problem becomes that a UTF-16 document actually encodes that as two
bytes? Well, surely the regex could be adapted to handle that.

Or, another thought, introduce an Iconv filter that normalizes the
input to UTF-8. This probably degrades the performance against non-
UTF-8 documents, but Tim’s code had trouble in that area too.
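The normalization step suggested here, sketched with modern Ruby's String#encode (Iconv has since left the standard library; in 2007 the equivalent would have been an Iconv.conv call):

```ruby
# Normalize whatever encoding the document arrived in to UTF-8 before
# the lexer ever sees it, so the regexes only deal with one encoding.
def normalize(doc)
  doc.encode("UTF-8")
end

utf16 = "<doc>caf\u00e9</doc>".encode("UTF-16LE")
utf8  = normalize(utf16)
puts utf8.encoding  # => UTF-8
puts utf8           # => <doc>café</doc>
```

The transcoding pass costs something up front, but it keeps the hot lexing loop encoding-agnostic.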

Is it legal for a well behaved XML processor to expose character data
to an application in an encoding other than the actual document
encoding? I didn’t see anything in the specification to suggest it
wasn’t.

James Edward G. II