Keith F. wrote:
libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML).
Sorry, I did not express myself clearly. I definitely need a DOM-based
approach, but REXML is a lot slower than libxml, and libxml can be a
PITA to install on some platforms/distros (e.g. it took quite some time
on my Ubuntu box, because neither gem install nor apt-get would
install the newest version, which I needed).
The catch is that I would like to use this in my web scraping framework,
scRUBYt! - and of course a dependency on libxml would mean that everybody
who wants to install scRUBYt! would have to install libxml too. I
got tons of support requests from Ubuntu users who had problems
installing mechanize (it depends on libssl-ruby there),
so I guess this number would be much higher in the case of libxml, which
has much funkier dependencies.
If there is no better option, I will go with libxml despite this
(it is my only concern, otherwise libxml is fine) - but it would be
better to have something install-friendly…
What sort of “real” XPaths do you need? XPath 1.0? 2.0?
Real in the sense that it is not Hpricot XPath, which ATM cannot even
do
/my/stuff/is/@cool
not to mention more complex expressions.
I guess XPath 1.0 would be entirely sufficient (maybe even Hpricot’s, with
a few additions) - I really don’t need anything complicated.
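To illustrate the kind of query I mean, here is a minimal sketch using REXML from the standard library (slow, but it does handle attribute steps). The sample document and the variable names are made up purely for illustration; only the path /my/stuff/is/@cool comes from the example above.

```ruby
require 'rexml/document'

# Hypothetical sample document, shaped to match the /my/stuff/is/@cool path.
xml = <<~XML
  <my>
    <stuff>
      <is cool="yes">first</is>
      <is cool="no">second</is>
    </stuff>
  </my>
XML

doc = REXML::Document.new(xml)

# XPath.match returns an array of matching nodes; for an attribute step
# these are REXML::Attribute objects, so we read their values.
cool_values = REXML::XPath.match(doc, '/my/stuff/is/@cool').map(&:value)

# A slightly richer XPath 1.0 expression: a predicate on the attribute.
cool_texts = REXML::XPath.match(doc, '/my/stuff/is[@cool="yes"]').map(&:text)

puts cool_values.inspect
puts cool_texts.inspect
```

Attribute steps and simple predicates like these are about as far as my needs go.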
Deep-lookahead/behind? Do you have huge source documents?
Well, I am actually building this document myself from what I have
scraped, so I have pretty much full control over it (if it is too big, I just
say stop and put the remaining records into a new doc etc.), so this is not
really the problem.
I really just need a fast XML parser which is easy to install, that’s
all. scRUBYt! is a high-level framework, aimed also at non-programmers,
so I cannot expect all my potential users to be handy with Debian’s
package policy and the joys of installing libxml on win32.
Cheers,
Peter
_
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby