Keith F. wrote:
libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML).
Sorry, I did not express myself clearly. I definitely need a DOM-based
approach, but REXML is a lot slower than libxml, and libxml can be a
PITA to install on some platforms/distros (e.g. it took quite some time
on my Ubuntu box, because neither gem install nor apt-get would install
the newest version, which I needed).
The catch is that I would like to use this in my web scraping framework,
scRUBYt! - and of course a dependency on libxml would mean that everybody
who wants to install scRUBYt! would have to install libxml too. I have
received tons of support requests from Ubuntu users who had problems
installing mechanize (it depends on libssl-ruby there), so I guess this
number would be much higher in the case of libxml, which has much
funkier dependencies.
If there is no better possibility, I will go with libxml despite this
(this is my only concern, otherwise libxml is fine) - but it would be
better to have something install-friendly…
What sort of “real” XPaths do you need? XPath 1.0? 2.0?
Real in the sense that it is not Hpricot XPath, which ATM can not even
not to talk about more complex expressions.
I guess XPath 1.0 would be completely enough (maybe even Hpricot’s, with
a few additions) - I really don’t need anything complicated.
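To make "XPath 1.0 would be enough" a bit more concrete, here is a
minimal sketch of the kind of queries meant. It uses REXML only because
it ships with Ruby; the sample document, its element names and the
variable names are made up for illustration and are not taken from
scRUBYt!.

```ruby
require 'rexml/document'

# Hypothetical scraped data, invented for this example.
xml = <<~XML
  <records>
    <record id="1"><title>First</title></record>
    <record id="2"><title>Second</title></record>
    <record id="3"><title>Third</title></record>
  </records>
XML

doc = REXML::Document.new(xml)

# Attribute predicate: the title of the record whose id is "2".
second_title = REXML::XPath.first(doc, "//record[@id='2']/title").text

# Positional predicate: the title of the first record.
first_title = REXML::XPath.first(doc, "/records/record[1]/title").text

# Plain descendant search: all titles, in document order.
all_titles = REXML::XPath.match(doc, "//record/title").map(&:text)

puts second_title
puts first_title
puts all_titles.inspect
```

Nothing fancier than predicates on attributes and positions is needed
for this use case.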
Deep-lookahead/behind? Do you have huge source documents?
Well, I am actually first building this document from what I have
scraped, so I have pretty much full control over it (if it gets too big,
I just stop and put the remaining records into a new doc etc.), so this
is not really the problem.
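Roughly, the "stop and start a new doc" idea could look like the sketch
below. This is only an illustration under assumptions: the limit
MAX_RECORDS_PER_DOC, the helper build_documents and the record fields
are hypothetical names, and REXML is used just because it needs no
installation.

```ruby
require 'rexml/document'

# Hypothetical cap on how many records go into one document.
MAX_RECORDS_PER_DOC = 2

# Turns an array of hashes (one hash per scraped record) into one or
# more REXML documents, each holding at most max_per_doc records.
def build_documents(records, max_per_doc = MAX_RECORDS_PER_DOC)
  records.each_slice(max_per_doc).map do |slice|
    doc = REXML::Document.new
    root = doc.add_element('records')
    slice.each do |fields|
      record = root.add_element('record')
      # One child element per field, e.g. <title>First</title>.
      fields.each { |name, value| record.add_element(name.to_s).text = value.to_s }
    end
    doc
  end
end

# Made-up sample data.
scraped = [
  { title: 'First' },
  { title: 'Second' },
  { title: 'Third' }
]

docs = build_documents(scraped)
puts docs.size
```

Each resulting document stays small, so even a slow DOM parser only
ever sees a bounded amount of data.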
I really just need a fast XML parser which is easy to install, that’s
all. scRUBYt! is a high-level framework, aimed also at non-programmers,
so I cannot expect all my potential users to be handy with Debian’s
package policy and the joys of installing libxml on win32.