Search large XML file -- REXML slower than a slug, regex instantaneous

Got a question hopefully someone can answer -

I am working on functionality to match on certain nodes of a largish (65 MB)
XML file. I implemented this with REXML and was two minutes and counting
before I killed the process. After this, I just opened the console, loaded
the file into a string, and did a regex search for my data; the result was
almost instantaneous.
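
For reference, a minimal sketch of that console experiment. The file name is a
stand-in, last_name and first_name are assumed to be already set, and the
record/last/first element names are taken from the XPath quoted later in the
thread:

doc_xml = File.read("epls.xml")  # hypothetical file name
# Grab every <record>...</record> block, then test the two fields literally.
possible = doc_xml.scan(%r{<record>.*?</record>}m).select do |rec|
  rec.include?("<last>#{last_name}</last>") &&
    rec.include?("<first>#{first_name}</first>")
end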

The question is, if I can get away with it, am I better off just going the
regex route, or is it really worth my while to investigate a faster XML
parser? (I know REXML is notorious for being slow, but given how fast it was
to call a regex on the file, I am thinking that this will still be faster
than any parser.)

Any comments or suggestions appreciated.

David

David K. wrote:

Got a question hopefully someone can answer -

I am working on functionality to match on certain nodes of a largish (65 MB)
XML file. I implemented this with REXML and was two minutes and counting
before I killed the process. After this, I just opened the console, loaded
the file into a string, and did a regex search for my data; the result was
almost instantaneous.

The question is, if I can get away with it, am I better off just going the
regex route, or is it really worth my while to investigate a faster XML
parser? (I know REXML is notorious for being slow,

Then why the heck are you even bringing it up in this situation? I
think Nokogiri is supposed to be much faster.

but given how fast it was to call a regex on the file, I am thinking that
this will still be faster than any parser.)

Who cares how fast it is if it’s inaccurate? Regular expressions are
the wrong tool for parsing XML, because they can’t cope easily (or at
all) with lots of valid XML constructs. If you’re parsing XML, use an
actual XML parser, or you risk serious errors.
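
To make that concrete: here are a few equivalent, perfectly valid ways XML
can encode the same record, and a literal pattern only matches the first.
These strings are illustrative, not taken from the actual feed:

rec1 = "<record><last>Smith</last></record>"
rec2 = "<record><last >Smith</last ></record>"             # whitespace inside tags
rec3 = "<record><last>Sm&#105;th</last></record>"          # character entity for "i"
rec4 = "<record><last><![CDATA[Smith]]></last></record>"   # CDATA section

[rec1, rec2, rec3, rec4].map { |rec| rec.include?("<last>Smith</last>") }
# => [true, false, false, false] -- an XML parser reads "Smith" from all four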

Any comments or suggestions appreciated.

David

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Actually I and my client care how fast, even if it means more work and tests
to hedge accuracy. I did try Nokogiri, which I liked getting to know, but it
also plods in at ~150 seconds, which is just unacceptable for someone waiting
at a browser. That’s what I was trying to get at with my original post, and I
should have provided more data, i.e. am I wasting time with unrealistic
expectations for any XML parser in this endeavor?

Unless anyone can point out a more efficient search (code and example XML
below), it seems practical, in the absence of other ideas, to go the regex
route, at least to triangulate the data before throwing it to an XML parser
to get the details or putting the data into a DB (which I am trying to
avoid).

Below, the second line is what takes forever, understandably.

gsa_epls_xml_doc = Nokogiri::XML(doc_xml)  # XML parser; Nokogiri::HTML is for HTML documents
gsa_epls_xml_doc.xpath("//records/record[last='#{last_name}' and first='#{first_name}']").each do |possible_match_record| ...

File structure - with a lot (65 MB) of nodes. The sample record’s markup was
stripped when this was posted; only its field values survive:

Vr A C Individual Reciprocal R 11576 NY 22-Apr-2004 Indef. Z2 OPM 19-Feb-2004 Indef. Z1 HHS . . . n
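
A hypothetical reconstruction of the shape of one record, inferred only from
the XPath above; the real EPLS element names and nesting may differ:

<records>
  <record>
    <last>...</last>
    <first>...</first>
    <!-- remaining fields: classification, agency, action dates, etc. -->
  </record>
  <!-- ...many more record nodes... -->
</records>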

On Thu, Aug 5, 2010 at 11:55 AM, Marnen Laibow-Koser

Please quote when replying. It is very hard to follow the discussion if
you don’t.

David K. wrote:

Actually I and my client care how fast, even if it means more work and tests
to hedge accuracy.

And by the time you do that extra work for correctness, you will have
developed a system equivalent to REXML or Nokogiri, and likely with
similar or worse performance. You’re fighting a losing battle here.

I did try Nokogiri, which I liked getting to know, but it also plods in at
~150 seconds, which is just unacceptable for someone waiting at a browser.

Waiting at a browser? Let me get this straight – your app is trying to
process a 65MB file in real time? That’s insane. Do some of the
processing in advance, or tell the user that he can expect a 2-minute
wait (which is absolutely reasonable for that much data).

That’s what I was trying to get at with my original post, and I should have
provided more data, i.e. am I wasting time with unrealistic expectations for
any XML parser in this endeavor?

Unless anyone can point out a more efficient search (code and example XML
below), it seems practical, in the absence of other ideas, to go the regex
route, at least to triangulate the data before throwing it to an XML parser
to get the details or putting the data into a DB (which I am trying to
avoid).

Why are you trying to avoid putting the data into a DB? Databases are
designed for quick searches through lots of data – in other words,
exactly what you are doing. XML really is not. (You could try eXistDB,
though.)
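
As a sketch of that suggestion, assuming the sqlite3 gem (table layout and
file names are hypothetical): import the records once, index the two name
columns, and the per-request search becomes a single indexed query.

require 'sqlite3'

db = SQLite3::Database.new('epls.db')
db.execute('CREATE TABLE IF NOT EXISTS records (last TEXT, first TEXT, raw TEXT)')
db.execute('CREATE INDEX IF NOT EXISTS idx_records_name ON records (last, first)')

# One-time import: however the records get extracted, insert them here.
# db.execute('INSERT INTO records (last, first, raw) VALUES (?, ?, ?)', [last, first, raw])

# At request time, while the user waits at the browser:
rows = db.execute('SELECT raw FROM records WHERE last = ? AND first = ?',
                  [last_name, first_name])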

Below, the second line is what takes forever, understandably.

gsa_epls_xml_doc = Nokogiri::XML(doc_xml)
gsa_epls_xml_doc.xpath("//records/record[last='#{last_name}' and first='#{first_name}']").each do |possible_match_record| ...

I’m assuming gsa is Google Search Appliance. Can’t it do the searching
itself and give you back only the records you need?

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Jeffrey L. Taylor wrote:

Quoting David K. [email protected]:

parser (I know REXML is notorious for being slow, but given how fast it was
to call a regex on the file, I am thinking that this will still be faster
than any parser).

Look at using LibXML::XML::Reader

http://libxml.rubyforge.org/rdoc/index.html

What most XML parsing libraries do is read the entire XML file into memory,
probably storing the raw text, parsing it, and creating an even bigger data
structure for the whole file, then searching over it. Nokogiri at least does
some of the searching in C instead of Ruby (it uses libxml2).

With LibXML::XML::Reader it is possible (with some not very pretty code) to
make one pass through the XML file, parsing as you go, and create data
structures for just the information of interest. Enormously faster.
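
A rough sketch of that one-pass style, assuming the libxml-ruby gem, a
hypothetical file name, and the record/last/first element names carried over
from the XPath earlier in the thread (last_name and first_name as before):

require 'libxml'
include LibXML

matches = []
record  = nil   # fields of the <record> currently being read
field   = nil   # name of the text field currently being read

reader = XML::Reader.file('epls.xml')
while reader.read
  case reader.node_type
  when XML::Reader::TYPE_ELEMENT
    record = {}          if reader.name == 'record'
    field  = reader.name if %w[last first].include?(reader.name)
  when XML::Reader::TYPE_TEXT
    record[field] = reader.value if record && field
  when XML::Reader::TYPE_END_ELEMENT
    field = nil
    if reader.name == 'record'
      matches << record if record['last'] == last_name &&
                           record['first'] == first_name
      record = nil
    end
  end
end
reader.close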

Interesting; that seems worth knowing about. But wouldn’t Reader still
have to create a DOM tree to do the searching in the first place?

HTH,
Jeffrey

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

On Aug 5, 9:41 pm, Marnen Laibow-Koser [email protected] wrote:

Interesting; that seems worth knowing about. But wouldn’t Reader still
have to create a DOM tree to do the searching in the first place?

Not necessarily - that’s essentially the difference between a SAX-type parse
and a document-based one.

Fred
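
For comparison, the same event-driven idea with the SAX interface that ships
with Nokogiri; the handler and the record/last/first element names are
assumptions, as above:

require 'nokogiri'

class RecordHandler < Nokogiri::XML::SAX::Document
  attr_reader :matches

  def initialize(last_name, first_name)
    @last_name, @first_name = last_name, first_name
    @matches = []
  end

  def start_element(name, attrs = [])
    @record = {}   if name == 'record'
    @field  = name if %w[last first].include?(name)
  end

  def characters(string)
    # characters may arrive in chunks, so append rather than assign
    (@record[@field] ||= +'') << string if @record && @field
  end

  def end_element(name)
    @field = nil
    if name == 'record'
      @matches << @record if @record['last'] == @last_name &&
                             @record['first'] == @first_name
      @record = nil
    end
  end
end

handler = RecordHandler.new(last_name, first_name)
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('epls.xml'))
handler.matches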

Frederick C. wrote:

On Aug 5, 9:41 pm, Marnen Laibow-Koser [email protected] wrote:

Interesting; that seems worth knowing about. But wouldn’t Reader still
have to create a DOM tree to do the searching in the first place?

Not necessarily - that’s essentially the difference between a SAX-type parse
and a document-based one.

:-P I used to know that, back when I actually worked regularly with XML.
Thanks for the reminder.

Fred

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Sent from my iPhone
