Stream Parsing with REXML

lasastard · January 12, 2008, 5:58pm

Hi (again, sort of)

I am still on my quest to write a program that parses a large XML file.
After having tried to do it in tree mode, I had to realize that the
performance was simply abysmal. So back to the drawing board. But, and
here is the thing…I could find a good straight-forward tutorial on how
to write a stream parser using REXML. The official tutorial is pretty
much mute on that part and the only other example I found (or rather was
pointed to -
http://www.janvereecken.com/2007/4/11/event-driven-xml-parser-in-ruby)
was way too complex for someone like me who is still pretty much a
beginner in ruby.

So, what I am looking for is either a brief description of how to write
an event driven parser or else a link to a good and simple tutorial.

For the former, this is what the parser should do:

Find the element “Gene-ref”, allow me to access its children and then
close and repeat for the next "Gene-ref entry. In xml code, that would
look like

... ... ... ...

I understand that I need a Listener class, like
classListener
def tag_start(name, attrs)
end
def text(text)
end
def tag_end(name)
end
end

But I havent really worked with classes all that much and maybe someone
could just put down the basics for the script from where I can start
experimenting? Would be very much appreciated. Let’s say for each
element “Gene-ref” I want to puts the name, start and end in one line,
or something along those lines.

Cheers,

Marc

lasastard · January 12, 2008, 7:56pm

I have found that hpricot works very well with large xml files. It
streams it automatically. I use it to parse a 5M xml file and it does
it rather quickly.

http://code.whytheluckystiff.net/doc/hpricot/
http://code.whytheluckystiff.net/hpricot/

lasastard · January 13, 2008, 4:51am

Hi Marc and Dusty,

On 12-Jan-08, at 1:55 PM, dusty wrote:

I have found that hpricot works very well with large xml files. It
streams it automatically. I use it to parse a 5M xml file and it does
it rather quickly.

Dusty, you sure hpricot is streaming?

At any rate, it is a lot faster than any pure Ruby XML parser. Marc,
if performance is your only problem hpricot is worth looking at. It’ll
only take a few seconds to know, it’s something like 4 lines of Ruby
to find out:

require ‘rubygems’
require ‘hpricot’

doc = open() do |f|
Hpricot.XML(f)
end

If that works well enough you’re probably set. Hpricot’s search and
query stuff is pretty nice.

If you still have problems, let us know. If you really need streaming
you have a few options.

http://code.whytheluckystiff.net/doc/hpricot/
http://code.whytheluckystiff.net/hpricot/

Cheers,
Bob

Bob H. – tumblelog at
http://www.recursive.ca/so/
Recursive Design Inc. – weblog at
http://www.recursive.ca/hutch
http://www.recursive.ca/ – works on
http://www.raconteur.info/cms-for-static-content/home/

lasastard · January 13, 2008, 6:40am

On Jan 12, 9:58 am, Marc H. [email protected] wrote:

...

Here’s a more targeted example, since I’m feeling helpful:

class Gene
attr_accessor :name, :start, :end
def initialize( name=nil, start=nil, endd=nil )
@name = name
self.start = start
self.end = endd
end
def start=( start )
@start = start.to_i
end
def end=( endd )
@end = endd.to_i
end
def duration
@end - @start
end
def to_s
"<Gene ‘#{@name}’ #{@start}-#{@end} (#{duration})> "
end
end

class GeneParser
GENE_TAG_NAME = “Gene-ref”
def tag_start( name, attributes )
case name
when “root”
# ignore
when GENE_TAG_NAME
@current_gene = Gene.new
else
@current_property = name
end
end

def text( str )
if @current_gene && @current_property
@current_gene.send( @current_property.to_s + “=”, str )
end
end

def tag_end( name )
if name == GENE_TAG_NAME
puts @current_gene
@current_gene = nil
else
@current_property = nil
end
end
end

require ‘rexml/document’
REXML::Document.parse_stream( DATA, GeneParser.new )
#=> OUTPUT:
#=> <Gene ‘foo’ 17-42 (25)>
#=> <Gene ‘bar’ 43-50 (7)>

END

foo
17
42

bar
43
50

lasastard · January 13, 2008, 1:37pm

Thanks for the great responses!

Just for clarification though:

tag_start identifies an element like “gene” ala

def tag_start(name, attrib)
if name==“gene”
do something here
end
end

So tag_end is then used if I want to puts everything that was done with
the element and its children like storing some values in an array or
something?

/Marc

lasastard · January 13, 2008, 6:26am

On Jan 12, 9:58 am, Marc H. [email protected] wrote:

beginner in ruby.

So, what I am looking for is either a brief description of how to write
an event driven parser or else a link to a good and simple tutorial.

xml = <<ENDXML
oh hai

I ate <raw> sausages&tm;!

<![CDATA[ I ate sausages&tm;! ]]>

whoa!
ENDXML

class Listener
def initialize
# Use instance variables to mantain information
# between different callbacks.
@nest_level = 0
end

def tag_start( name, attr_hash )
@nest_level += 1
print " " * @nest_level
puts “<#{name} …> #{attr_hash.inspect}”
end

def text( str )
print " " * @nest_level
p str
end

Treat CDATA sections just like text

alias_method :cdata, :text

def tag_end( name )
print " " * @nest_level
puts “</#{name}>”
@nest_level -= 1
end
end

require ‘rexml/document’
require ‘stringio’ # for this demo
stream_handler = Listener.new
pseudo_file = StringIO.new( xml )
REXML::Document.parse_stream( pseudo_file, stream_handler )

#=> OUTPUT:
#=> “oh hai\n”
#=> <root …> {}
#=> "\n "
#=> <kid1 …> {“jim”=>“jam”, “foo”=>“bar”}
#=>
#=> "\n "
#=> <kid2 …> {“foo”=>“sausage”}
#=> "\n "
#=> <grandkid1 …> {}
#=> “I ate sausages&tm;!”
#=>
#=> "\n "
#=> <grandkid2 …> {}
#=> "\n "
#=> "\n I ate sausages&tm;!\n "
#=> "\n "
#=>
#=> "\n "
#=>
#=> “\n”
#=>
#=> “\nwhoa!\n”

lasastard · January 13, 2008, 7:05pm

On 13.01.2008 13:37, Marc H. wrote:

end

So tag_end is then used if I want to puts everything that was done with
the element and its children like storing some values in an array or
something?

Yeah, you could do that. Basically it’s completely up to you. The
parser just hands off events for parsed XML items (like starting tags,
closing tags, text data etc.) in the order found in the document.

I have attached another example of an idiom that I use frequently. This
may not be needed in your case but who knows? Basic idea is that the
event listener creates nested listeners (one per element) and hands
processing off to them while maintaining a stack of listeners. That way
you can do different processing based on element name, attributes or
whatever. Might be overkill for your simple example but OTOH if you do
need to do complex processing steps based on elements this might be
exactly what you need. For example, you can store any information you
need in a nested listener and do all the processing in end tag.

Kind regards

robert

lasastard · January 13, 2008, 8:55pm

Hi Marc, Robert:

On 13-Jan-08, at 1:04 PM, Robert K. wrote:

I have attached another example of an idiom that I use frequently.
This may not be needed in your case but who knows? Basic idea is
that the event listener creates nested listeners (one per element)
and hands processing off to them while maintaining a stack of
listeners. That way you can do different processing based on
element name, attributes or whatever. Might be overkill for your
simple example but OTOH if you do need to do complex processing
steps based on elements this might be exactly what you need. For
example, you can store any information you need in a nested listener
and do all the processing in end tag.

That idiom is really frequent in the stuff that I do with XML. It also
happens to be a case where a pull parser can make things clearer. Pull
parsers are event based like SAX parsers are, they do not build up
trees in memory, and they are very fast. The code below is almost
equivalent to yours Robert, as far as I could make it anyway. The
parents method is different, and I habitually add a node level
corresponding to the document since there are things (comments,
processing instructions, etc) that are allowed outside of the root
element of an XML file.

I wrote that xampl-pp parser in May 2002. It is on source forge
<xampl: XML pull parsers download | SourceForge.net

but not as a gem. Hmm, maybe I should do something about that, I’ve
made some fixes to very obscure problems over the years, but never
released anything. Anyway the point isn’t xampl-pp so much as what a
pull parser looks like. Many people don’t know about them and if they
do they often don’t know why they are useful. I think the libxml2
library does pull parsing (and that’ll be very fast) and I think REXML
does too (it has a few problems with namespaces last I looked).
Anyway…

All the interesting stuff happens in the work method.

#!/bin/env ruby

require “xampl-pp”

class Thing

attr_accessor :tag_name
attr_accessor :parent
attr_accessor :text

def initialize(parent, name)
@parent = parent
@tag_name = name
@text = “”
end

def Thing.start(xml)
pp = Xampl_PP.new
pp.input = xml

 Thing.new(nil, 'doc').work(pp)

end

def parents
p = self
while p
yield p
p = p.parent
end
end

def work(pp)
parents do |par|
print par.tag_name, " "
end
puts

 while not pp.endDocument? do
   case pp.nextEvent
     when Xampl_PP::START_ELEMENT
       Thing.new(self, pp.name).work(pp)
     when Xampl_PP::END_ELEMENT
       print "  ", text.inspect, "\n"
       return
     when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION ||

Xampl_PP::ENTITY_REF then
text << pp.text
end
end
end
end

Thing.start(DATA)

END

foo
17
42

bar
43
50

Bob H. – tumblelog at
http://www.recursive.ca/so/
Recursive Design Inc. – weblog at
http://www.recursive.ca/hutch
http://www.recursive.ca/ – works on
http://www.raconteur.info/cms-for-static-content/home/

lasastard · January 14, 2008, 1:07pm

Ok, I am still on this and am starting to feel a bit stupid…I guess
there is something wrong with the following example, but maybe I should
step back for a while. Anyhow - what’s missing to make it work:

The xml file:

Gene1 1 10 Gene2 11 20 Gene3 21 30 Gene4 31 40 Gene5 41 50

The code:

data = File.new(ARGV.shift)

class Listener

def tag_start( name, attr_hash )
if name == ‘Gene-ref’
@name = name
end
end

def text( str )

end

def tag_end( name )
if name == ‘Gene-ref’
puts @name
@name = nil
end
end
end

require ‘rexml/document’
REXML::Document.parse_stream( data, Listener.new )

lasastard · January 14, 2008, 8:51am

On 13.01.2008 20:54, Bob H. wrote:

simple example but OTOH if you do need to do complex processing
steps based on elements this might be exactly what you need. For
example, you can store any information you need in a nested listener
and do all the processing in end tag.

That idiom is really frequent in the stuff that I do with XML. It also
happens to be a case where a pull parser can make things clearer. Pull
parsers are event based like SAX parsers are, they do not build up
trees in memory, and they are very fast. The code below is almost
equivalent to yours Robert, as far as I could make it anyway. The
parents method is different,

I considered that as well but wanted to be able to traverse parent nodes
top down for the example. After all it was just a quick hack. For
more robustness and library fitness I would certainly change a few
things.

and I habitually add a node level
corresponding to the document since there are things (comments,
processing instructions, etc) that are allowed outside of the root
element of an XML file.

If you look closely that happens in my version as well (see
StreamListener#initialize and #tag_start).

All the interesting stuff happens in the work method.
 Thing.new(nil, 'doc').work(pp)
def work(pp)
print " ", text.inspect, “\n”
return
when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION ||
Xampl_PP::ENTITY_REF then
text << pp.text
end
end
end
end

Thing.start(DATA)

Yep, looks pretty similar. To be honest I never used an XML pull parser
myself. Personally I prefer the push parser a bit because it avoids the
looping and decision logic based on event and element type (in #work).
Are there major advantages of pull parsers over push parsers that I have
overlooked so far?

Kind regards

robert

lasastard · January 14, 2008, 1:45pm

Hi Robert,

On 14-Jan-08, at 2:50 AM, Robert K. wrote:

Yep, looks pretty similar. To be honest I never used an XML pull
parser myself. Personally I prefer the push parser a bit because it
avoids the looping and decision logic based on event and element
type (in #work). Are there major advantages of pull parsers over
push parsers that I have overlooked so far?

I use both and I’m comfortable with both. The combination of sax and
pull gives you a much more complete toolkit for streaming XML.

The biggest advantage (for me) to a pull parser is when you are
parsing an XML file and using those events to drive some kind of
algorithm. Either the parser or the algorithm is going to have to be
able to suspend processing (and I mean save state by that) and resume
when the other needs it to. This can be really hard to do for some
algorithms, in the case of some it isn’t possible without
fundamentally changing the algorithm. If you read about pull parsers
they will talk about where the flow of control (the ‘outermost’ loop)
is in the program: in the parser or in the application. With sax the
parser controls the flow, with pull parsers the application does. A
subset of that situation is that the pull parser is just an object
that you can pass around, and that can be really handy. This also
means you can have more than one pull parser operating at a time, and
that’s incredibly confusing with sax.

You can fake a pull parser by putting a sax parser in a different
thread and streaming the events over some kind of channel with some
kind of blocking protocol. Some implementations of pull parsers do
exactly that, and that is slow, yet if you need it people will put
up with it. If you’ve ever had the temptation to do that then you
might want to think pull parser.

It is pretty much trivial to write a sax parser based on a pull parser
at very small cost (xampl-pp ships with an example that does that). So
one trick that you can play is switching back and forth between sax
and pull where appropriate.

Junior programmers have a much easier time with pull parsers. It seems
to provide a gentler route to sax based software. XML is a big enough
step. There is some advantage to avoiding the need for a beginner to
learn both XML and event based programming at the same time.

Historically pull parsers tend to be very fast compared to sax
parsers. This is not going to last, there’s no technical reason for
such a difference. In the Java world the speed difference is eroding
for the good of all. Anyway, speed isn’t the issue it used to be. If
you poke around the web you’ll still see a bunch of references to
performance advantages, this doesn’t apply anymore if you choose your
parser correctly.

Cheers,
Bob

Bob H. – tumblelog at
http://www.recursive.ca/so/
Recursive Design Inc. – weblog at
http://www.recursive.ca/hutch
http://www.recursive.ca/ – works on
http://www.raconteur.info/cms-for-static-content/home/

lasastard · January 14, 2008, 1:24pm

2008/1/14, Marc H. [email protected]:

The code:
def text( str )

You need to place the @name storing code here. But to make it work
you need to remember the current tag’s name because you want to only
store “str” in @name if you are in . This makes it necessary
that you add a stack to your listener class which gets pushed and
popped whenever tag_start and tag_end are invoked.

require ‘rexml/document’
REXML::Document.parse_stream( data, Listener.new )

Cheers

robert

lasastard · January 14, 2008, 2:18pm

2008/1/14, Bob H. [email protected]:

On 14-Jan-08, at 2:50 AM, Robert K. wrote:

Yep, looks pretty similar. To be honest I never used an XML pull
parser myself. Personally I prefer the push parser a bit because it
avoids the looping and decision logic based on event and element
type (in #work). Are there major advantages of pull parsers over
push parsers that I have overlooked so far?

I use both and I’m comfortable with both. The combination of sax and
pull gives you a much more complete toolkit for streaming XML.

I read that in the light of your following remarks. Other than that I
do not see a difference with regard to the XML parsed, i.e. both
approaches are equally powerful.

The biggest advantage (for me) to a pull parser is when you are
parsing an XML file and using those events to drive some kind of
algorithm. Either the parser or the algorithm is going to have to be
able to suspend processing (and I mean save state by that) and resume
when the other needs it to. This can be really hard to do for some
algorithms, in the case of some it isn’t possible without
fundamentally changing the algorithm. If you read about pull parsers
they will talk about where the flow of control (the ‘outermost’ loop)
is in the program: in the parser or in the application. With sax the
parser controls the flow, with pull parsers the application does.

I see, that’s definitively a situation I was not aware of. I guess I
just haven’t been in a situation where I needed to employ such
algorithms. And yes, the instance in control (parser vs. application
code) seems to be the biggest difference. I can see how this makes a
difference for some algorithms.

A
subset of that situation is that the pull parser is just an object
that you can pass around, and that can be really handy. This also
means you can have more than one pull parser operating at a time, and
that’s incredibly confusing with sax.

and pull where appropriate.
Yep. Basically a push parser can be implemented as thin wrapper
around a pull parser.

Junior programmers have a much easier time with pull parsers. It seems
to provide a gentler route to sax based software. XML is a big enough
step. There is some advantage to avoiding the need for a beginner to
learn both XML and event based programming at the same time.

I haven’t thought of that either but sounds like a valid point.

Historically pull parsers tend to be very fast compared to sax
parsers. This is not going to last, there’s no technical reason for
such a difference. In the Java world the speed difference is eroding
for the good of all. Anyway, speed isn’t the issue it used to be. If
you poke around the web you’ll still see a bunch of references to
performance advantages, this doesn’t apply anymore if you choose your
parser correctly.

Bob, thanks for taking the time and put together this elaborate
explanation!

Kind regards

robert