One more way to parse XML

I thought I’d put this out to see if there’s any interest.

Recently I wanted to do some XML reading with Ruby, but looking
(not deeply) at REXML and other packages like xmlcodec, I couldn’t
find anything that seemed to fit the way I thought about things,
so I [of course… (:-)] wrote a little wrapper to REXML that
fit me a little better.

First, I wanted to have a ‘stream’ parser, rather than reading the
whole tree into memory and then working on it. The documents I was
interested in (XML representation of midifiles) are not very deep,
but can get very lengthy, and the processing I wanted to do was
mostly sequential.

However, the stream parsers I’ve seen, including REXML::StreamListener,
simply pass the pieces of the document in turn to the app, without any
real notion of the tree, so the app has to keep track of all that
itself.
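
To make that bookkeeping concrete, here is a minimal sketch of a plain REXML stream listener that has to maintain its own stack of open elements. The element names (song, track, note) and the class name PathTrackingListener are invented for illustration; they are not the actual midifile XML format or anything from the module discussed here.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# A bare listener: REXML reports each piece in document order and nothing
# more, so the listener keeps its own stack to know where it is in the tree.
class PathTrackingListener
  include REXML::StreamListener

  attr_reader :seen

  def initialize
    @stack = []   # names of the currently open elements
    @seen  = []   # [path, text] pairs collected along the way
  end

  def tag_start(name, _attrs)
    @stack.push(name)
  end

  def tag_end(_name)
    @stack.pop
  end

  def text(data)
    return if data.strip.empty?
    @seen << [@stack.join('/'), data.strip]
  end
end

# Hypothetical sample document, not real midifile XML.
xml = '<song><track><note>C4</note><note>E4</note></track></song>'
listener = PathTrackingListener.new
REXML::Document.parse_stream(xml, listener)
listener.seen.each { |path, text| puts "#{path}: #{text}" }
```

Every listener written this way repeats the same stack handling, which is the part a node-based scheme can take over.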

In other languages, protocols, and situations (starting with IFF on
the Amiga, I guess!) I’ve had success with what would then have been
called a “table driven” scheme. Now, it’s more a “linked object” approach:
each node of the Document Model gets a node that specifies what is to
be done when an element that it represents is encountered, and also has
a list (the ‘table’) of the possible subordinate nodes. You create
and develop these nodes before reading the document, then with a call to
the stream reader all the appropriate actions get taken as needed.

With Ruby it’s a snap to create such a node structure. I just formalised
it a bit and provided an ‘XMLStreamListener’ class to extend REXML’s
basic version, which keeps track of the node structure and dispatches
to the appropriate current one. The ‘XMLSpec’ nodes themselves have
methods to handle start, end, and empty tags and of course enclosed
text.
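
For readers who want a feel for the idea without downloading anything, the following is a rough sketch of how such a scheme could look. It is not the actual XMLSpec or XMLStreamListener code: the class names SpecNode and SpecListener, their methods, and the sample document are all assumptions made up for this illustration. Only REXML::StreamListener and REXML::Document.parse_stream are real REXML calls.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical spec node: one per element type, knowing only its own
# handlers and its immediate children.
class SpecNode
  attr_reader :name, :children

  def initialize(name)
    @name     = name
    @children = {}   # the "table": child element name => SpecNode
    @handlers = {}   # :start / :text / :end => block
  end

  # Declare a child element and yield its node so it can be configured.
  def child(name)
    node = (@children[name] ||= SpecNode.new(name))
    yield node if block_given?
    node
  end

  def on_start(&blk); @handlers[:start] = blk; self; end
  def on_text(&blk);  @handlers[:text]  = blk; self; end
  def on_end(&blk);   @handlers[:end]   = blk; self; end

  def fire(event, *args)
    @handlers[event]&.call(*args)
  end
end

# Hypothetical listener: tracks the current spec node and dispatches to it.
class SpecListener
  include REXML::StreamListener

  def initialize(root_spec)
    top = SpecNode.new('#document')
    top.children[root_spec.name] = root_spec
    @stack = [top]
  end

  def tag_start(name, attrs)
    parent = @stack.last
    node = parent && parent.children[name]   # nil for elements with no spec
    @stack.push(node)
    node&.fire(:start, attrs)
  end

  def text(data)
    @stack.last&.fire(:text, data)
  end

  def tag_end(_name)
    @stack.pop&.fire(:end)
  end
end

# Build the node structure first, then hand it to the stream reader.
notes = []
song = SpecNode.new('song')
song.child('track') do |track|
  track.child('note') do |note|
    note.on_start { |attrs| notes << { pitch: attrs['pitch'] } }
    note.on_text  { |t| notes.last[:label] = t.strip }
  end
end

xml = '<song><track><note pitch="60">C4</note><meta/><note pitch="64">E4</note></track></song>'
REXML::Document.parse_stream(xml, SpecListener.new(song))
p notes
```

In this sketch, elements with no spec (such as the empty meta tag) are pushed as nil and silently skipped along with anything nested inside them, which is one possible policy among several.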

So if anyone is interested in digging deeper, I’ve provided a web
page (with the module, example use, and downloadable archives) at

http://jwgibbs.cchem.berkeley.edu/pete/xmlstreamin/

Cheers,
– Pete –

On Sat, 2006-10-21 at 16:15 +0900, [email protected]
wrote:

I thought I’d put this out to see if there’s any interest.

This looks pretty cool. It has echoes of the Jakarta Commons Digester,
of which I made a Ruby port a while back (http://digestr.rubyforge.org),
though using libxml-ruby rather than REXML.

I quite like the digester model and have found it very handy for dealing
with certain types of XML (mostly XML from the Java world I guess).

In article [email protected],
Ross B. [email protected] wrote:

On Sat, 2006-10-21 at 16:15 +0900, [email protected]
wrote:

I thought I’d put this out to see if there’s any interest.

This looks pretty cool. It has echoes of the Jakarta Commons Digester,
of which I made a Ruby port a while back (http://digestr.rubyforge.org),
though using libxml-ruby rather than REXML.

Hmm, yes. I hadn’t come across the “digester” before, but there
do seem to be parallel trains of thought there. (I looked through the
Jakarta version rather than yours – finding a magazine article to
read is more comfortable than chugging through documentation!)
Looks nice (and much more extensive than mine, of course).

The main difference (in philosophy) seems to be that the digester
describes the tree with complete absolute paths for each node, where
my scheme has each node only knowing about its immediate descendants.
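
A toy illustration of the two styles, using plain Ruby hashes rather than either library's real API (the rule tables, the action strings, and both lookup methods are invented for this sketch):

```ruby
# Style 1 (path-based, roughly as the digester is described above):
# every rule is keyed by the complete path from the root.
path_rules = {
  'song/track'      => 'start a track',
  'song/track/note' => 'record a note'
}

# Style 2 (node-based): each node lists only its immediate children.
tree_rules = {
  'song' => { children: {
    'track' => { action: 'start a track', children: {
      'note' => { action: 'record a note', children: {} }
    } }
  } }
}

# Look up an action by absolute path.
def lookup_by_path(rules, open_elements)
  rules[open_elements.join('/')]
end

# Find the same action by stepping down one level at a time.
def lookup_by_descent(rules, open_elements)
  node = { children: rules }
  open_elements.each do |name|
    node = node[:children][name]
    return nil unless node
  end
  node[:action]
end

stack = %w[song track note]
puts lookup_by_path(path_rules, stack)
puts lookup_by_descent(tree_rules, stack)
```

With the node-based table, moving the track subtree under a different parent means reattaching one hash; with the path-based table, every key below that point has to be rewritten.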

I quite like the digester model and have found it very handy for dealing
with certain types of XML (mostly XML from the Java world I guess).

The article I read did seem to be oriented to building a tree (of Beans)
in memory (so aren’t we sort of back to DOM?) but I gather that you can
provide other custom methods to do other kinds of processing.

Thanks,
– Pete –

On 10/21/06, [email protected]
[email protected] wrote:

whole tree into memory and then working on it. The documents I was
interested in (XML representation of midifiles) are not very deep,
but can get very lengthy, and the processing I wanted to do was
mostly sequential.
[…]

Did you take a look at magic/xml [ http://zabor.org/jrpg/magic_xml/ ] ?

magic/xml provides a very nice interface for xml streams.
It doesn’t need any callbacks, subclassing or any ugly things,
all you need is a single block.

The block gets incomplete nodes.
It can call node.complete! to read the whole subtree there,
or simply let the processing move on to the first child.

For example to process Wikipedia database dump, which looks like this:

  <page><title>Foo</title><id>435</id></page>
  <page><title>Bar</title><id>754</id></page>

and extract titles and ids, you can use this code:

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete! # Read all children of … node
  t = node[:@title] # :@title is a child
  i = node[:@id] # :@id is another child
  print "#{i}: #{t}\n"
}

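The block first gets called with the root node of the document.
As it does not complete! that node, the next node passed to the block is
the first <page>. complete! fills the node:
<page><title>Foo</title><id>435</id></page>
Then we can use all convenient tree-based methods.
As the children of <page> were already read, the next node is the second
<page>, which is completed to <page><title>Bar</title><id>754</id></page>,
and so on.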

There is a tutorial [ http://zabor.org/jrpg/magic_xml/tutorial.html ]
and a collection of solutions to W3C XQuery use cases
[ http://zabor.org/jrpg/magic_xml/xquery_use_cases.html ]

In article
[email protected],
Tomasz W. [email protected] wrote:

Did you take a look at magic/xml [ http://zabor.org/jrpg/magic_xml/ ] ?

Yes, I took a brief look, but it didn’t immediately seem to be what
I wanted.

magic/xml provides a very nice interface for xml streams.
It doesn’t need any callbacks, subclassing or any ugly things,
all you need is a single block.

I’m sure it would do the job, but I guess again it’s a matter of
‘philosophy’. I knew pretty much what I wanted to do and, because
I’ve used that approach before (in C++ for instance), it was actually
easier for me to write a mechanism to do it than try to figure out
somebody else’s way of thinking about things… (:-))

A gentle [though perhaps a little blunt (:-/)] suggestion: include
a README in your package! One reason I didn’t probe very deeply
was that I couldn’t find any “easy road in”. RDOC is fine for checking
up on the details of an API, but in my experience it’s almost useless
as a ‘road map’. (This has been a problem for me with Ruby generally.
The “User’s Guide” is nice, but there’s a big gap between that and
the “Documentation”. I suppose I shouldn’t be so stingy, and buy the
“Pickaxe” or something. (:-/)

Cheers,
– Pete –

On Sun, 2006-10-22 at 04:30 +0900, [email protected]
wrote:

Hmm, yes. I hadn’t come across the “digester” before, but there
do seem to be parallel trains of thought there. (I looked through the
Jakarta version rather than yours – finding a magazine article to
read is more comfortable than chugging through documentation!)
Looks nice (and much more extensive than mine, of course).

Yes, it is pretty useful in some cases. The Ruby version is rather
trimmed down by the standards of the Java one, partly because Ruby gets
more done with less code, and partly because I didn’t need everything
when I made the port :)

The main difference (in philosophy) seems to be that the digester
describes the tree with complete absolute paths for each node, where
my scheme has each node only knowing about its immediate descendants.

Ahh, I see. That’s an interesting strategy (certainly would be easier to
get on with, esp under refactoring which can be a nightmare). I’ll have
to have a closer look at your code.

I quite like the digester model and have found it very handy for dealing
with certain types of XML (mostly XML from the Java world I guess).

The article I read did seem to be oriented to building a tree (of Beans)
in memory (so aren’t we sort of back to DOM?) but I gather that you can
provide other custom methods to do other kinds of processing.

Yes, most of the standard rules are geared towards building DOM-like
trees, though instead of a tree representing the XML they allow an
arbitrary tree of objects to be built based on the XML, with rules to
take object attribute values from XML attributes, tag bodies, and so on.

You can just plug in your own rule implementations to do pretty much
what you like - they’re basically just SAX handlers (at least in the
Ruby implementation, IIRC there’s a bit more abstraction in the Java
original).

Cheers,