Inverse of stream parser

stickstone · April 30, 2010, 6:26pm

I plan to parse a huge XML document (too big to fit into RAM) using a
stream parser. I can divide the stream into logical chunks which can be
processed individually. If a particular chunk fails, I want to append it
to an output XML file, which will contain all the failed chunks, and can
be patched up and retried.

To do this, I want to be able to regenerate the XML of the failed chunk,
preferably identical to how it was seen.

The options I can think of are:

- A stream parser which gives me the raw XML alongside each parsed
  item; I can concatenate the raw XML into a string.
- A stream parser which gives me the byte position of the current node,
  so I can seek back within the file to fetch the XML again.
- A stream parser which gives me events to identify the different parts
  of the XML, together with an inverse process into which I can replay
  the events and get the XML back again.

Playing with REXML StreamListener, I can get a series of method calls
like tag_start(…) and tag_end(…), and I can collect these into an
array; is there existing code which would let me squirt that array back
out and recreate the XML? Any other options I should be looking at?
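
To make the third option concrete, here is a minimal sketch of recording the listener calls and replaying them. The names EventRecorder, xml_escape and replay are made up for illustration and are not part of REXML. It assumes the listener receives text and attribute values with entities already decoded, so they are escaped again on output. It only handles elements, attributes and text, and it will not reproduce the input byte for byte in general: an empty element such as `<a/>` comes back as `<a></a>`, and quoting style and whitespace inside tags are lost.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Records each listener callback as [method_name, *args].
class EventRecorder
  include REXML::StreamListener
  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start(name, attrs)
    @events << [:tag_start, name, attrs.to_a]
  end

  def tag_end(name)
    @events << [:tag_end, name]
  end

  def text(str)
    @events << [:text, str]
  end
end

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Re-escapes characters that cannot appear literally in text or attributes.
def xml_escape(str)
  str.gsub(/[&<>"]/, XML_ESCAPES)
end

# Replays the recorded events and returns the regenerated XML as a string.
def replay(events)
  out = +''
  events.each do |kind, *args|
    case kind
    when :tag_start
      name, attrs = args
      out << "<#{name}"
      attrs.each { |key, value| out << %( #{key}="#{xml_escape(value)}") }
      out << '>'
    when :tag_end
      out << "</#{args[0]}>"
    when :text
      out << xml_escape(args[0])
    end
  end
  out
end

xml = '<item id="1">fish &amp; chips</item>'
recorder = EventRecorder.new
REXML::Document.parse_stream(xml, recorder)
puts replay(recorder.events)
```

For a chunked workflow, the events array would be reset at the start of each chunk and replayed into the failure file only when processing that chunk raises.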

Thanks,

Brian.

stickstone · April 30, 2010, 7:58pm

On 4/30/10, Brian C. wrote:

Playing with REXML StreamListener, I can get a series of method calls
like tag_start(…) and tag_end(…), and I can collect these into an
array; is there existing code which would let me squirt that array back
out and recreate the XML? Any other options I should be looking at?

From my experience, REXML is far too wimpy to deal with data on this
scale. (Among other things, it was too slow.) I suggest using the
‘stream parser’ (a misnomer, this is really a lexer) in libxml
instead. I don’t know for sure if it can reconstruct the original text
the way you want, but that should be possible.

I think the class you’d want is LibXML::XML::SaxParser. See
http://libxml.rubyforge.org/.

stickstone · April 30, 2010, 8:34pm

Morning,

On Fri, Apr 30, 2010 at 9:26 AM, Brian C. wrote:

I plan to parse a huge XML document (too big to fit into RAM) using a
stream parser. I can divide the stream into logical chunks which can be
processed individually. If a particular chunk fails, I want to append it
to an output XML file, which will contain all the failed chunks, and can
be patched up and retried.

If you aren’t completely against Perl - XML-Twig [1] has a tool called
xml_split [2] which works rather well at splitting XML files. You might
wish to split up your files into smaller files before even beginning the
processing; then, if a file fails to process, you just have the file in
hand. When finished you could smash the failed files back together using
xml_merge [3] from the same Perl package.

If there is some Ruby variant of this I couldn’t locate it, but that
never means much.

John

[1] - XML-Twig-3.34 - XML, The Perl Way - metacpan.org
[2] - xml_split - cut a big XML file into smaller chunks - metacpan.org
[3] - xml_merge - merge back XML files split with xml_split - metacpan.org

stickstone · April 30, 2010, 10:32pm

Would you care to use JRuby?

I don’t mind which stream parser, but Java is out

Since this is a bit of disposable code, I’ve decided to cheat. I
pretty-print the XML, then I can read it line-at-a-time using gets into
a buffer, identify a range of lines which forms a chunk, then parse the
buffer. On error I write out the buffer again.
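
A rough sketch of that line-at-a-time approach, under some loud assumptions: the element name `record`, the method name each_chunk and the processing block are all hypothetical; each chunk is assumed to open and close on lines of its own after pretty-printing; and elements of the same name are assumed not to nest.

```ruby
require 'stringio'

# Reads pretty-printed XML line by line, buffers each <record>...</record>
# range and yields the buffered text. If the block raises, the untouched
# buffer is appended to +failed+ so it can be patched up and retried.
def each_chunk(input, failed, tag: 'record')
  open_re  = /\A\s*<#{Regexp.escape(tag)}[\s>]/
  close_re = %r{</#{Regexp.escape(tag)}>\s*\z}
  buffer = nil
  while (line = input.gets)
    buffer = +'' if buffer.nil? && line =~ open_re
    next if buffer.nil?            # line is outside any chunk
    buffer << line
    next unless line =~ close_re   # chunk not finished yet
    begin
      yield buffer
    rescue StandardError
      failed << buffer
    end
    buffer = nil
  end
end

input = StringIO.new(<<~XML)
  <records>
    <record id="1">
      <value>10</value>
    </record>
    <record id="2">
      <value>oops</value>
    </record>
  </records>
XML

failed = StringIO.new
each_chunk(input, failed) do |chunk|
  # Stand-in for the real processing: raises ArgumentError on "oops".
  Integer(chunk[%r{<value>(.*)</value>}, 1])
end
puts failed.string
```

Because the failed chunk is written from the buffer rather than regenerated from parse events, it comes out exactly as it was read, indentation included.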

Thanks for all your suggestions.

stickstone · April 30, 2010, 8:42pm

On Fri, Apr 30, 2010 at 6:26 PM, Brian C. wrote:

Would you care to use JRuby?
That would give you access to top XML stream parsers, IIRC.
Just as an example: org.apache.xerces.parsers.SAXParser seems well
suited to your purpose. Although it is a little bit of work to
construct your XML fragments, it should be rather easy.

HTH
R.

stickstone · April 30, 2010, 10:57pm

Actually, now I think about it, I wrote some code about 5 years ago for
storing XML docs as rows in a SQL database, and one component of that
was indeed a tag stream to XML converter.

I never fully completed or released it, but I’ve just pushed it out to a
git repo anyway.

github.com/candlerb/zml/blob/master/lib/zml/stream.rb

require 'rexml/document'
require 'rexml/streamlistener'


module ZML

# This is the inverse of an REXML StreamParser. You call methods tag_start,
# tag_end, comment etc, and it writes the corresponding XML to the given
# stream. In other words, it converts a method-call stream into an XML
# text stream.
#
# Perhaps REXML should have had this already :-)
#
# FIXME: add prettyprint (which respects xml:space in stream)

class StreamToXML
  include REXML::StreamListener  # in case we forgot any methods

  # 'out' is the stream to which the result is to be written. It can
  # be any object which supports the '<<' method to write a string.

(This file has been truncated.)

stickstone · May 1, 2010, 2:17am

On Sat, May 1, 2010 at 12:04 AM, Tony A. wrote:

On Fri, Apr 30, 2010 at 2:32 PM, Brian C. wrote:

Would you care to use JRuby?

I don’t mind which stream parser, but Java is out

Why?
And to add insult to injury, by interfacing J*** with JRuby you do
not even see Java, you see a Ruby API.
( Just wanted to be clear about this )
R.

stickstone · May 1, 2010, 12:06am

On Fri, Apr 30, 2010 at 2:32 PM, Brian C. wrote:

Would you care to use JRuby?

I don’t mind which stream parser, but Java is out

Why?

stickstone · May 1, 2010, 9:53am

On May 1, 2010, at 2:17 AM, Robert D. wrote:

not even see Java, you see a Ruby API.
( Just wanted to be clear about this )

Just to be clear, too: By interfacing Java with JRuby, you get a Ruby
API that feels like it’s written by a Java consultant struggling on his
first steps to learn Ruby.

While I am impressed how well the integration of JRuby into Java works,
Java libraries without a handwritten layer above them still feel very
alien. So, you do see Java - a lot, actually.

Regards,
Florian

stickstone · May 1, 2010, 10:32am

On Sat, May 1, 2010 at 9:52 AM, Florian G. wrote:

Why?
And to add insult to injury, by interfacing J*** with JRuby you do
not even see Java, you see a Ruby API.
( Just wanted to be clear about this )

Just to be clear, too: By interfacing Java with JRuby, you get a Ruby API that feels like it’s written by a Java consultant struggling on his first steps to learn Ruby.

While I am impressed how well the integration of JRuby into Java works, Java libraries without a handwritten layer above them still feel very alien. So, you do see Java - a lot, actually.

Agreed. I was putting my bold statement to the test: when calling into
Java you need to honor the Java type checks, and there are no block
parameters.
Thus there remains lots of work to be done to adapt a given API to be
“rubyish”. My bad.
R.

stickstone · May 1, 2010, 5:07pm

Florian G. wrote:

Just to be clear, too: By interfacing Java with JRuby, you get a Ruby API that feels like it’s written by a Java consultant struggling on his first steps to learn Ruby.

While I am impressed how well the integration of JRuby into Java works, Java libraries without a handwritten layer above them still feel very alien.

Often true. However, the range of fast, reliable libraries is much
greater for Java than for Ruby.

Don’t spite yourself.

–
James B.

www.jamesbritt.com - Playing with Better Toys
www.ruby-doc.org - Ruby Help & Documentation
www.rubystuff.com - The Ruby Store for Ruby Stuff
www.neurogami.com - Smart application development

stickstone · May 1, 2010, 5:04pm

BTW, what about Nokogiri? http://nokogiri.org/Nokogiri/XML/SAX.html
I have not heard anything about its speed, though.
HTH
R.

stickstone · May 5, 2010, 7:41pm

On 4/30/2010 12:26 PM, Brian C. wrote:

Depending on how complicated the XML is, you may be able to use a
combination of self-parsing and XML libraries.

I’ve needed to handle arbitrarily large XML “streams” before in C,
Smalltalk, and Python. The “outer” XML was really a wrapper around (or
to connect) a bunch of XML fragments that were not large. We parsed the
“outer” XML until we located the fragment we were interested in, then
parsed it as though it were a complete XML document of its own. This
way we were able to handle XML files of arbitrary size by biting off
individual chunks.

In our case, the XML was coming across a network and there was no
knowing how big it would be. We also had the advantage that if the
/whole/ XML document was not well-formed (maybe a network error
interrupted it) we didn’t lose the fragments.
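
In Ruby, the same idea might look roughly like the sketch below. It is not the code we used; the method name each_fragment and the element name `entry` are invented for the example. It assumes fragments of the same name do not nest and that the closing tag text never appears inside CDATA or a comment. IO#gets with a separator string does the "outer" scanning, and each fragment is then handed to REXML as though it were a complete document.

```ruby
require 'rexml/document'
require 'stringio'

# Reads the stream one closing tag at a time and yields each complete
# <tag>...</tag> fragment as a string. Only the text between two closing
# tags is held in memory at once.
def each_fragment(io, tag)
  close = "</#{tag}>"
  open_re = /<#{Regexp.escape(tag)}[\s>]/
  while (piece = io.gets(close))
    next unless piece.end_with?(close)  # trailing or truncated data
    start = piece.index(open_re)
    next if start.nil?                  # closing tag without an opening tag
    yield piece[start..-1]
  end
end

# The outer document is cut off mid-fragment, as if the network dropped.
stream = StringIO.new('<feed><entry id="1"><title>a</title></entry>' \
                      '<entry id="2"><title>b</title></entry><entry id="3"><tit')

titles = []
each_fragment(stream, 'entry') do |xml|
  doc = REXML::Document.new(xml)        # parse the fragment on its own
  titles << doc.root.elements['title'].text
end
p titles
```

The two complete fragments are still processed even though the outer document never closes, which is the property described above. The raw fragment string is also available unchanged if it needs to be written out for a retry.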

stickstone · May 4, 2010, 10:57pm

On Sat, May 1, 2010 at 2:52 AM, Florian G. wrote:

Just to be clear, too: By interfacing Java with JRuby, you get a Ruby API that feels like it’s written by a Java consultant struggling on his first steps to learn Ruby.

I don’t know a lot of struggling Java consultants that have released
Java libraries used on a wide scale. In fact, I don’t know any
struggling Java consultants that have released libraries, period.
Maybe the APIs would be better if they did.

I think you’re overstating the problem. Many Java libraries are
overdesigned, this is true. But JRuby does more than just provide a
means to call them; it provides a lot of other niceties like passing
a block or arbitrary object as the implementation of an interface and
not having to convert or cast values all over.

I also don’t think it’s a whole lot better when people write C
extensions that just wrap a raw C API. If anything, C APIs are usually
underdesigned, and it becomes a mess just to fit them nicely into an
OO language. The truth is that just providing the ability to call from
Ruby a library written in C or Java isn’t always enough; but it’s a
hell of a lot easier to start with the Java library in JRuby, since
you don’t even have to compile anything.

Charlie