libxml-ruby SAX parsing with open-uri

Hi there,

I need to connect to a URL, download an XML document, and process it:
run through the XML document and save elements to the database.

There are many howtos on the internet about parsing XML files with SAX
by opening a file on the filesystem and reading through it, but I could
not find an example of how to read from a URL while processing the XML.

SAX will be useless if the content from the URL has to be downloaded
completely before processing it. The RAM will still fill up.

Does somebody have a solution for this problem, or maybe a sample
snippet showing how to deal with this in Ruby or Rails? I don't care
whether it is libxml, REXML, or something else, as long as less RAM is
used.

Thanks for your help.
Chris

On Jan 22, 2011, at 12:39 PM, Chris A. wrote:

SAX will be useless if the content from the URL has to be downloaded
completely before processing it. The RAM will still fill up.

Not sure if this will help, but if you use open-uri (part of Ruby's
standard library), you can open a URL as if it were a file. I use it in
a converter I'm working on right at this moment:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# here I'm loading the xsd from W3 directly
xsd = Nokogiri::XML::Schema(open('http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd'))

…etc…
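
Building on that, here is a rough sketch of streaming the same kind of
handle through Nokogiri's SAX parser instead of building a DOM. The
handler class, the 'item' element name, and the URL are made up for
illustration, so adjust them to your document:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Collect the text of each <item> element as it streams past.
class ItemHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @buffer = '' if name == 'item'
  end

  def characters(text)
    @buffer << text if @buffer
  end

  def end_element(name)
    if name == 'item'
      puts @buffer    # save to the database here instead of printing
      @buffer = nil
    end
  end
end

# open-uri hands back an IO-like object (StringIO or Tempfile), and the
# SAX parser reads from it without keeping a full tree in memory.
io = open('http://example.com/feed.xml')
Nokogiri::XML::SAX::Parser.new(ItemHandler.new).parse(io)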

I'm not at all sure that this will save you on RAM. I'm loading temp
files in another part of this script (from the filesystem) and ripping
through them with regular expressions one line at a time, but after all
that's done, I open the partially-transformed file with Nokogiri in one
large bite and do all sorts of things to it. Some of these files are
10-20MB of XML text. It's currently working fine inside a hard limit of
2GB of RAM. I wouldn't be surprised if Nokogiri does some very clever
things to manage its memory footprint, because it certainly works much
more efficiently than the previous generation of this system, which
used XSLT and Saxon and crapped out on anything over 6MB of input.
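
If the worry is that open-uri still downloads the whole body before you
can parse it (it does spool the response to a StringIO or Tempfile
first), one way around that is to push chunks from Net::HTTP straight
into a SAX push parser as they arrive. A rough sketch, again assuming
Nokogiri and a handler like the ItemHandler above, with a placeholder
URL:

require 'rubygems'
require 'nokogiri'
require 'net/http'
require 'uri'

uri = URI.parse('http://example.com/feed.xml')
parser = Nokogiri::XML::SAX::PushParser.new(ItemHandler.new)

Net::HTTP.start(uri.host, uri.port) do |http|
  http.request_get(uri.request_uri) do |response|
    # read_body yields the body in chunks, so each chunk is parsed and
    # discarded before the next one is fetched.
    response.read_body do |chunk|
      parser << chunk
    end
  end
end
parser.finish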

Walter

Quoting Chris A. [email protected]:

Hi there,

I need to connect to a URL, download an XML document, and process it:
run through the XML document and save elements to the database.

There are many howtos on the internet about parsing XML files with SAX
by opening a file on the filesystem and reading through it, but I could
not find an example of how to read from a URL while processing the XML.

The Ruby wrapper around the libxml2 C library (libxml-ruby) supports
the DOM model (parse the whole file into a data structure that can be
searched, edited, and written back out), SAX, and the Reader model. All
of them will handle any IO-like class: they don't have to have the
whole input in memory, but can repeatedly call IO#read to parse the
data as it is read rather than all at once. Unless you have a hard
requirement to use SAX, look at the Reader model; IMHO, it is easier to
use. The documentation (http://libxml.rubyforge.org/rdoc/index.html) is
very good. Better than libxml2's documentation, IMHO.
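
For what it's worth, a minimal Reader sketch along those lines might
look like the following. The URL and the 'item' element name are
placeholders, and note that open-uri spools the download to a temp file
before handing the IO to the reader:

require 'rubygems'
require 'libxml'
require 'open-uri'
include LibXML

io = open('http://example.com/feed.xml')
reader = XML::Reader.io(io)

# Walk the document node by node; only the current node is in memory.
while reader.read
  next unless reader.node_type == XML::Reader::TYPE_ELEMENT
  if reader.name == 'item'
    puts reader.read_string   # text content; save it to the DB instead
  end
end
reader.close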

HTH,
Jeffrey