Processing a huge XML file

Hey guys

I was wondering what advice anyone could hand me about processing a
huge XML file (in fact, it’s an XSD file).

Overall, it’s about 20,000 lines of XML to load. Even on my MacBook Pro
with 2GB of RAM, libxml-ruby eats memory extremely quickly (about 2.8GB
of virtual memory). This is obviously unacceptable, but I am not sure
whether a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Does anyone have ideas for a potential workaround?

Cheers

Tim

Tim P. wrote:

I was wondering what advice anyone could hand me about processing a
huge XML file (in fact, it’s an XSD file).

Overall, it’s about 20,000 lines of XML to load. Even on my MacBook Pro
with 2GB of RAM, libxml-ruby eats memory extremely quickly (about 2.8GB
of virtual memory). This is obviously unacceptable, but I am not sure
whether a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Does anyone have ideas for a potential workaround?

Run it in Windows? :slight_smile:

But seriously, 20k lines of XML should not take that much memory unless
the lines are HUGE. How about a simplistic approach? I know this is not
a very Ruby-centric idea, but it may help.

What if you were to open it in a browser? Browsers display XML files in
a formatted fashion, which means they must parse them first. You could
then search the resulting page for an error message; a simple text
search for “XML Parsing Error” should tell you whether it parsed.

Lloyd L. wrote:

Run it in Windows? :slight_smile:

But seriously, 20k lines of XML should not take that much memory unless
the lines are HUGE. How about a simplistic approach? I know this is not
a very Ruby-centric idea, but it may help.

What if you were to open it in a browser? Browsers display XML files in
a formatted fashion, which means they must parse them first. You could
then search the resulting page for an error message; a simple text
search for “XML Parsing Error” should tell you whether it parsed.

That’s a very fair point actually; if it runs in the browser, it must
be parsable. It’s actually 32,606 lines!
Firefox used 500MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought? Unless its DOM methodology
is just a lot more memory intensive?

What are people’s thoughts? Is it crazy to ask libxml to read that much
into memory?

Cheers

Tim

Tim P. wrote:

Firefox used 500MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought? Unless its DOM methodology
is just a lot more memory intensive?

I am new to Ruby and, as much as I love the language syntax, I have yet
to see how to actually use it in real-world applications. I know that
is likely to get me into trouble, since everyone else seems to manage
it, but there it is.

That said, I do not know the inner workings of Ruby well enough to dig
that far inside. However, it cannot be the DOM itself, since the
browser uses a DOM to parse as well. Something else must be making the
difference, and finding it goes beyond my Ruby knowledge.

Lloyd L. wrote:

That said, I do not know the inner workings of Ruby well enough to dig
that far inside. However, it cannot be the DOM itself, since the
browser uses a DOM to parse as well. Something else must be making the
difference, and finding it goes beyond my Ruby knowledge.

I wonder if it’s something to do with the XSD includes and imports that
it doesn’t like… I might have to ask the libxml core team.

Cheers

Tim

On Jul 23, 4:28 am, Tim P. [email protected] wrote:

I wanted to load the schema in order to validate the messages and XML
I was generating. Does anyone have ideas for a potential workaround?

libxml has some known issues, memory consumption especially. Hopefully
they will get fixed, but in the meantime one can only frown at the
irony – was one of the earliest Ruby web sites around, yet
Ruby’s support for fast XML processing is still dearly lacking.

T.

On 7/23/07, Tim P. [email protected] wrote:

I was wondering what advice anyone could hand me about processing a
huge XML file (in fact, it’s an XSD file).

Something’s going wrong. 20k lines is a pretty small XML file; we’re
sucking in files that are larger than that (50MB or so, a little less
than a million lines long) many times a day using the Ruby libxml
bindings and not seeing a similar issue. It’s possible that your
average line length is much longer than ours, of course. Our normal
process size is about 400MB, but a big chunk of that is the processing
we’re doing on the data; I want to say that the size after loading in
the XML is in the 200MB range, but I haven’t looked at that for a
while.

Are you doing stream processing? We never tried to load the whole
document at once, so there may be an issue doing that.

  • James M.

2007/7/23, Tim P. [email protected]:

I wanted to load in the schema in order to validate the messages and xml
I was generating. Has anyone any ideas on a potential work around?

The generic answer would be: use an XML stream parser (as opposed to a
DOM parser). Even if you directly fill up a model that contains the
whole document, it’s likely less resource-intensive than a DOM. Of
course it’s optimal (resource-wise) if you can do your validation on
the fly (i.e. while stream parsing).
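As an illustration of the stream approach, here is a minimal sketch
using Ruby’s bundled REXML stream parser (libxml-ruby’s own reader
API, if your version has one, would look different but works the same
way): the listener sees one event at a time and never builds a tree,
so memory stays roughly constant regardless of document size.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'rexml/parsers/streamparser'

# A stream listener receives parse events one at a time; no DOM is built.
class ElementCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  # Called once per opening tag as the parser streams through the input.
  def tag_start(_name, _attrs)
    @count += 1
  end
end

xml = '<root>' + ('<item>x</item>' * 1000) + '</root>'
listener = ElementCounter.new
REXML::Parsers::StreamParser.new(xml, listener).parse
puts listener.count # 1001 (the root plus 1000 items)
```

The same listener pattern applies to any check that can be evaluated
element-by-element, without ever holding the whole document in memory.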

Kind regards

robert

That file is too big to parse with a DOM in a reasonable amount of
time/system resources. If a file is really big, I usually write my own
custom parser in C++. It’s usually not that hard to write, and it will
take seconds to run as opposed to minutes or hours if I write the same
thing in Ruby or Python.

Tim P. [email protected] wrote:

Lloyd L. wrote:

therein. Just a text search for “XML Parsing Error” and that should
tell you if it worked.

That’s a very fair point actually; if it runs in the browser, it must
be parsable. It’s actually 32,606 lines!
Firefox used 500MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought? Unless its DOM methodology
is just a lot more memory intensive?

What are people’s thoughts? Is it crazy to ask libxml to read that much
into memory?

Cheers

Tim


I wrote a Ruby script which parses a 25GB XML file. I used the
XMLParser library from http://www.yoshidam.net/Ruby.html

So parsing a large amount of XML can definitely be accomplished.

-Ray

Hey all

thanks for your replies!

The file in question is actually an XSD file, so I think you’re right:
XML::Schema.new() would use DOM parsing. Does libxml even support
stream parsing? I can’t seem to find a great deal on it…

Has anyone ever had any experience with such a large XSD? I can’t see
how there would be a way of validating the instance XML without the
XSD being held in memory to check against.

How do things like Xerces manage it with Java?

I fear I might be wanting the impossible! lol

Cheers

-Tim

Good point, and thanks for the reply :slight_smile:

When you say “the parser could store an optimized representation in
memory”, what exactly do you mean?

Cheers

TP

On 29.07.2007 00:55, Tim P. wrote:

When you say “the parser could store an optimized representation in
memory”, what exactly do you mean?

XML is a generic format, so an XML DOM needs to be able to store all
variants. XSD is a specific format (as is every other format defined by
a DTD or even an XSD), and so you can craft a specific model that
represents XSD’s object model.

One example: since XML is markup you can have things like

<a>text<b>13</b>blah</a>

Any DOM implementation needs to be able to store “text” and “blah”. But
often, when XML is used to represent data, there is either text in an
element or nested elements but not both. An OO implementation then
would only need to allow for one of the two. Hope that clears it up.
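That mixed-content point can be seen directly in a DOM; here is a small
sketch using Ruby’s bundled REXML (the element names <a> and <b> are
just placeholders for illustration):

```ruby
require 'rexml/document'

# Mixed content: text and a child element interleaved in one parent.
# A generic DOM must keep every node, including both text fragments.
doc = REXML::Document.new('<a>text<b>13</b>blah</a>')

# Map each child of <a> to either its text value or its element name.
nodes = doc.root.children.map do |n|
  n.is_a?(REXML::Text) ? n.value : n.name
end
p nodes # ["text", "b", "blah"]
```

A format-specific model that forbids mixed content could drop the text
slots entirely, which is one place the memory savings come from.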

Kind regards

robert

2007/7/27, Tim P. [email protected]:

The file in question is actually an XSD file, so I think you’re right:
XML::Schema.new() would use DOM parsing. Does libxml even support
stream parsing? I can’t seem to find a great deal on it…

Has anyone ever had any experience with such a large XSD? I can’t see
how there would be a way of validating the instance XML without the
XSD being held in memory to check against.

Yes and no: since the XML (XSD in your case) is known, the parser could
store an optimized representation in memory (i.e. it does not need the
original DOM).
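A minimal sketch of that idea, using Ruby’s bundled REXML for
illustration (a real validator would need to keep far more than this,
so treat it only as the shape of the approach):

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'rexml/parsers/streamparser'

# Stream-parse an XSD and keep only the facts we care about -- here,
# element declarations -- instead of holding the whole schema as a DOM.
class SchemaIndex
  include REXML::StreamListener
  attr_reader :elements

  def initialize
    @elements = {}
  end

  def tag_start(name, attrs)
    # Record xs:element declarations; everything else is discarded.
    if name == 'xs:element' && attrs['name']
      @elements[attrs['name']] = attrs['type']
    end
  end
end

xsd = <<~XSD
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="price" type="xs:decimal"/>
    <xs:element name="note" type="xs:string"/>
  </xs:schema>
XSD

index = SchemaIndex.new
REXML::Parsers::StreamParser.new(xsd, index).parse
p index.elements # {"price"=>"xs:decimal", "note"=>"xs:string"}
```

The index grows with the number of declarations, not with the size of
the schema document itself.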

How do things like xerces manage it with java?

When a colleague tested JDOM a few years ago, it needed loads of
memory. But of course, that could have changed by now (and also, there
are 64-bit JVMs).

I fear I might be wanting the impossible! lol

“Impossible is nothing - Ruby…” :slight_smile:

Kind regards

robert