Rexml

pdg · November 7, 2006, 11:07am

Hi All,

As a first exercise with Ruby, I am going through the Pickaxe book and
creating a jukebox. I haven’t even tried to create an array of songs
yet, because I got distracted and wanted to work this out. I am trying
to feed it the data from my iTunes XML file. I
can get it to work if I delete most of the xml file, but when it’s 5-6
gig, rexml just seems to die. I have vaguely heard that stream parsing
may be the answer, but am totally unaware of how to use it.

Here is the code in my XML reading program so far (sample.rb basically
just creates song items):

require 'rexml/document'
require "sample.rb"

doc = File.open("iTunes.xml")
xml = REXML::Document.new(doc)
name = "name"
artist = "artist"
time = 60
cnt = 0
# Each <key> element is followed by a sibling element holding its value.
xml.elements.each("//key") do |k|
  if k.text == "Name" then
    name = k.next_sibling.text
    cnt += 1
  end
  if k.text == "Artist" then
    artist = k.next_sibling.text
  end
  if k.text == "Total Time" then
    # iTunes stores the duration in milliseconds
    time = k.next_sibling.text.to_i/1000.0
    song = Song.new(name,artist,time)
    puts song.to_s
  end
end
puts cnt

pdg · November 7, 2006, 11:08am

pdg wrote:

Hi All,

As a first exercise with Ruby, I am going through the Pickaxe book and
creating a jukebox. I haven’t even tried to create an array of songs
yet, because I got distracted and wanted to work this out. I am trying
to feed it the data from my iTunes XML file. I
can get it to work if I delete most of the xml file, but when it’s 5-6
gig,

OMFG. That’s a -huge- XML file. Probably all of my MP3s together would
fit into there with base64-encoded contents

rexml just seems to die. I have vaguely heard that stream parsing
may be the answer, but am totally unaware of how to use it.

Well, time to learn. I probably never even saw a computer that could
handle an XML file that size using straightforward DOM parsing - which
normally “blows up” the original XML document’s size in bytes five times
and more. And REXML definitely doesn’t have performance of any kind
amongst its qualities. (And for completeness’ sake, I never ‘clicked’
with the API either, but I’m a minority there.)

You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither. That means libxml2, expat, or Xerces.
Compiling Required - I think the one-click installer comes with one of
these, buggered if I know which.

After that, Google is your friend. Look at the documentation for
whichever parser you decide on and work from that. Personally, I do
little to no non-tree XML parsing, so I’m mainly guessing here.
The main difference is that while with REXML, you can
arbitrarily look around the XML document, with stream and pull parsing,
you can only process the document in order, and have to keep the state
of that processing (e.g. which track you’re currently “working on”) in
your Ruby code.

David V.

pdg · November 7, 2006, 11:08am

On Tue, Nov 07, 2006 at 08:03:40AM +0900, David V. wrote:

OMFG. That’s a -huge- XML file. Probably all of my MP3s together would
amongst its qualities. (And for completeness’ sake, I never ‘clicked’
with the API either, but I’m a minority there.)

You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither. That means libxml2, expat, or Xerces.
Compiling Required - I think the one-click installer comes with one of
these, buggered if I know which.

Ruby comes with a pull parser in the standard lib:
http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/classes/REXML/Parsers/PullParser.html

I would give it a try on a document that large.
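
For anyone who has not used it, here is a minimal sketch of the pull parser's event loop. The inline sample is a made-up stand-in for an iTunes-style file, and `key_names` is a hypothetical helper name, not something from the REXML docs. It only shows the shape of the loop: ask for the next event, look at its type, keep a little state.

```ruby
require 'rexml/parsers/pullparser'

# Tiny inline stand-in for an iTunes-style plist (assumption: the real
# file uses the same <key>/<string>/<integer> pairs inside <dict>).
SAMPLE_XML = <<~XML
  <plist version="1.0">
  <dict>
    <key>Name</key><string>Song A</string>
    <key>Artist</key><string>Band A</string>
    <key>Total Time</key><integer>215000</integer>
  </dict>
  </plist>
XML

# Walk the events in document order and collect the text of every <key>.
def key_names(source)
  parser = REXML::Parsers::PullParser.new(source)
  keys = []
  in_key = false                     # state we have to track ourselves
  while parser.has_next?
    event = parser.pull
    if event.start_element?
      in_key = (event[0] == 'key')   # event[0] is the tag name
    elsif event.text? && in_key
      keys << event[0]               # event[0] is the raw text
    elsif event.end_element?
      in_key = false
    end
  end
  keys
end

p key_names(SAMPLE_XML)
```

The same call should also accept an open File instead of a String, so the whole document never has to be turned into a tree.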

pdg · November 7, 2006, 11:08am

David V. wrote:

Well, time to learn. I probably never even saw a computer that could

Actually, I recently had to rewrite an xml parser to go stream ( SAX )
style … REXML made the task VERY easy …

Yes, it’s not the fastest thing there is, but it was “fast enough” …

Definitely try writing it with REXML before taking the route of anything
heavier.

jd

pdg · November 7, 2006, 11:08am

Is that a mistake? Out of curiosity I took a look on my wife’s computer
(she’s the iPod user) and her XML file was only 231KB. The structure
of it conforms to the code you shared, so I know it’s the right file…

Did you mean to say MB instead of GB?

-Matt

pdg · November 7, 2006, 11:08am

I wish I had the foggiest idea of what you guys were talking about.
(Roobist here)
I’m still working on Y’s book.

–
You have a new sung; unsung.
I sing a song falling upon deaf ears,
unsung.

skt
www.freewebs.com/scottygiveshighfives

pdg · November 7, 2006, 11:08am

If you want speed, look at libxml-ruby. It is many many times faster
than REXML, and it supports SAX parsing as well.

Mark

pdg · November 7, 2006, 11:08am

skt wrote:

I wish I had the foggiest idea of what you guys were talking about.
(Roobist here)
I’m still working on Y’s book.

Wait… You chimed in on an unrelated thread with a “I don’t understand
any of this, FYI” comment?!

The mind, it boggles.

For the record, this isn’t a general chat channel. As such, derailing
threads is to be done more subtly

David V.

pdg · November 7, 2006, 11:08am

On Nov 6, 2006, at 5:03 PM, David V. wrote:

I probably never even saw a computer that could
handle an XML file that size using straightforward DOM parsing

This is off-topic but I have a theory that it’s possible using a
variant of the Flyweight pattern with index offsets into the document
and reparsing individual tags on demand. (I would use weak
referencing to cache them after a parse.)

I’ve been meaning to code up a proof of concept here and just haven’t
had time yet…

You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither.

REXML includes a stream parser.
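
A rough sketch of what using it can look like, assuming an iTunes-style layout of `<key>` elements. `KeyCounter`, `count_keys` and the sample data are made up for illustration; the listener callbacks and `REXML::Document.parse_stream` are the parts that come from REXML.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts how often each <key> label occurs.
class KeyCounter
  include REXML::StreamListener
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
    @in_key = false
  end

  # Called for every opening tag, in document order.
  def tag_start(name, _attrs)
    @in_key = (name == 'key')
  end

  # Called for character data; only count it inside <key>.
  def text(data)
    @counts[data] += 1 if @in_key
  end

  def tag_end(_name)
    @in_key = false
  end
end

def count_keys(source)
  listener = KeyCounter.new
  REXML::Document.parse_stream(source, listener)
  listener.counts
end

# Made-up sample data standing in for the real file.
STREAM_SAMPLE = <<~XML
  <plist version="1.0">
  <dict>
    <dict>
      <key>Name</key><string>Song A</string>
      <key>Artist</key><string>Band A</string>
      <key>Total Time</key><integer>215000</integer>
    </dict>
    <dict>
      <key>Name</key><string>Song B</string>
      <key>Artist</key><string>Band B</string>
      <key>Total Time</key><integer>180500</integer>
    </dict>
  </dict>
  </plist>
XML

p count_keys(STREAM_SAMPLE)
```

The difference from the pull parser is mostly who drives: here REXML calls your methods as it reads, rather than your loop asking for the next event.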

James Edward G. II

pdg · November 7, 2006, 11:08am

James Edward G. II wrote:

REXML includes a stream parser.

So it does, my bad.

David V.

pdg · November 7, 2006, 11:08am

Best to lean towards a database approach when you get to large files.
Neat thing working with XML & REX.
Then you can go to SleepyCat DBxml.
Though the routines are different, that’s fer sure.
Someone has a neat Ruby lib for it out there.
Away from my machines for details.

Markt

pdg · November 7, 2006, 11:08am

Assuming I go with the Ruby pull parser, how do I use this in my code?
I see the code sample from the link, but I have no idea how to throw
that into my code and make it work. Any suggestions?

Thanks for the discussion so far.

PS: idiot (slaps head). Yes it was 5-6meg not gig!

pdg · November 7, 2006, 11:08am

Mark T wrote:

Best to lean towards a database approach when you get to large files.
Neat thing working with XML & REX.
Then you can go to SleepyCat DBxml.
Though the routines are different, that’s fer sure.
Someone has a neat Ruby lib for it out there.
Away from my machines for details.

Markt

He’s not the one creating the file. So unless you can persuade Apple to
use a XML DB to store iTunes playlists…

(PS: The whole concept of XML DBs is an abomination. The XML Infoset
concept looks like a bloated cloudfest compared to relational data
storage…)

David V.

pdg · November 7, 2006, 11:08am

pdg wrote:

Assuming I go with the Ruby pull parser, how do I use this in my code.
I see from the link the code sample, but I have no idea how to throw
that into my code and make it work. Any suggestions.

Generally, you should have some layer between XML input, and processing
the records themselves. E.g. a trivial Song class, or at least a hash.
Personally, I’d make a XMLSongList class that’s enumerable (implements
#each), and rework the REXML code that works for small files into one
that yields a Song object for each of the records in succession by
querying the tree accordingly.

That shouldn’t then be too hard to rework so that while #each is
running, it opens a pull parser, and for each yield, builds up a Song
object going through the record in the order how the elements appear in
the XML file, instead of a random one. Once you isolate the code that
manipulates the XML to the smallest significant unit (a song record in
this case, I presume), it shouldn’t be conceptually that difficult to
rework from a tree parser to a pull parser. The code probably will get a
little messier and verbose, but the main shift of thinking is in not
asking the XML for what your object needs, but feeding an object what
the XML has.

PS: idiot (slaps head). Yes it was 5-6meg not gig!

6MB is still Huge ™ for a XML file.
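
To make the XMLSongList idea above a bit more concrete, here is one possible sketch using REXML's pull parser. It is not the only way to do it. Assumptions: the `Song` struct is a stand-in for the Pickaxe Song class, the sample data is invented, and a `<dict>` counts as a song record if it has both a "Name" and a "Total Time" key (which is how this sketch skips playlist entries).

```ruby
require 'rexml/parsers/pullparser'

# Stand-in for the Pickaxe Song class (assumption: name, artist, duration).
Song = Struct.new(:name, :artist, :duration)

class XMLSongList
  include Enumerable

  # source may be a String or an open IO such as File.open("iTunes.xml")
  def initialize(source)
    @source = source
  end

  # Yields one Song per track record, in document order.
  def each
    return enum_for(:each) unless block_given?
    parser = REXML::Parsers::PullParser.new(@source)
    dicts = []   # one Hash per <dict> that is currently open
    key = nil    # text of the most recent <key>
    text = ''    # character data of the element being read
    while parser.has_next?
      event = parser.pull
      if event.start_element?
        dicts.push({}) if event[0] == 'dict'
        text = ''
      elsif event.text?
        text += event[0]   # raw text; entities are not decoded here
      elsif event.end_element?
        case event[0]
        when 'key'
          key = text
        when 'dict'
          record = dicts.pop
          key = nil
          if record.key?('Name') && record.key?('Total Time')
            yield Song.new(record['Name'], record['Artist'],
                           record['Total Time'].to_i / 1000.0)
          end
        else
          # a value element such as <string> or <integer>
          dicts.last[key] = text if key && dicts.last
          key = nil
        end
      end
    end
  end
end

# Invented sample data; the second track has no Artist on purpose.
LIBRARY_XML = <<~XML
  <plist version="1.0">
  <dict>
    <key>Tracks</key>
    <dict>
      <key>1</key>
      <dict>
        <key>Name</key><string>Song A</string>
        <key>Artist</key><string>Band A</string>
        <key>Total Time</key><integer>215000</integer>
      </dict>
      <key>2</key>
      <dict>
        <key>Name</key><string>Song B</string>
        <key>Total Time</key><integer>180500</integer>
      </dict>
    </dict>
  </dict>
  </plist>
XML

XMLSongList.new(LIBRARY_XML).each do |song|
  puts "#{song.name} (#{song.artist || 'unknown'}) #{song.duration}s"
end
```

Because the class includes Enumerable, the rest of the jukebox code can use `map`, `select` and friends without knowing that a parser is running underneath. This is the "feed the object what the XML has" direction: the record is filled as keys go by, and the Song is built only when its `<dict>` closes.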

pdg · November 7, 2006, 9:46pm

Hi David (or others)

I am still not sure I get it, could you explain a little more?

Thanks,
Paul.

pdg · November 7, 2006, 11:08am

I created a sample program to parse the iTunes XML using
Chilkat XML here:
http://www.example-code.com/ruby/ruby-parse-itunes-xml.asp

Unfortunately, it only runs on Windows. (Sorry!) It is freeware
however.

Here’s the example source. I suspect you won’t have memory problems
with it.
If you try it, please let me know how fast it runs and whether it uses
too much memory…

require 'chilkat'

# The Chilkat XML parser for Ruby is freeware.
xml = Chilkat::CkXml.new()
xml.LoadXmlFile("c:/temp/itunes.xml")

# Search for this node: Tracks
tracksKey = xml.SearchForContent(xml,"key","Tracks")

# Assuming it's found, the node is the next sibling
dict = tracksKey.NextSibling()

# Loop over the child nodes...
n = dict.NumChildrenHavingTag("dict")
for i in 0..(n-1)
  trackRec = dict.GetNthChildWithTag("dict",i)
  print "Name: " + trackRec.GetChildExact("key","Name").NextSibling().content + "\n"
  print "Artist: " + trackRec.GetChildExact("key","Artist").NextSibling().content + "\n"
  print "Time: " + trackRec.GetChildExact("key","Total Time").NextSibling().content + "\n"
end

-Matt

pdg · November 7, 2006, 10:34pm

I tested the Chilkat XML parser (an in-memory DOM) on a 21MB XML file
that looks like this:

yuy25uiFfakuA7ZbA3jP48rpfSgWAhn3i3lDp3rfNqf6kzUqlqVZ0b4daYWQVjfXvb0AdxStTEST Ki78Ypx8FlbZ340PK6u2DsZQEqbFawBo0mCifTZK5YT0Tur8EXP29c5Hi2HjsfGB4EzWR3FtTEST ...

(the data is random garbage…)

The XML is parsed in 11.5 seconds on a 1.8GHz Pentium 4. Peak memory
usage is 180MB.
I don’t think the parser would break a sweat on the 6MB file…

I uploaded the XML test data to:
http://www.example-code.com/downloads/bigXml.zip
The code for parsing the iTunes XML is easy:
http://www.example-code.com/ruby/ruby-parse-itunes-xml.asp

-Matt

pdg · November 8, 2006, 4:27am

If you want, send me your test code + file and I’ll have a look…

(tomorrow morning though)

Don’t worry about a large attachment, just send it zipped…

Best Regards,
Matt

pdg · November 8, 2006, 3:27am

Hi thanks for the link, it seems to be working much better, but…

It’s getting to about the 1000th file and doing its job, but then
returning the following error:

undefined method ‘NextSibling’ for nil:NilClass (NoMethodError) from
rtunes.rb (which is basically just your sample code).

Does this mean I have a broken xml file? Or is something else the
matter?
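
For what it is worth, that message only says that some lookup returned nil before NextSibling was called on it. One likely cause, though nothing in this thread confirms it, is a track that simply has no "Artist" (or "Name") key, which would not mean the file is broken. A sketch of guarding against that, written with REXML rather than Chilkat so that it runs anywhere; `value_for` and the sample track are made up for illustration:

```ruby
require 'rexml/document'

# Invented track record with no Artist key.
TRACK_XML = <<~XML
  <dict>
    <key>Name</key><string>Song B</string>
    <key>Total Time</key><integer>180500</integer>
  </dict>
XML

# Return the text of the value element that follows <key>label</key>,
# or the fallback when the track has no such key.
def value_for(track, label, fallback = nil)
  key = track.elements.to_a('key').find { |k| k.text == label }
  return fallback if key.nil?   # without this check, the next line raises
  value = key.next_element
  value ? value.text : fallback
end

track = REXML::Document.new(TRACK_XML).root
puts value_for(track, 'Name')
puts value_for(track, 'Artist', '(unknown artist)')
```

The same idea should carry over to the Chilkat sample: check the result of the key lookup for nil before asking for its sibling.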

pdg · November 8, 2006, 5:11am

Thanks,

By the way, the Chilkat XML parser is not better than
REXML, it’s just different. To give you a little history, it was
originally developed about 7 years ago to handle:

1. Large XML data files where the MSXML parser was s-l-o-w.
   At one point, I remember Chilkat XML parsing files in a few seconds
   that took MSXML minutes to parse. However, since then MSXML has
   improved to the point where it’s as good or better in speed…
2. I wanted to create a parser that was forgiving with errors.
   Back then, it was a nightmare to have a large XML file with one
   small error, perhaps a byte or two that didn’t fit the charset
   encoding, that would prevent the entire document from loading.
3. I wanted a parser that made it easy to do the common tasks
   I’m always faced with in XML – such as reading/writing config files.
4. I wanted to make it easy to do things not normally handled in
   an API – sorting, compression, encryption, loading / encoding binary
   data, etc.

If you give it a try – let me know what you think. Send me a request
for an example or two and I’ll be happy to provide what I can…

-Matt