As a first exercise with Ruby, I am going through the Pickaxe book and
creating a jukebox. I haven’t even tried to create an array of songs
yet, because I got distracted and wanted to work this out. I am trying
to feed the data from my iTunes XML file into it. I can get it to work
if I delete most of the XML file, but when it’s 5-6 gig, REXML just
seems to die. I have vaguely heard that stream parsing may be the
answer, but am totally unaware of how to use it.
Here is the code in my XML-reading program so far (sample.rb basically
just creates Song objects):
require 'rexml/document'
require "sample.rb"        # defines the Song class

doc = File.open("iTunes.xml")
xml = REXML::Document.new(doc)

name   = "name"
artist = "artist"
time   = 60
cnt    = 0

xml.elements.each("//key") do |k|
  if k.text == "Name" then
    name = k.next_sibling.text
    cnt += 1
  end
  if k.text == "Artist" then
    artist = k.next_sibling.text
  end
  if k.text == "Total Time" then
    time = k.next_sibling.text.to_i / 1000.0
    song = Song.new(name, artist, time)
    puts song.to_s          # print each song as it is found
  end
end
As a first exercise with Ruby, I am going through the Pickaxe book and
creating a jukebox. I haven’t even tried to create an array of songs
yet, because I got distracted and wanted to work this out. I am trying
to feed the data from my iTunes XML file into it. I can get it to work
if I delete most of the XML file, but when it’s 5-6 gig,
OMFG. That’s a -huge- XML file. Probably all of my MP3s together would
fit in there with base64-encoded contents.
REXML just seems to die. I have vaguely heard that stream parsing
may be the answer, but am totally unaware of how to use it.
Well, time to learn. I probably never even saw a computer that could
handle an XML file that size using straightforward DOM parsing - which
normally “blows up” the document in memory to five times its on-disk
size, or more. And REXML definitely doesn’t have performance of any
kind amongst its qualities. (And for completeness’ sake, I never
‘clicked’ with the API either, but I’m a minority there.)
You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither. That means libxml2, expat, or Xerces.
Compiling Required - I think the one-click installer comes with one of
these, buggered if I know which.
After that, Google is your friend. Look at the documentation for
whichever parser you decide to use - personally, I do little to no
non-tree XML parsing, so I’m mainly guessing around on this. The main
difference is that while REXML lets you look around the XML document
arbitrarily, with stream and pull parsing you can only process the
document in order, and have to keep the state of that processing
(e.g. which track you’re currently “working on”) in your Ruby code.
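Very roughly, that state-keeping loop looks something like the
following. This is a completely untested sketch - it happens to use
REXML's bundled pull-parser class (REXML::Parsers::PullParser) just to
show the shape; a libxml2 or expat binding would look similar, only
faster - and it assumes Song comes from the sample.rb in the original
post.

require 'rexml/parsers/pullparser'
require "sample.rb"   # assumed to define Song, as in the original post

parser = REXML::Parsers::PullParser.new(File.new("iTunes.xml"))

element = nil   # tag we are currently inside
pending = nil   # which iTunes <key> the next value belongs to
name = artist = nil

while parser.has_next?
  event = parser.pull
  if event.start_element?
    element = event[0]                 # event[0] is the tag name
  elsif event.end_element?
    element = nil
  elsif event.text?
    case element
    when "key"
      pending = event[0]               # e.g. "Name", "Artist", "Total Time"
    when "string", "integer"
      case pending
      when "Name"       then name   = event[0]
      when "Artist"     then artist = event[0]
      when "Total Time"
        time = event[0].to_i / 1000.0
        puts Song.new(name, artist, time).to_s
      end
      pending = nil
    end
  end
end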
On Tue, Nov 07, 2006 at 08:03:40AM +0900, David V. wrote:
OMFG. That’s a -huge- XML file. Probably all of my MP3s together would
fit in there with base64-encoded contents.
[...]
You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither. That means libxml2, expat, or Xerces.
Compiling Required - I think the one-click installer comes with one of
these, buggered if I know which.
Is that a mistake? Out of curiosity I took a look on my wife’s computer
(she’s the iPod user) and her XML file was only 231KB. The structure
of it conforms to the code you shared, so I know it’s the right file…
I probably never even saw a computer that could
handle an XML file that size using straightforward DOM parsing
This is off-topic but I have a theory that it’s possible using a
variant of the Flyweight pattern with index offsets into the document
and reparsing individual tags on demand. (I would use weak
referencing to cache them after a parse.)
I’ve been meaning to code up a proof of concept here and just haven’t
had time yet…
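Very roughly - and completely untested - the shape I have in mind is
something like the following: one cheap indexing pass that records the
byte offsets of each track's <dict> record, then parsing individual
records on demand and caching the parsed fragments behind weak
references. The class name and the "depth 3" heuristic are invented
here purely for illustration.

require 'rexml/document'
require 'weakref'

class LazyTrackList
  def initialize(path)
    @path    = path
    @offsets = []   # [start_offset, length] for each track <dict>
    @cache   = {}   # index => WeakRef to an already-parsed fragment
    index_records
  end

  # Parse track i on demand, reusing a cached parse if it is still alive.
  def [](i)
    start, len = @offsets[i]
    return nil unless start
    ref = @cache[i]
    return ref.__getobj__ if ref && ref.weakref_alive?
    fragment = File.open(@path, "rb") { |f| f.seek(start); f.read(len) }
    doc = REXML::Document.new(fragment)
    @cache[i] = WeakRef.new(doc)
    doc
  end

  private

  # Naive offset pass: in an iTunes library file the per-track records
  # are the <dict> elements nested three levels deep.
  def index_records
    depth = 0
    start = nil
    File.open(@path, "rb") do |f|
      f.each_line do |line|
        if line.include?("<dict>")
          depth += 1
          start = f.pos - line.size if depth == 3
        elsif line.include?("</dict>")
          @offsets << [start, f.pos - start] if depth == 3 && start
          start = nil if depth == 3
          depth -= 1
        end
      end
    end
  end
end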
You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither.
Best to lean towards a database approach when you get to large files.
Neat thing working with XML & REX.
Then you can go to SleepyCat DBxml.
Though the routines are different, that’s fer sure.
Someone has a neat Ruby lib for it out there.
Away from my machines for details.
Assuming I go with the Ruby pull parser, how do I use this in my code?
I see the code sample from the link, but I have no idea how to throw
that into my code and make it work. Any suggestions?
Thanks for the discussion so far.
PS: idiot (slaps head). Yes, it was 5-6 meg, not gig!
Best to lean towards a database approach when you get to large files.
Neat thing working with XML & REX.
Then you can go to SleepyCat DBxml.
Though the routines are different, that’s fer sure.
Someone has a neat Ruby lib for it out there.
Away from my machines for details.
Markt
He’s not the one creating the file. So unless you can persuade Apple to
use an XML DB to store iTunes playlists…
(PS: The whole concept of XML DBs is an abomination. The XML Infoset
concept looks like a bloated cloudfest compared to relational data
storage…)
Assuming I go with the Ruby pull parser, how do I use this in my code?
I see the code sample from the link, but I have no idea how to throw
that into my code and make it work. Any suggestions?
Generally, you should have some layer between XML input and processing
the records themselves. E.g. a trivial Song class, or at least a hash.
Personally, I’d make an XMLSongList class that’s enumerable (implements
#each), and rework the REXML code that works for small files into one
that yields a Song object for each of the records in succession by
querying the tree accordingly.
That shouldn’t then be too hard to rework so that while #each is
running, it opens a pull parser, and for each yield, builds up a Song
object by going through the record in the order the elements appear in
the XML file, instead of in an arbitrary one. Once you isolate the code
that manipulates the XML to the smallest significant unit (a song
record in this case, I presume), it shouldn’t be conceptually that
difficult to rework from a tree parser to a pull parser. The code will
probably get a little messier and more verbose, but the main shift of
thinking is in not asking the XML for what your object needs, but
feeding an object what the XML has.
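Untested, but the skeleton I mean looks roughly like this - again
assuming Song comes from the sample.rb in the original post, and using
REXML's bundled REXML::Parsers::PullParser only as a stand-in for
whichever stream/pull parser you end up with:

require 'rexml/parsers/pullparser'
require "sample.rb"   # assumed to define Song

class XMLSongList
  include Enumerable

  def initialize(path)
    @path = path
  end

  # Walks the document in order, building up one Song at a time and
  # yielding it as soon as its "Total Time" value has been seen.
  def each
    parser  = REXML::Parsers::PullParser.new(File.new(@path))
    element = nil   # tag we are currently inside
    key     = nil   # which iTunes <key> the next value belongs to
    name = artist = nil

    while parser.has_next?
      event = parser.pull
      if event.start_element?
        element = event[0]
      elsif event.end_element?
        element = nil
      elsif event.text?
        case element
        when "key"
          key = event[0]
        when "string", "integer"
          case key
          when "Name"       then name   = event[0]
          when "Artist"     then artist = event[0]
          when "Total Time" then yield Song.new(name, artist, event[0].to_i / 1000.0)
          end
          key = nil
        end
      end
    end
  end
end

Callers then treat it like any other collection, e.g.
XMLSongList.new("iTunes.xml").each { |song| puts song }.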
PS: idiot (slaps head). Yes, it was 5-6 meg, not gig!
Unfortunately, it only runs on Windows. (Sorry!) It is freeware,
however.
Here’s the example source. I suspect you won’t have memory problems
with it.
If you try it, please let me know how fast it runs and whether it uses
too much memory…
require 'chilkat'

# The Chilkat XML parser for Ruby is freeware.
xml = Chilkat::CkXml.new()
xml.LoadXmlFile("c:/temp/itunes.xml")
By the way, the Chilkat XML parser is not better than REXML, it’s just
different. To give you a little history, it was originally developed
about 7 years ago to handle a few needs:

1. Large XML data files where the MSXML parser was s-l-o-w. At one
point, I remember Chilkat XML parsing files in a few seconds that took
MSXML minutes to parse. However, since then MSXML has improved to the
point where it’s as good or better in speed…

2. I wanted to create a parser that was forgiving with errors. Back
then, it was a nightmare to have a large XML file with one small error,
perhaps a byte or two that didn’t fit the charset encoding, that would
prevent the entire document from loading.

3. I wanted a parser that made it easy to do the common tasks I’m
always faced with in XML – such as reading/writing config files.

4. I wanted to make it easy to do things not normally handled in an
API – sorting, compression, encryption, loading/encoding binary data,
etc.

If you give it a try – let me know what you think. Send me a request
for an example or two and I’ll be happy to provide what I can…