Magic/xml library for easy XML processing

fmwr · August 5, 2006, 3:19am

Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl’s XML::Twig. Basically, easy things
should be easy, and everything should be integrated very tightly
with Ruby.

The code and (rather incomplete) documentation
are here → http://zabor.org/taw/magic_xml/

A few examples, so you can quickly see whether you’re interested or not:

Parse the Atom feed for my blog and print post titles and URLs:

doc = XML.from_url "taw's blog"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

Get my del.icio.us posts about magic/xml and format them
as an XHTML list (for magic/xml’s website):

deli_passwd = File.read("/home/taw/.delipasswd").chomp
url = "http://taw:#{deli_passwd}@del.icio.us/api/posts/recent?tag=taw+blog+magicxml"
XML.from_url(url).children(:post).reverse.each_with_index {|p,i|
  print XML.li("#{i+1}. ", XML.a({:href => p[:href]}, p[:description]))
}

Extract article titles and IDs from a Wikipedia dump. It keeps only
small fragments in memory, but provides all the convenient access
methods (works like XML::Twig, but with a much nicer interface):

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete!
  t = node.children(:title)[0].contents
  i = node.children(:id)[0].contents
  print "#{i}: #{t}\n"
}
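
For readers who want to try the streaming idea without magic/xml installed, here is a rough sketch using only REXML's stream parser (the post says magic/xml currently sits on top of REXML's parse events). This is not magic/xml's implementation: the `PageListener` class and the sample input are made up for illustration, and the sketch assumes `<title>` and `<id>` are direct children of `<page>`.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: holds only the <page> currently being read,
# so memory use is bounded by one page rather than by the whole dump.
class PageListener
  include REXML::StreamListener

  attr_reader :pages

  def initialize
    @pages = []    # [id, title] pairs; a real script would print and discard
    @path = []     # stack of currently open element names
    @current = nil # fields of the page being read, or nil outside <page>
  end

  def tag_start(name, _attrs)
    @path.push(name)
    @current = { 'title' => +'', 'id' => +'' } if name == 'page'
  end

  def text(data)
    return unless @current
    field = @path[-1]
    # Keep text only from direct children of <page> (skips <revision><id>).
    @current[field] << data if @path[-2] == 'page' && @current.key?(field)
  end

  def tag_end(name)
    @path.pop
    return unless name == 'page'
    @pages << [@current['id'], @current['title']]
    @current = nil
  end
end

# Made-up input in the rough shape of a Wikipedia dump.
SAMPLE = <<~XML
  <mediawiki>
    <page><title>Ruby</title><id>1</id><revision><id>99</id></revision></page>
    <page><title>XML</title><id>2</id></page>
  </mediawiki>
XML

listener = PageListener.new
REXML::Document.parse_stream(SAMPLE, listener)
listener.pages.each { |id, title| puts "#{id}: #{title}" }
```

The amount of bookkeeping needed here (a path stack, a current-page hash) is the kind of thing the twig-style interface above hides.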

More about stream processing with magic/xml at

The most important thing to do would be to find cases
where other libraries are more expressive than magic/xml
and fix these cases if possible. As I don’t know half
of the other libraries, and you certainly do, I need your help here.

And I guess I should also add XPath, port to a faster XML parser
(currently using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha.

fmwr · August 5, 2006, 5:03pm

On Aug 4, 2006, at 8:17 PM, Tomasz W. wrote:

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl’s XML::Twig. Basically, easy things
should be easy, and everything should be integrated very tightly
with Ruby.

This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML ™ more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.

And I guess I should also add XPath, port to a faster XML parser
(currently using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha

Yes, please, XPath! It could be I’m the only one who likes XPath,
but I find it a great way to pluck data out of XML.

This library looks really good. I’m going to keep it in mind for all
my “some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together” needs.

fmwr · August 5, 2006, 7:36pm

On 8/5/06, Adam K. wrote:

This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML ™ more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.

[…]

This library looks really good. I’m going to keep it in mind for all
my “some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together” needs.

Handling huge XML files is just a bonus. The main reason the library
exists is its sheer expressive power.

I tried to recode W3C’s XQuery Use Cases (XML Query Use Cases)
in magic/xml to see how it compares with XQuery on XQuery’s terms,
and they’re very close. For the use cases I have translated so far,
the results are below (character counts, with whitespace merged and
a few other transformations applied to make the comparison more
meaningful):

Problem XMP 1: Ruby 187 (114%), XQuery: 164
Problem XMP 2: Ruby 132 (100%), XQuery: 132
Problem XMP 3: Ruby 115 (103%), XQuery: 112
Problem XMP 4: Ruby 400 (101%), XQuery: 398
Problem XMP 5: Ruby 367 (124%), XQuery: 296
Problem XMP 6: Ruby 220 (104%), XQuery: 211
Problem XMP 7: Ruby 232 (135%), XQuery: 172
Problem XMP 8: Ruby 150 (88%), XQuery: 170
Problem XMP 9: Ruby 157 (129%), XQuery: 122
Problem XMP 10: Ruby 298 (142%), XQuery: 210
Problem XMP 11: Ruby 295 (136%), XQuery: 217
Problem XMP 12: Ruby 457 (118%), XQuery: 387
Problem Tree 1: Ruby 166 (61%), XQuery: 270
Problem Tree 2: Ruby 118 (109%), XQuery: 108
Problem Tree 3: Ruby 133 (101%), XQuery: 132
Problem Tree 4: Ruby 75 (93%), XQuery: 81
Problem Tree 5: Ruby 168 (104%), XQuery: 161
Problem Tree 6: Ruby 255 (69%), XQuery: 369
Total: Ruby 3925 (106%), XQuery: 3712
Median ratio: 104%
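
For anyone who wants to check the figures, the percentages appear to be the Ruby character count divided by the XQuery count, rounded to a whole percent. A small sketch under that assumption, with the counts copied from the table above (the constant and variable names are just for illustration):

```ruby
# Character counts copied from the table above: [problem, ruby, xquery].
COUNTS = [
  ['XMP 1', 187, 164], ['XMP 2', 132, 132], ['XMP 3', 115, 112],
  ['XMP 4', 400, 398], ['XMP 5', 367, 296], ['XMP 6', 220, 211],
  ['XMP 7', 232, 172], ['XMP 8', 150, 170], ['XMP 9', 157, 122],
  ['XMP 10', 298, 210], ['XMP 11', 295, 217], ['XMP 12', 457, 387],
  ['Tree 1', 166, 270], ['Tree 2', 118, 108], ['Tree 3', 133, 132],
  ['Tree 4', 75, 81], ['Tree 5', 168, 161], ['Tree 6', 255, 369]
].freeze

# Ruby size as a rounded percentage of the XQuery size, per problem.
ratios = COUNTS.map { |_, ruby, xquery| (100.0 * ruby / xquery).round }

ruby_total   = COUNTS.sum { |_, ruby, _| ruby }
xquery_total = COUNTS.sum { |_, _, xquery| xquery }
total_ratio  = (100.0 * ruby_total / xquery_total).round

# Median of an even-sized list: mean of the two middle values.
sorted = ratios.sort
mid = sorted.size / 2
median = sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0

puts "Total: Ruby #{ruby_total} (#{total_ratio}%), XQuery: #{xquery_total}"
puts "Median ratio: #{median.round}%"
```

Run as written, this reproduces the Total and Median lines of the table.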

I don’t think any other Ruby library for XML can get anywhere
close to such results. And efficient processing of large XML files?
That’s just a small freebie.

fmwr · August 5, 2006, 8:09pm

“Tomasz W.” writes:

port to a faster XML parser (currently
using REXML to get a stream of XML parse events)

You probably want to investigate xmlparser (Expat-based), or one of
the Libxml2-based ones, such as XML::Smart.

fmwr · August 6, 2006, 6:13am

Zed S. wrote:

doc = XML.from_url “taw's blog”

Especially considering you overload [] to get attributes, why not / to
get children?

http://cherry.rubyforge.org
http://rubyforge.org/projects/cherry/

T.

fmwr · August 6, 2006, 5:31am

On Sat, 2006-08-05 at 10:17 +0900, Tomasz W. wrote:

Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl’s XML::Twig. Basically, easy things
should be easy, and everything should be integrated very tightly
with Ruby.

doc = XML.from_url "taw's blog"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I’d like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|
…
}

Especially considering you overload [] to get attributes, why not / to
get children?

fmwr · August 6, 2006, 12:55pm

On 8/6/06, Zed S. wrote:

…
}

Especially considering you overload [] to get attributes, why not / to
get children?

Basically because there are three reasonable things to do with node[:foo]:

1. return attribute :foo
2. return the first child with tag :foo
3. return list of children with tag :foo

Ruby is not Perl, so we cannot have both 2 and 3 folded into one,
and doing only the second or only the third doesn’t sound that convincing.

Another issue is that I’d have to overload Array#/ to get
(doc/:entry/:link) working, and that would have a much higher mental
cost than adding a long-named method like #children to it. Or I could
use something other than an Array for sequences of XML nodes (hpricot
does so with Hpricot::Elements), but that wouldn’t be nice. I’ll look
at it again after I have all W3C XQuery Use Cases recoded.
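
To make the trade-off concrete, here is a minimal sketch of the second option: a wrapper class for node sequences that defines `/`, so that `Array` itself stays untouched. The `Node` and `NodeList` classes are hypothetical and exist only for this illustration; this is not how magic/xml or Hpricot implement it.

```ruby
# Hypothetical minimal node: a tag name, an attribute hash, and child nodes.
class Node
  attr_reader :name, :attrs, :kids

  def initialize(name, attrs = {}, kids = [])
    @name = name
    @attrs = attrs
    @kids = kids
  end

  # node[:foo] means "attribute :foo" and nothing else.
  def [](key)
    attrs[key]
  end

  # node / :foo returns the children tagged :foo, wrapped in a NodeList.
  def /(tag)
    NodeList.new(kids.select { |k| k.name == tag })
  end
end

# Wrapper used instead of a plain Array, so Array#/ is never defined.
class NodeList
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # list / :foo collects the :foo children of every node in the list.
  def /(tag)
    NodeList.new(@nodes.flat_map { |n| (n / tag).to_a })
  end
end

# Made-up document: one feed entry with two links.
link1 = Node.new(:link, { rel: 'alternate', href: 'http://example.org/1' })
link2 = Node.new(:link, { rel: 'self', href: 'http://example.org/2' })
entry = Node.new(:entry, {}, [link1, link2])
doc   = Node.new(:feed, {}, [entry])

hrefs = (doc / :entry / :link).select { |c| c[:rel] == 'alternate' }
                              .map { |c| c[:href] }
puts hrefs
```

The cost shows up at the edges: `select` and `map` hand back plain Arrays, so `/` stops working after the first Enumerable call unless every such method is re-wrapped. That is presumably part of why a non-Array sequence type "wouldn't be nice".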