Magic/xml library for easy XML processing

Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl’s XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

The code and (rather incomplete) documentation
are here → http://zabor.org/taw/magic_xml/

A few examples, so you can quickly see whether you’re interested or not :)

Parse the Atom feed for my blog and print post titles and URLs:

doc = XML.from_url “taw's blog
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

Get my del.icio.us posts about magic/xml and format them
as an XHTML list (for magic/xml’s website):

deli_passwd = File.read("/home/taw/.delipasswd").chomp
url = "http://taw:#{deli_passwd}@del.icio.us/api/posts/recent?tag=taw+blog+magicxml"
XML.from_url(url).children(:post).reverse.each_with_index {|p,i|
  print XML.li("#{i+1}. ", XML.a({:href => p[:href]}, p[:description]))
}
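
For anyone wondering how tag-named calls such as XML.li(...) and XML.a({...}, ...) can be made to work at all, here is a minimal sketch. It is not magic/xml's implementation; the TinyXML module, its Markup class and the escaping rules are all assumptions made up for illustration. A leading Hash is treated as attributes, everything else as children.

```ruby
# Hypothetical stand-in for a tag-named builder API; this is NOT magic/xml.
module TinyXML
  ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze
  TAG_NAME = /\A[a-z][a-z0-9]*\z/

  # Marks a string as already-serialized markup so it is not escaped twice.
  class Markup < String; end

  def self.escape(text)
    text.to_s.gsub(/[&<>"]/, ESCAPES)
  end

  # Any lowercase method name becomes an element: TinyXML.li(...) -> <li>...</li>
  def self.method_missing(tag, *args)
    return super unless tag.to_s.match?(TAG_NAME)
    attrs = args.first.is_a?(Hash) ? args.shift : {}
    attr_text = attrs.map { |key, value| %( #{key}="#{escape(value)}") }.join
    # Nested elements are kept as they are; plain strings are escaped.
    body = args.map { |arg| arg.is_a?(Markup) ? arg : escape(arg) }.join
    Markup.new("<#{tag}#{attr_text}>#{body}</#{tag}>")
  end

  def self.respond_to_missing?(name, include_private = false)
    name.to_s.match?(TAG_NAME) || super
  end
end

link = TinyXML.a({ :href => "http://example.org/?a=1&b=2" }, "Some <post>")
item = TinyXML.li("1. ", link)
puts item
```

The point of the sketch is only that element construction can read like ordinary method calls, which is what makes the one-liner above possible.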

Extract article titles and IDs from a Wikipedia dump. It keeps only
small fragments in memory, but still provides all the convenient access
methods (works like XML::Twig, but with a much nicer interface):

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete!
  t = node.children(:title)[0].contents
  i = node.children(:id)[0].contents
  print "#{i}: #{t}\n"
}
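
For comparison, here is roughly what the same job looks like when written directly against REXML's stream listener interface, with no twig layer on top. This is only a sketch: the PageListener class and the tiny inline dump are made up for illustration, and the real Wikipedia dump has more structure than this.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Collects <title> and the top-level <id> of each <page>, then forgets the page.
class PageListener
  include REXML::StreamListener

  def initialize(&on_page)
    @on_page = on_page
    @path = []    # names of the currently open elements
    @page = nil   # data for the <page> being read, if any
  end

  def tag_start(name, _attrs)
    @path.push(name)
    @page = { "title" => String.new, "id" => String.new } if name == "page"
  end

  def text(chunk)
    # Only direct children of <page> count, so <revision><id> is ignored.
    return unless @page && @path[-2] == "page" && @page.key?(@path[-1])
    @page[@path[-1]] << chunk
  end

  def tag_end(name)
    @path.pop
    return unless name == "page"
    @on_page.call(@page["id"], @page["title"])
    @page = nil   # drop the fragment so memory use stays flat
  end
end

# Made-up sample input standing in for STDIN.
dump = StringIO.new(<<~XML)
  <mediawiki>
    <page><title>Ruby</title><id>1</id><revision><id>99</id></revision></page>
    <page><title>XML</title><id>2</id></page>
  </mediawiki>
XML

pages = []
listener = PageListener.new { |id, title| pages << "#{id}: #{title}" }
REXML::Document.parse_stream(dump, listener)
puts pages
```

All the bookkeeping about where you are in the document has to be done by hand here, which is the part the twig-style interface is meant to hide.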

More about stream processing with magic/xml at

The most important thing to do would be to find cases
where other libraries are more expressive than magic/xml
and fix these cases if possible :) As I don’t know half
of the other libraries, and you certainly do, I need your help here :)

And I guess I should also add XPath, port to a faster XML parser (currently
using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha :)

On Aug 4, 2006, at 8:17 PM, Tomasz W. wrote:

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl’s XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML ™ more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.

And I guess I should also add XPath, port to a faster XML parser (currently
using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha :)

Yes, please, XPath! It could be I’m the only one who likes XPath,
but I find it a great way to pluck data out of XML.

This library looks really good. I’m going to keep it in mind for all
my “some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together” needs.

On 8/5/06, Adam K. wrote:

This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML ™ more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.

[…]

This library looks really good. I’m going to keep it in mind for all
my “some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together” needs.

Handling huge XML files is just a bonus. The main reason the library
exists is its sheer expressive power.

I tried to recode the W3C’s XML Query Use Cases in magic/xml to see how
it compares with XQuery on XQuery’s terms, and they’re very close. For the
use cases I have translated so far, the results are as follows (sizes in
characters, with whitespace merged and a few other transformations applied
to make the comparison more meaningful):

Problem XMP 1: Ruby 187 (114%), XQuery: 164
Problem XMP 2: Ruby 132 (100%), XQuery: 132
Problem XMP 3: Ruby 115 (103%), XQuery: 112
Problem XMP 4: Ruby 400 (101%), XQuery: 398
Problem XMP 5: Ruby 367 (124%), XQuery: 296
Problem XMP 6: Ruby 220 (104%), XQuery: 211
Problem XMP 7: Ruby 232 (135%), XQuery: 172
Problem XMP 8: Ruby 150 (88%), XQuery: 170
Problem XMP 9: Ruby 157 (129%), XQuery: 122
Problem XMP 10: Ruby 298 (142%), XQuery: 210
Problem XMP 11: Ruby 295 (136%), XQuery: 217
Problem XMP 12: Ruby 457 (118%), XQuery: 387
Problem Tree 1: Ruby 166 (61%), XQuery: 270
Problem Tree 2: Ruby 118 (109%), XQuery: 108
Problem Tree 3: Ruby 133 (101%), XQuery: 132
Problem Tree 4: Ruby 75 (93%), XQuery: 81
Problem Tree 5: Ruby 168 (104%), XQuery: 161
Problem Tree 6: Ruby 255 (69%), XQuery: 369
Total: Ruby 3925 (106%), XQuery: 3712
Median ratio: 104%
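
For anyone who wants to check the arithmetic, the totals and the median can be recomputed from the table above with a few lines of Ruby. This is only a sketch: the helper names are made up, and normalized_size covers just the whitespace merging, not the "few other transformations", which are not spelled out here.

```ruby
# Length of a program after collapsing every whitespace run to one space.
# Assumption: this covers only the whitespace merging mentioned above.
def normalized_size(source)
  source.strip.gsub(/\s+/, " ").length
end

# Ratio of two sizes as a rounded percentage, e.g. 187 vs 164 -> 114.
def percent(ruby_size, xquery_size)
  (100.0 * ruby_size / xquery_size).round
end

def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# [Ruby size, XQuery size] pairs copied from the table above.
sizes = [
  [187, 164], [132, 132], [115, 112], [400, 398], [367, 296], [220, 211],
  [232, 172], [150, 170], [157, 122], [298, 210], [295, 217], [457, 387],
  [166, 270], [118, 108], [133, 132], [75, 81], [168, 161], [255, 369]
]

total_ruby = sizes.sum { |ruby, _| ruby }
total_xquery = sizes.sum { |_, xquery| xquery }
median_ratio = median(sizes.map { |ruby, xquery| 100.0 * ruby / xquery }).round

puts "Total: Ruby #{total_ruby} (#{percent(total_ruby, total_xquery)}%), XQuery: #{total_xquery}"
puts "Median ratio: #{median_ratio}%"
```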

I don’t think any other Ruby library for XML can get anywhere
close to such results. And efficient processing of large XML files?
That’s just a small freebie :D

“Tomasz W.” writes:

port to a faster XML parser (currently
using REXML to get a stream of XML parse events)

You probably want to investigate xmlparser (Expat-based), or one of
the Libxml2-based ones, such as XML::Smart.

Zed S. wrote:

doc = XML.from_url “taw's blog

Especially considering you overload [] to get attributes, why not / to
get children?

http://cherry.rubyforge.org
http://rubyforge.org/projects/cherry/

T.

On Sat, 2006-08-05 at 10:17 +0900, Tomasz W. wrote:

Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl’s XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

doc = XML.from_url “taw's blog
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I’d like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|

}

Especially considering you overload [] to get attributes, why not / to
get children?

On 8/6/06, Zed S. wrote:


}

Especially considering you overload [] to get attributes, why not / to
get children?

Basically because there are three reasonable things to do with node[:foo]:

  1. return attribute :foo
  2. return the first child with tag :foo
  3. return the list of children with tag :foo

Ruby is not Perl, so we cannot have both 2 and 3 folded into one,
and doing only the second or only the third doesn’t sound that convincing ;)
Another issue is that I’d have to overload Array#/ to get (doc/:entry/:link)
working, and that would have a much higher mental cost than adding a long-named
method like #children to it. Or I could use something other than an Array
for sequences of XML nodes (Hpricot does so with Hpricot::Elements),
but that wouldn’t be nice. I’ll look at it again after I have all W3C
XQuery Use Cases recoded :)
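
To make the trade-off concrete, here is a minimal sketch of the "something other than an Array" option. Node, NodeList and #child are hypothetical names used only for illustration; this is neither magic/xml's nor Hpricot's code. [] stays an attribute lookup, the first-child and all-children readings get separate methods, and / is defined on a dedicated list class so Array itself is left alone.

```ruby
# Hypothetical minimal node type; the names are illustrative only.
class Node
  attr_reader :name, :attrs, :kids

  def initialize(name, attrs = {}, kids = [])
    @name = name
    @attrs = attrs
    @kids = kids
  end

  # Reading 1: node[:foo] is an attribute lookup.
  def [](key)
    attrs[key]
  end

  # Reading 2: only the first child with the given tag (nil if none).
  def child(tag)
    kids.find { |kid| kid.name == tag }
  end

  # Reading 3: all children with the given tag, wrapped in a NodeList.
  def children(tag)
    NodeList.new(kids.select { |kid| kid.name == tag })
  end

  # Hpricot-style shorthand for #children.
  def /(tag)
    children(tag)
  end
end

# A dedicated sequence class, so Array does not need a `/` method.
class NodeList
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # Children of every node in the list, flattened into a new NodeList.
  def children(tag)
    NodeList.new(flat_map { |node| node.children(tag).to_a })
  end

  def /(tag)
    children(tag)
  end
end

doc = Node.new(:feed, {}, [
  Node.new(:entry, {}, [Node.new(:link, { :rel => "alternate", :href => "http://example.org/1" })]),
  Node.new(:entry, {}, [Node.new(:link, { :rel => "self", :href => "http://example.org/1.atom" })])
])

hrefs = (doc / :entry / :link).select { |link| link[:rel] == "alternate" }.map { |link| link[:href] }
puts hrefs
```

The cost is visible in the last lines: anything that comes back from Enumerable methods such as select is a plain Array again, so the wrapper has to be reintroduced wherever chaining with / should keep working.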