"Database" as a collection of XML docs

Hello everyone,

A new project I’m starting on has a “database” consisting of many 10s
of thousands of XML documents. They all conform to a common schema.
The project consists pretty much exclusively of searching and
presenting existing data - there’s no need (for the foreseeable future)
to be able to input or update XML documents in the database. Unlike
(say) blog data, where there’s typically a date, name and a bunch of
text, each document is fairly highly structured - there are quite a few
separate XML attributes and entities within each doc - and the format
of the XML doc closely relates to how I’ll want to present it.

I could walk through each document, parse it into its component
pieces, load the content into a relational database, then use
ActiveRecord to extract the content and “reassemble” it back into
XHTML for presentation purposes. However, it strikes me that there are
advantages in keeping it as a collection of XML docs; for example,
when it comes to presenting the data, I could just run an existing XML
doc against a set of CSS definitions and largely make the “views”
trivial. I could also potentially make use of XSL to produce
different report formats. Overall, converting the data from XML into
a relational format, then turning it back into XML/XHTML for
presentation purposes seems a bit dumb.

Questions:

  • has anyone tried using Rails with a set of XML documents as “the
    database”? If it’s possible, what are the limitations?
  • what have you used to index content in the XML docs, and what are
    the pros and cons of the approach you used?
  • is it possible to use ActiveRecord to search XML docs on one or more
    key values? Is it possible to do pattern matches (i.e. something
    equivalent to LIKE in SQL)?
  • is the whole idea dumb, and should I just load the data into e.g.
    Postgres and be done with it?

Thanks in advance for any suggestions. While I’m generally
comfortable with Rails, dealing with a database of XML docs is new for
me and I’m not quite sure how best to approach it.

Dave M.

On 3/14/06, David M. [email protected] wrote:

A new project I’m starting on has a “database” consisting of many 10s
of thousands of XML documents. They all conform to a common schema.

If I were faced with this situation, I’d ditch ActiveRecord. Your models
don’t have to base off of it; I’d write a new model object that can
interface with your XML files.
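
For what it's worth, a model that doesn't inherit from ActiveRecord could be as small as the sketch below. It assumes one file per record, named after its id, and uses REXML. The class name `XmlRecord`, the `records` directory, and the `<record>`/`<title>` element names are invented for illustration; the real schema would dictate them.

```ruby
require 'rexml/document'

# Hypothetical non-ActiveRecord model: one XML file per record.
class XmlRecord
  attr_reader :id, :doc

  def initialize(id, doc)
    @id = id
    @doc = doc
  end

  # Load <dir>/<id>.xml (this directory layout is an assumption).
  # File.basename keeps an id like "../../etc/passwd" inside the directory.
  def self.find(id, dir = 'records')
    path = File.join(dir, "#{File.basename(id.to_s)}.xml")
    new(id, REXML::Document.new(File.read(path)))
  end

  # Text of the first element matching an XPath, or nil if there is none.
  def text_at(xpath)
    node = REXML::XPath.first(@doc, xpath)
    node && node.text
  end

  # Example accessor; <title> is an invented element name.
  def title
    text_at('/record/title')
  end
end
```

A controller could then call something like `XmlRecord.find(params[:id]).title` much as it would with an ActiveRecord model.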

Good luck!

Josh on Rails wrote:

On 3/14/06, David M. [email protected] wrote:

A new project I'm starting on has a "database" consisting of many 10s
of thousands of XML documents.  They all conform to a common schema.

If I were faced with this situation, I’d ditch ActiveRecord. Your models
don’t have to base off of it; I’d write a new model object that can
interface with your XML files.

I’d take a look at the schema, and see if it could easily be mapped to a
database. If there’s not much need for insertion, then conversion would
be a one-time affair, and there’s a lot to gain by doing it that way -
not least in terms of not having to come up with a whole new query
system. I have a sneaking suspicion that acts_as_tree and polymorphic
associations would be extremely handy in such a situation.
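
As a rough idea of the extraction half of such a one-time conversion, the sketch below flattens a document into rows with `id` and `parent_id`, which is the adjacency-list shape that acts_as_tree works with by default. The function name `rows_from_xml` and the row layout are assumptions made for illustration; inserting the rows into a table is left out.

```ruby
require 'rexml/document'

# Flatten an XML document into adjacency-list rows (id, parent_id, name, text).
# Ids are assigned in document order, starting at 1 for the root element.
def rows_from_xml(xml)
  rows = []
  walk = lambda do |element, parent_id|
    id = rows.size + 1
    text = element.text.to_s.strip
    rows << { id: id, parent_id: parent_id, name: element.name,
              text: text.empty? ? nil : text }
    # Recurse into child elements, passing this row's id as their parent.
    element.elements.each { |child| walk.call(child, id) }
  end
  walk.call(REXML::Document.new(xml).root, nil)
  rows
end
```

Each hash could then be handed to whatever model ends up backing the table. Attributes are ignored here and would need their own columns or rows.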

Sounds like you could use a native XML DB like eXist
(http://exist.sourceforge.net/). Not sure how you’d interface it with
Rails, but it does run as a server and takes XQuery queries. I’d think
you’d just parse the XML results from the query into REXML and have
your models get information from the REXML.
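
A small sketch of that last step, assuming the query result arrives as an XML string: each model object wraps one REXML element and reads its values from it. The class name `ResultRecord` and the `<results>`/`<record>`/`<title>` names are made up; the real result format depends on the server and the query.

```ruby
require 'rexml/document'

# Hypothetical model backed by a REXML element taken from a query result.
class ResultRecord
  def initialize(element)
    @element = element
  end

  # Value of the id attribute, as a string (nil if the attribute is absent).
  def id
    @element.attributes['id']
  end

  # Text of the first <title> child, or nil if there is none.
  def title
    @element.elements['title']&.text
  end

  # Build one ResultRecord per <record> element found in the result string.
  def self.from_results(xml)
    doc = REXML::Document.new(xml)
    REXML::XPath.match(doc, '//record').map { |element| new(element) }
  end
end
```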

b

Thanks in advance for any suggestions. While I’m generally
comfortable with Rails, dealing with a database of XML docs is new for
me and I’m not quite sure how best to approach it.

Not dumb. I agree with Josh, all you need is a new model layer. As
for getting LIKE functionality, perhaps you could use calls to grep?
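
If shelling out to grep feels awkward, roughly the same thing can be done in Ruby itself. The sketch below turns a LIKE pattern (`%` and `_`) into a Regexp and scans files line by line. The helper names `like_to_regexp` and `grep_like` are invented, and matching whole lines of raw XML is an assumption; it knows nothing about elements or attributes.

```ruby
# Translate a SQL LIKE pattern into an anchored, case-sensitive Regexp:
# % matches any run of characters, _ matches exactly one character.
def like_to_regexp(pattern)
  parts = pattern.each_char.map do |ch|
    case ch
    when '%' then '.*'
    when '_' then '.'
    else Regexp.escape(ch)
    end
  end
  Regexp.new("\\A#{parts.join}\\z")
end

# Return the paths of *.xml files in dir that have at least one line
# matching the LIKE pattern (each line is matched as a whole).
def grep_like(dir, pattern)
  regexp = like_to_regexp(pattern)
  Dir.glob(File.join(dir, '*.xml')).sort.select do |path|
    File.foreach(path).any? { |line| regexp.match?(line.chomp) }
  end
end
```

Something like `grep_like('docs', '%Smith%')` would then list the files mentioning "Smith". This reads every file on every search, so it only makes sense for small collections or as a fallback next to a real index.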

-Derrick S.

I’ve got a similar situation going on with a legacy application I’m
trying to port to Rails. The legacy app is backed by an Oracle
database. Oracle supports XMLTYPE columns for storing XML content in
the database. It also provides powerful querying capabilities based
on XPATH.

I’ve made a patch to the Rails oracle connector that allows AR to
return XMLTYPE columns as strings. So getting the data out is fairly
easy. I’ll be releasing my “as_xml” plugin in a few weeks. It takes
a string return value, and parses it into an XML Document. This makes
displaying it fairly easy as well.

The bit that’s not so trivial is the searching. But if you are using
Oracle I’d recommend writing a stored procedure to do the actual
query, and add a ‘xmlfind’ method to AR. You can then call the stored
procedure using ‘connection.execute()’ and assemble the results.

It should be noted that Oracle is very pricey, so this is not a cheap
solution. They have just started offering a free version of 10g, but
it’s limited to one machine and a gig of memory.

Microsoft’s SQL Server just started offering a similar feature, so
that may be an option as well.

The big caveat here is, this is not the ‘Rails Way’. It may solve
your problem, but you will lose some elegance.

-wilig

How about importing the documents into an XML database such as eXist
(http://wiki.exist-db.org)? It’s a Java server, but it supports XML-RPC
and REST, so talking to it from Ruby should be fairly easy, especially
since you only want to read (people have built connectors for PHP,
Zope, Cold Fusion and Perl). Querying is done using XPath 2 and XQuery
(with extensions), and it is pretty powerful.
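
A read-only query over REST might look something like the sketch below. The host, port, `/exist/rest/db` base path and the `_query`/`_howmany` parameter names reflect my understanding of eXist's REST interface and should be checked against the documentation for the installed version. The helper names are invented.

```ruby
require 'uri'
require 'net/http'

# Build the URL for a REST query against a collection.
# Host, port, base path and parameter names are assumptions to verify.
def exist_query_uri(collection, xquery, host: 'localhost', port: 8080)
  query = URI.encode_www_form('_query' => xquery, '_howmany' => 10)
  URI::HTTP.build(host: host, port: port,
                  path: "/exist/rest/db/#{collection}", query: query)
end

# Fetch the raw XML result; reading only needs a plain GET.
def exist_query(collection, xquery)
  Net::HTTP.get(exist_query_uri(collection, xquery))
end
```

The string returned by `exist_query` could then be parsed with REXML as suggested earlier in the thread.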

/Marcus

On Mar 14, 2006, at 6:39 AM, David M. wrote:

of the XML doc closely relates to how I’ll want to present it.
a relational format, then turning it back into XML/XHTML for
presentation purposes seems a bit dumb.

Questions:

  • has anyone tried using Rails with a set of XML documents as “the
    database”? If it’s possible, what are the limitations?

I’ve been doing this for years; I don’t remember precisely when I started,
but it was just after the first SAX parsers became available. Initially in Java, now in
Common Lisp and Ruby/RoR. The biggest limitation to this is the
impact of the number of files in a directory on filesystem
performance… negligible/manageable on linux and OS X, not so sure
on windows. In Java, I ended up using either Perst or JDBM rather
than the filesystem directly. Berkeley DB would be a similar kind of
option (make sure you use a transactional thing or you stand to lose
everything).
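
To illustrate the transactional idea in Ruby without any of the stores named above, here is a sketch using PStore from the standard library. This is a swapped-in stand-in, not Perst, JDBM or Berkeley DB, and the class name `XmlStore` is invented. PStore reads and rewrites its whole file per transaction, so it only demonstrates the pattern and would not cope with hundreds of thousands of documents.

```ruby
require 'pstore'

# Keep XML documents in one PStore file, keyed by id, rather than as
# one file per document in a very large directory.
class XmlStore
  def initialize(path)
    @store = PStore.new(path)
  end

  # The write happens inside a transaction: the block either commits
  # as a whole or the stored data is left as it was.
  def put(id, xml)
    @store.transaction { @store[id] = xml }
  end

  # Read-only transaction; returns nil for an unknown id.
  def get(id)
    @store.transaction(true) { @store[id] }
  end
end
```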

One Java application we wrote generates about 600,000 XML documents
per year (from my fallible memory, but of that order) and, including
historical data, there are about 6 years’ worth in there now.

  • what have you used to index content in the XML docs, and what are
    the pros and cons of the approach you used?

This was tricky. In Java I used an approach that used indexes
(implemented in Perst or JDBM) and text indexes using Lucene.
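
For a feel of what a key-value index over the documents amounts to, here is a sketch in plain Ruby. It is not the Perst/JDBM/Lucene setup described above, just an in-memory Hash from the value found at some XPath to the files containing it. The function name `build_index` and the one-directory layout are assumptions.

```ruby
require 'rexml/document'

# Build an in-memory index: text found at `xpath` => list of file names.
# Files are visited in sorted order, so each list is sorted as well.
def build_index(dir, xpath)
  index = Hash.new { |hash, key| hash[key] = [] }
  Dir.glob(File.join(dir, '*.xml')).sort.each do |path|
    doc = REXML::Document.new(File.read(path))
    REXML::XPath.each(doc, xpath) do |node|
      value = node.text.to_s.strip
      index[value] << File.basename(path) unless value.empty?
    end
  end
  index
end
```

With a read-only collection the index could be built once and saved to disk, since nothing ever invalidates it.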

I’ve not implemented indexing yet in the Ruby version of xampl, but
that is coming fairly soon since I am beginning to wish I had it in a
project I’m working on now.

We did an experiment keeping the indexes in mysql but I wasn’t
particularly happy. I have a couple of ideas that might help. I’ll be
looking into ActiveRecord for indexing in the Ruby version of xampl.

  • is it possible to use ActiveRecord to search XML docs on one or more
    key values? Is it possible to do pattern matches (i.e. something
    equivalent to LIKE in SQL)?

Well, sure. If you’ve only got one or two key values then there are
lots of options.

  • is the whole idea dumb, and should I just load the data into e.g.
    Postgres and be done with it?

No it is not dumb. I don’t know about in Ruby but in Java it didn’t
take a very complex XML document before the file system blew the DB
away in performance (and nothing came close to Perst or JDBM).

The same guy that wrote Perst wrote a similar thing for dynamic
languages including Ruby. I’ve not looked at it because the last time
I tried to compile it I couldn’t (but that could have been me – the
guy that wrote this stuff is really quite good and his documentation
is good… this is the same guy that wrote GOODS and a couple of the
better known main-memory database systems).

Cheers,
Bob





Bob H. – blogs at http://www.recursive.ca/hutch/
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
xampl for Ruby – http://rubyforge.org/projects/xampl/