As you may or may not be aware, I am currently working on one of the 10
Google Summer of Code projects[1] mentored by Ruby Central. I am being
mentored by Austin Z…
Wrapper generators allow the extraction of information from
semi-structured documents (like web pages) by using machine learning
techniques to generate extraction rules based on labelled examples. The
library I am creating to accomplish this is ARIEL - A Ruby Information
Extraction Library[2].
I’m not quite in a position to make a release and encourage you to give
my project a go for yourself, but I feel it is certainly time to
introduce the Ruby community to some aspects of what I’ve been working
on, and what I hope to produce. I’m soliciting feedback at the end of
this post on a number of issues related to the way you might interact
with my library.
== Project description ==
Wrapper generation in this context is the challenge of automatically
generating rules to extract information from a set of documents. The
most obvious use case is probably extracting information from web
pages. If, for instance, I wanted to be able to extract information on
products from a Cafepress store, I would do the following:
1) Label the fields I want to extract on several example pages (e.g.
price, description).
2) The wrapper generation system now reads these example pages, and
searches for rules that can be used to reliably extract the labelled
examples. The assumption is that these rules will rely on searching for
features that are part of the document’s structure, and so should work
on any similar page.
3) A wrapper has been generated - it can now be used to extract
Cafepress product information.
== Progress ==
I’ve had exams up until last week, so I have done much less work on my
project than I expect to do from now until the end of the program. That
said, I am pleased with the progress I have made so far. I have made
good progress with an implementation based on many ideas from this
paper[3]. The basic rule generation system is working, as is the
tokenization process and much of the higher level logic needed for the
system to “just work”. To understand my progress, it’s probably easiest
if I explain something about how the system works. A document’s
structure might be defined in the following way:

  doc_tree = Ariel::StructureNode.new do |r|
    r.title
    r.timestamp
    r.author
    r.post_body
    r.comment_list do |c|
      c.comment_author
      c.comment_title
      c.comment_body
      c.comment_date
    end
  end
This example could represent a blog post. Each member of the
comment_list is prefixed with comment_ only to make it clear what I’m
referring to. Each field, such as author and post_body, is extracted
using a pair of rules - a rule to find the start of the field, and a
rule to locate the end of the field. In addition to these rules, lists
have rules that will decompose them into individual list items (in the
example above - into a complete comment).
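To make the start/end rule idea concrete, here is a toy illustration
(not Ariel’s code - Ariel’s whole point is to learn landmarks like
these from labelled examples rather than have them written by hand):

```ruby
# A hand-written stand-in for a learned rule pair. The start rule
# skips forward to just after a landmark; the end rule stops just
# before another landmark.
page = '<div class="item"><span class="price">$12.99</span></div>'

start_landmark = '<span class="price">'
end_landmark   = '</span>'

from  = page.index(start_landmark) + start_landmark.length
to    = page.index(end_landmark, from)
price = page[from...to]
puts price  # => $12.99
```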
By using a tree structure to describe the document, rules can build upon
each
other. For instance - the rules in comment_list are applied to the whole
document, but the comment_author rule will then only be applied to an
individual list item. This allows for relatively small, uncomplicated
rules
and flexibility (the order of the fields in the document doesn’t matter,
and
it doesn’t matter if some are missing). This seems obvious, but a lot
of
wrapper generation algorithms I’ve read about have these limitations.
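The scoping described above can be mimicked with a toy two-level
extraction (again, not Ariel’s implementation - just the shape of the
idea, with hand-written rules standing in for learned ones):

```ruby
# A list-level "decomposition" rule carves the list region into items;
# field rules then run only inside each item, so the order of fields
# within an item doesn't matter and missing fields don't break anything.
doc = "<c>Alice: hi</c><c>Bob: hello there</c>"

# Decompose the list into individual comment items.
items = doc.scan(%r{<c>(.*?)</c>}).flatten

# Apply a field "rule" per item rather than to the whole document.
authors = items.map { |item| item.split(": ", 2).first }
p authors  # => ["Alice", "Bob"]
```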
Rules are generated using a sequential covering (separate-and-conquer)
algorithm. The basic logic behind the rule learning process is:

- Generate a rule that correctly matches as many of the training
examples as possible.
- Remove the training examples covered by the generated rule.
- Repeat until all training examples are covered.
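As a generic sketch of that loop (the candidate “rules” here are plain
string predicates, not Ariel’s landmark rules):

```ruby
# Sequential covering: greedily pick the candidate that covers the most
# remaining examples, remove what it covers, and repeat.
def sequential_covering(examples, candidates)
  learned   = []
  remaining = examples.dup
  until remaining.empty?
    best    = candidates.max_by { |r| remaining.count { |ex| r.call(ex) } }
    covered = remaining.select { |ex| best.call(ex) }
    break if covered.empty? # no candidate matches anything left
    learned << best
    remaining -= covered
  end
  learned
end

# Toy candidates: simple predicates over strings.
candidates = [->(s) { s.start_with?("$") }, ->(s) { s.end_with?("%") }]
learned = sequential_covering(["$10", "$25", "5%"], candidates)
p learned.size  # => 2
```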
At the moment you can use Ariel to describe the structure of a
document, and to generate a start_rule or an end_rule. I am making good
progress on the code to use labelled examples to learn start and end
rules for a whole document tree.
My next milestone will be when the support code is complete for defining
the structure of a document, reading in hand-labeled documents,
generating
and storing rules, and then applying these rules to extract data from
the
document. At this point I will set up a test framework to assess
metrics such as recall and precision, so I can implement and refine
aspects of the rule generation process and measure their effect. My
current rule generation algorithm is only a first try; I’ve read a lot
about different approaches, and there are several other methods I’m
interested in trying.
I’m on holiday from 7th-14th, but I’d expect a usable release some time
during the week after.
Feel free to watch my code during development, but please withhold
judgments on Ariel’s effectiveness until it is in a more complete
state. It’s stored in Rubyforge Subversion:
svn checkout svn://rubyforge.org/var/svn/ariel
I’ve also got a Trac instance[4] up and running. There’s not a lot to
see right now (just an early planning document), but there should be
more as the project progresses.
== Tools I’ve been finding useful ==
- RDoc
- Ruby-breakpoint (I only recently started using this, but for most of
my debugging it’s exactly what I’m looking for. I’m looking forward to
seeing what Florian G. comes up with.)
- autotest from ZenTest - totally awesome, it’s so useful to see how
much you’ve broken your code as you write.
== Questions for the Ruby community ==
- What form would you like extracted data to take?
YAML and XML output shouldn’t be a problem, but I’m thinking about the
outputted Ruby data structure. Supposing that the doc_tree defined
above were applied to a document, the extracted structure could be
queried like:

  p root.title.extracted_text
  p root.date.year.extracted_text
  p root.comment_list[3].author.extracted_text

root.children would produce an array of the title object, the author
object, and so on, and root.comment_list.children[3] ==
root.comment_list[3]. Any ideas?
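For comparison, a YAML rendering of that kind of tree is easy if the
extracted structure can be reduced to plain Hashes and Arrays (a sketch
only - the real extracted objects would be Ariel’s, not Hashes):

```ruby
require 'yaml'

# A plain-Hash stand-in for an extracted document tree.
extracted = {
  "title"        => "My first post",
  "comment_list" => [
    { "comment_author" => "Alice", "comment_body" => "Nice post!" },
  ],
}
puts extracted.to_yaml
```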
- How should a document be labeled?
In order to feed the learner, you must save a copy of the type of
document you want to extract information from, and then mark up the
information you want extracted. What markers would be appropriate?
Something such as <l:comment_list>…</l:comment_list> is a possibility.
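For instance, a labelled copy of a blog post in that style might look
like the following (purely illustrative - the tag names just mirror the
doc_tree defined above):

```
<l:title>My first post</l:title>
<l:post_body>Hello world...</l:post_body>
<l:comment_list>
  <l:comment_author>Alice</l:comment_author>
  <l:comment_body>Nice post!</l:comment_body>
</l:comment_list>
```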
Which is better?
(a). doc_tree = Ariel::StructureNode.new {|r| r.comment_list}
(b). doc_tree = Ariel::StructureNode.new {|r| r.comments :list}
(c) doc_tree = Ariel::StructureNode.new {|r| r.list :comments}
It’s certainly possible for (a) and (b) to both be allowed.
== Contacting me ==
Feel free to send any questions or suggestions about my project or code
either as a response to this thread or off-list to me, as you deem
appropriate.
Finally, I’d like to publicly thank Austin Z. for his support and
guidance thus far. I’d also like to thank my girlfriend for her
continued
support and encouragement in my endeavours.
Alex