Progress on "Automated Wrapper Generation for Information Ex

As you may or may not be aware, I am currently working on one of the 10
Google Summer of Code projects[1] mentored by Ruby Central. I am being
mentored by Austin Z…

Wrapper generators allow the extraction of information from semi-structured
documents (like web pages) by using machine learning techniques to generate
extraction rules based on labelled examples. The library I am creating to
accomplish this is ARIEL - A Ruby Information Extraction Library[2].

I’m not quite in a position to make a release and encourage you to give my
project a go for yourself, but I feel it is certainly time to introduce the
Ruby community to some aspects of what I’ve been working on, and what I hope
to produce. I’m soliciting feedback at the end of this post on a number of
issues related to the way you might interact with my library.

== Project description ==
Wrapper generation in this context is the challenge of automatically
generating rules to extract information from a set of documents. The most
obvious use case is probably extracting information from web pages. If, for
instance, I wanted to be able to extract information on products from a
CafePress store, I would do the following:
1) Label the fields I want to extract on several example pages (e.g. price,
description).
2) The wrapper generation system reads these example pages, and searches
for rules that can be used to reliably extract the labelled examples. The
assumption is that these rules will rely on searching for features that are
part of the document’s structure, and so should work on any similar page.
3) A wrapper has been generated - it can now be used to extract CafePress
product information.
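
To make this workflow concrete, here is a rough sketch of how I imagine it
feeling in code. Note that learn_wrapper and extract are hypothetical names
standing in for whatever API the first release settles on:

# Hypothetical workflow sketch - learn_wrapper/extract are placeholder
# names, not Ariel's final API.
require 'ariel'

structure = Ariel::StructureNode.new do |r|
  r.price
  r.description
end

# 1) label example pages by hand, 2) learn rules from those examples...
# wrapper = Ariel.learn_wrapper(structure, %w[example1.html example2.html])
# 3) ...then apply the learned wrapper to any similar page.
# product = wrapper.extract(File.read('unseen_product_page.html'))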

== Progress ==
I’ve had exams up until last week, so I have done much less work on my
project than I expect to do from now until the end of the program. That
said, I am pleased with the progress I have made so far. I have made good
progress with an implementation based on many ideas from this paper[3]. The
basic rule generation system is working, as is the tokenization process and
much of the higher level logic needed for the system to “just work”. To
understand my progress, it’s probably easiest if I explain something about
how the system works. A document’s structure might be defined in the
following way:

doc_tree = Ariel::StructureNode.new do |r|
  r.title
  r.timestamp
  r.author
  r.post_body
  r.comment_list do |c|
    c.comment_author
    c.comment_title
    c.comment_body
    c.comment_date
  end
end

This example could represent a blog post. Each member of the comment_list
is prefixed only to make it clear what I’m referring to. Each of the fields
such as author and post_body is extracted using a pair of rules - a rule to
find the start of the field, and a rule to locate the end of the field. In
addition to these rules, lists have rules that will decompose them into
individual list items (in the example above, into a complete comment).
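
To illustrate the start rule/end rule pairing in the most stripped-down way
possible (the helper logic here is invented for the example, not Ariel’s
real code):

# Invented illustration: a start rule and an end rule each reduce to
# finding a token index, and the field is the token range between them.
tokens      = ["<p>", "by", "alex", "</p>"]
start_index = tokens.index("by") + 1    # a start rule might skip_to "by"
end_index   = tokens.index("</p>") - 1  # an end rule might skip_to "</p>"
p tokens[start_index..end_index].join   # => "alex" (the extracted author)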

By using a tree structure to describe the document, rules can build upon
each other. For instance, the rules in comment_list are applied to the
whole document, but the comment_author rule will then only be applied to an
individual list item. This allows for relatively small, uncomplicated rules
and flexibility (the order of the fields in the document doesn’t matter,
and it doesn’t matter if some are missing). This seems obvious, but a lot
of wrapper generation algorithms I’ve read about have these limitations.

Rules are generated using a sequential covering/separate-and-conquer
algorithm. The basic logic behind the rule learning process is:

  1. Generate a rule that correctly matches as many of the training
     examples as possible.
  2. Remove the training examples covered by the generated rule.
  3. Repeat until all training examples are covered.
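
In plain Ruby, that loop looks something like this sketch (generate_rule
and covers? stand in for the real learner internals):

# Sequential covering sketch. generate_rule and Rule#covers? are
# placeholders; the sketch assumes each generated rule covers at least one
# remaining example, otherwise the loop would not terminate.
def learn_rules(examples)
  rules = []
  until examples.empty?
    rule = generate_rule(examples) # match as many examples as possible
    examples = examples.reject { |example| rule.covers?(example) }
    rules << rule
  end
  rules
end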

At the moment you can use Ariel to describe the structure of a document,
and to generate a start_rule or an end_rule. I am making good progress on
the code to use labelled examples to learn start and end rules for a whole
document tree.

My next milestone will be when the support code is complete for defining
the structure of a document, reading in hand-labelled documents, generating
and storing rules, and then applying these rules to extract data from the
document. At this point I will set up a test framework so I can assess
metrics such as recall and precision, allowing me to implement and refine
aspects of the rule generation process and measure their effect. My current
rule generation algorithm is only a first try; I’ve read a lot about
different approaches, and there are some different methods I’m interested
in trying. I’m on holiday from the 7th-14th, but I’d expect a usable
release some time during the week after.

Feel free to watch my code during development, but please withhold
judgment on Ariel’s effectiveness until it is in a more complete state.
It’s stored in RubyForge Subversion:

svn checkout svn://rubyforge.org/var/svn/ariel

I’ve also got a Trac instance[4] up and running. There’s not loads to see
right now (an early planning document), but there should be more as the
project progresses.

== Tools I’ve been finding useful ==

  • RDoc
  • Ruby-breakpoint (I’ve only recently started using this, but for most of
    my debugging it’s exactly what I’m looking for. I’m looking forward to
    seeing what Florian G. comes up with.)
  • autotest from ZenTest - totally awesome, it’s so useful to see how much
    you’ve broken your code as you write.

== Questions for the Ruby community ==

  1. What form would you like extracted data to take?
     YAML and XML output shouldn’t be a problem, but I’m thinking about the
     outputted Ruby data structure. Supposing that the doc_tree defined
     above were applied to a document, the extracted structure could be
     queried like:

       p root.title.extracted_text
       p root.date.year.extracted_text
       p root.comment_list[3].author.extracted_text

     root.children would produce an array of the title object, author, and
     so on. root.comment_list.children[3] == root.comment_list[3]. Any
     ideas?

  2. How should a document be labelled?
     In order to feed the learner, you must save a copy of the type of
     document you want to extract information from, and then mark up the
     information you want extracted. What markers would be appropriate?
     Something such as <l:comment_list>…</l:comment_list> is a possibility
     (see the example after this list).

  3. Which is better?
     (a) doc_tree = Ariel::StructureNode.new {|r| r.comment_list}
     (b) doc_tree = Ariel::StructureNode.new {|r| r.comments :list}
     (c) doc_tree = Ariel::StructureNode.new {|r| r.list :comments}
     It’s certainly possible for (a) and (b) to both be allowed.
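
To make the labelling question concrete, a hand-labelled training page for
the blog structure above might look something like this, using the <l:…>
markers from question 2 (only one possible convention - the exact syntax is
exactly what I’m asking about):

# A hypothetical hand-labelled training page using <l:...> markers (the
# labelling convention is still an open question):
labelled_page = <<-HTML
  <html><body>
    <h1><l:title>My first post</l:title></h1>
    <p>by <l:author>alex</l:author> at <l:timestamp>2006-07-04</l:timestamp></p>
    <div><l:post_body>Hello, world.</l:post_body></div>
  </body></html>
HTML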

== Contacting me ==
Feel free to send any questions or suggestions about my project or code
either as a response to this thread or off-list to me, as you deem
appropriate.

Finally, I’d like to publicly thank Austin Z. for his support and guidance
thus far. I’d also like to thank my girlfriend for her continued support
and encouragement in my endeavours.

Alex

  1. http://code.google.com/soc/ruby/about.html
  2. http://rubyforge.org/projects/ariel/
  3. http://www.isi.edu/~muslea/PS/jaamas-2k.pdf
  4. http://72.36.206.122/trac/ariel/

On 7/4/06, A. S. Bradbury [email protected] wrote:

> As you may or may not be aware, I am currently working on one of the 10
> Google Summer of Code projects[1] mentored by Ruby Central. I am being
> mentored by Austin Z…

Thanks for the update - sounds pretty interesting. Some thoughts below:

> …
>     c.comment_body
>     c.comment_date
>   end
> end

I like the interface, and the “humane” access it gives to the structure of
the page. It appears to handle single items and lists well.

> At the moment you can use Ariel to describe the structure of a document,
> and to generate a start_rule or an end_rule. I am making good progress on
> the code to use labelled examples to learn start and end rules for a
> whole document tree.

Will I be able to point Ariel at a set of documents, and have it spit out a
reusable class which I can include in another program? For example, I have
a Bible reference parser (i.e. things like Gen 1:1, etc.) that scrapes web
pages to get the actual verses. Right now I use hand-built regular
expressions and some patterns to get the right page for a given book,
chapter and verse. Could I use Ariel to generate the “lookup” code instead?

Another very small program I wrote scrapes calculator results from Google
(e.g. enter “2 + 2” into the Google search box and get back “4”). Here
again I used StringScanner and some regular expressions to get the result
and transform it slightly. Would Ariel be able to help me with that?

> == Questions for the Ruby community ==
>
>   1. What form would you like extracted data to take?
>      YAML and XML output shouldn’t be a problem, but I’m thinking about
>      the outputted Ruby data structure. Supposing that the doc_tree
>      defined above were applied to a document, the extracted structure
>      could be queried like:
>
>        p root.title.extracted_text
>        p root.date.year.extracted_text
>        p root.comment_list[3].author.extracted_text
>
>      root.children would produce an array of the title object, author,
>      and so on. root.comment_list.children[3] == root.comment_list[3].
>      Any ideas?

You could take a page from the Rails “pluralize” methods and also offer:

root.comments[3] == root.comment_list.children[3] == root.comment_list[3]

‘root.comments_list’ would also work.

Of the choices you show, though, I like the less verbose
“root.comment_list[3]” - though if comment_list is more than just a simple
array the interface could get tricky.
>   2. How should a document be labelled?
>      In order to feed the learner, you must save a copy of the type of
>      document you want to extract information from, and then mark up the
>      information you want extracted. What markers would be appropriate?
>      Something such as <l:comment_list>…</l:comment_list> is a
>      possibility.

Have you heard of microformats? Essentially, it’s a way to mark up existing
HTML pages with added attributes to indicate structure. It’s less intrusive
than adding new tags, etc. You can read about them here:

About Microformats – Microformats

And see what Yahoo is doing with them on their local results pages:

http://ylocalblog.com/blog/2006/06/21/we-now-support-microformats/

>   3. Which is better?
>      (a) doc_tree = Ariel::StructureNode.new {|r| r.comment_list}
>      (b) doc_tree = Ariel::StructureNode.new {|r| r.comments :list}
>      (c) doc_tree = Ariel::StructureNode.new {|r| r.list :comments}
>      It’s certainly possible for (a) and (b) to both be allowed.

I prefer (a), though (c) seems intriguing. It sort of implies the interface
for Ariel objects is always the same (get a single item, get a list, etc.)
and you just pass different symbols in. That could make it harder to
introspect against, though. For any of these interfaces you should strive
to make sure actual methods are implemented and it’s not just a lot of
method_missing tricks. Actual implementations are a lot easier to deal with
when meta-programming than hacking around method_missing logic.

Thanks again for the update, hope this feedback helps!

Justin

I’ve had a query via private email that I think others might be interested
in the answer to.

> Hi, I just read the post (and not very carefully cause it’s late) but
> I’m wondering if you’re writing a front end to feed a lot of attributes
> for each token to Weka, something like this guy (student of Kushmerick):
>
> http://www.aidanf.net/taxonomy/term/11/9
I’ve come across this project before, and taken a look at ELIE[1]. He’s
using a different approach that’s more applicable to extracting information
from different types of document to those I intend to work with. ELIE seems
to be designed for extracting information from documents such as seminar
announcements, news articles announcing company acquisitions and other
documents of that sort. The approach I’m implementing relies on finding
some sort of regular structure in the documents. It should be very
effective for extracting information from semi-structured documents, that
is, documents such as web pages (price listings, auctions, search result
pages), but not for retrieving information from free text such as news
articles. ELIE seems to try to learn, at a higher level, some common
features shared by all lecture announcements (for instance), based on the
forms they tend to take.

I might not be being all too clear here, but what I’m trying to say is that
there seems to be a distinct divide between systems like ELIE, RAPIER
(which I’m more familiar with) and Kushmerick’s BWI, and systems such as
STALKER (which I take most of my inspiration from) or SoftMealy. A good
source for a wide-ranging survey of some approaches in the field is
“Information Extraction from World Wide Web: A Survey” (1999) by Line
Eikvil[2].

> or you’re building a sequence matching HTML parser like python
> webstemmer:
>
> Webstemmer - How it works?
>
> Maybe it’ll be clear to me when I reread this tomorrow.
I don’t do any HTML parsing. I’ve checked out webstemmer before too; it’s
really quite different to what I’m aiming for. It uses some heuristics to
compare documents from the same site to try to work out which HTML blocks
represent the main content. It’s different in that it’s not designed to
learn how to extract specific information, just to employ heuristics that
happen to work when trying to separate content from advertisements and
navigation. I like the idea of layout patterns as a sort of fingerprint for
similar documents. I think the explanation of how webstemmer works is
excellent, especially the use of diagrams. I hope to produce some similarly
effective document(s) for the end of SoC.

The key parts of my approach are describing the information to be extracted
in terms of a tree, and building rules that consume all tokens until a
match is found. For example, the rule skip_to “<b>” or the rule skip_to
:html_tag (a wildcard) would skip to the first <b> in the document “This is
an example. Be <b>bold</b>.”

Rules can be composed of multiple skip_to statements (consume until you
find this, then consume more tokens until you find that), and skip_to
statements may have multiple parameters to match a particular sequence of
tokens.
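
To give a flavour of how composed skip_to rules consume a token stream,
here is a toy version (SkipRule is an invented name for this example;
Ariel’s internals will differ):

# Toy rule application: each landmark is consumed in turn, and a landmark
# given as an array must match a consecutive token sequence. SkipRule is
# invented for this illustration - it is not Ariel's real class.
class SkipRule
  def initialize(*landmarks)
    @landmarks = landmarks # e.g. "<b>", or ["advertised", "price:"]
  end

  # Returns the index just past the final landmark, or nil on no match.
  def apply(tokens)
    pos = 0
    @landmarks.each do |landmark|
      seq = Array(landmark)
      pos += 1 until pos + seq.length > tokens.length ||
                     tokens[pos, seq.length] == seq
      return nil if pos + seq.length > tokens.length
      pos += seq.length
    end
    pos
  end
end

rule = SkipRule.new("<table>", ["advertised", "price:"])
p rule.apply(%w[junk <table> advertised price: $15]) # => 4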
I have decided not to explain these ideas in depth at this stage; Section 3
of the STALKER paper referenced in my previous post should help you out if
I’m not making sense. Any further questions or clarifications, don’t
hesitate to ask. I’ll be away and probably without internet access from the
7th-14th, so don’t be offended if I don’t reply in that period.

Alex

  1. http://www.aidanf.net/software/elie-an-adaptive-information-extraction-system
  2. http://sherry.ifi.unizh.ch/eikvil99information.html

On 7/5/06, Justin B. [email protected] wrote:

> I like the interface, and the “humane” access it gives to the structure
> of the page. It appears to handle single items and lists well.

Minor clarification here, because a lot of the examples for this stuff
reference the “page”; strictly speaking, it’s “the ‘humane’ access it gives
to the structure of the document”. Logically, this could be used to go
through any semi-structured document (YAML, OOo files, etc.) although some
formats (e.g., OOo) may require additional work to have the markup be
clean.

> Will I be able to point Ariel at a set of documents, and have it spit out
> a reusable class which I can include in another program? For example, I
> have a Bible reference parser (i.e. things like Gen 1:1, etc.) that
> scrapes web pages to get the actual verses. Right now I use hand-built
> regular expressions and some patterns to get the right page for a given
> book, chapter and verse. Could I use Ariel to generate the “lookup” code
> instead?

That is my understanding of Alex’s project goal. Remember that there’s
a training phase involved.

> About Microformats – Microformats

Microformats are interesting, but not 100% applicable. One of the
reasons I pushed as hard as I did to make sure that Alex’s project was
included in Ruby Central’s project list was that I saw it as more than
just web scraping.

-austin

On Wednesday 05 July 2006 16:50, Justin B. wrote:

> Thanks for the update - sounds pretty interesting. Some thoughts below:
>
> Will I be able to point Ariel at a set of documents, and have it spit out
> a reusable class which I can include in another program? For example, I
> have a Bible reference parser (i.e. things like Gen 1:1, etc.) that
> scrapes web pages to get the actual verses. Right now I use hand-built
> regular expressions and some patterns to get the right page for a given
> book, chapter and verse. Could I use Ariel to generate the “lookup” code
> instead?

I think I understand what you’re asking here. Given a page such as
http://www.blueletterbible.org/Gen/Gen001.html - you might define the
structure such as:

doc_tree = Ariel::StructureNode.new do |r|
  r.verse_list do |v|
    v.reference
    v.text
  end
end

After training, given a chapter from that site, the generated rules could
be applied to produce a list of reference and verse pairs. So you could
then just search that list for the member of verse_list with
reference==“Gen 1:1”. Is that basically what you’re thinking? If so, Ariel
should be very suited to that task. That actually looks like quite a nice
page to use in testing…
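
For completeness, the lookup itself would then just be a search over the
extracted tree in plain Ruby (assuming the query interface sketched in my
first post; root stands for the tree returned by applying the learned
wrapper):

# root is assumed to be the extracted document tree.
verse = root.verse_list.children.find do |v|
  v.reference.extracted_text == "Gen 1:1"
end
puts verse.text.extracted_text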

> Another very small program I wrote scrapes calculator results from Google
> (e.g. enter “2 + 2” into the Google search box and get back “4”). Here
> again I used StringScanner and some regular expressions to get the result
> and transform it slightly. Would Ariel be able to help me with that?

Certainly.

> …and so on. root.comment_list.children[3] == root.comment_list[3]. Any
> ideas?
>
> You could take a page from the Rails “pluralize” methods and also offer:
>
> root.comments[3] == root.comment_list.children[3] == root.comment_list[3]
>
> ‘root.comments_list’ would also work.

It works OK for Rails, but I’m not sure that automatic pluralization is
worth the hassle. This is why it probably makes sense to let people do
something to the effect of Ariel::StructureNode.new {|r| r.comments :list}.

> Of the choices you show, though, I like the less verbose
> “root.comment_list[3]” - though if comment_list is more than just a
> simple array the interface could get tricky.

In the case of lists, I’m thinking that comment_list[3] is just a more
convenient form of comment_list.children[3] - as each extracted comment is
a child of the extracted list.
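
Implementation-wise, that sugar could be as simple as delegating the index
operator to children - a sketch, not Ariel’s actual code:

# Sketch only: an extracted list node forwards indexing to its children,
# making comment_list[3] equivalent to comment_list.children[3].
class ExtractedListNode
  attr_reader :children

  def initialize(children = [])
    @children = children
  end

  def [](index)
    children[index]
  end
end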

> HTML pages with added attributes to indicate structure. It’s less
> intrusive than adding new tags, etc. You can read about them here:
>
> About Microformats – Microformats
>
> And see what Yahoo is doing with them on their local results pages:
>
> http://ylocalblog.com/blog/2006/06/21/we-now-support-microformats/

Yes, I’m familiar with microformats. As Austin says, perhaps they’re not
right for my project as it takes an approach that should be capable of
dealing with a wide variety of semi-structured documents. Is being
intrusive a problem? Perhaps using non-XML-like labels has advantages;
they’d stick out from the other tags. Perhaps something such as
<<comment_list>>…<</comment_list>> would stick out more, and I suppose
people could specify their own separators (e.g.
<comment_list>…</comment_list>). I guess I just need a sensible convention
that people should be able to modify if they need to (if, for instance,
parts of the document look like valid labelled examples).

> introspect against, though. For any of these interfaces you should strive
> to make sure actual methods are implemented and it’s not just a lot of
> method_missing tricks. Actual implementations are a lot easier to deal
> with when meta-programming than hacking around method_missing logic.

I’m also intrigued by (c) (Austin suggested it before I posted this report
to the list). If I stick with form (a), then I think I prefer allowing (c)
as an alternative form rather than (b). r.comment_list and r.list :comments
seem more closely linked.

> Thanks again for the update, hope this feedback helps!
>
> Justin

It certainly does, thank you.

Alex

On 7/5/06, A. S. Bradbury [email protected] wrote:

> …
>   end
> end
>
> After training, given a chapter from that site, the generated rules could
> be applied to produce a list of reference and verse pairs. So you could
> then just search that list for the member of verse_list with
> reference==“Gen 1:1”. Is that basically what you’re thinking? If so,
> Ariel should be very suited to that task. That actually looks like quite
> a nice page to use in testing…

Yes, that’s even more than I expected. I especially like how it rolls up
the verses.

> Yes, I’m familiar with microformats. As Austin says, perhaps they’re not
> right for my project as it takes an approach that should be capable of
> dealing with a wide variety of semi-structured documents. Is being
> intrusive a problem?

If you did have a “microformats” (or similarly non-intrusive) means of
marking up a page, it could allow anyone to provide an “ariel” interface to
their data right in the page. Now, I don’t know if anyone would do that -
but it sounds cool to just point Ariel at a page/site that is compliant and
basically have a data-access layer for it. Pretty blue-sky though.

Justin

On Wednesday 05 July 2006 18:59, Justin B. wrote:

> If you did have a “microformats” (or similarly non-intrusive) means of
> marking up a page, it could allow anyone to provide an “ariel” interface
> to their data right in the page. Now, I don’t know if anyone would do
> that - but it sounds cool to just point Ariel at a page/site that is
> compliant and basically have a data-access layer for it. Pretty blue-sky
> though.

Certainly it would be great if I could extract data from any web page in
that fashion, but it sounds like we’re really dreaming of the semantic web.
I seem to remember enjoying listening to the webcast of this Tim
Berners-Lee presentation on the future of the web last time I was on
holiday:
http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20060314_139

Alex

On Wednesday 05 July 2006 16:50, Justin B. wrote:

> On 7/4/06, A. S. Bradbury [email protected] wrote:
>
> introspect against, though. For any of these interfaces you should strive
> to make sure actual methods are implemented and it’s not just a lot of
> method_missing tricks. Actual implementations are a lot easier to deal
> with when meta-programming than hacking around method_missing logic.

I sent my last email a little early; I was going to mention method_missing.
The way I’m using method_missing right now is to check if it’s a normal
node or a list, and then add the correct type of child accordingly.
StructureNode inherits from OpenStruct, which very handily takes care of
adding methods.
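
For the curious, the trick is along these lines - a simplified sketch of
the idea rather than the actual Ariel source (in particular, the
normal-node/list distinction is left out):

require 'ostruct'

# Simplified sketch: unknown method calls on the block parameter become
# child nodes, and OpenStruct takes care of defining the accessors.
class StructureSketch < OpenStruct
  def initialize(&block)
    super()
    block.call(self) if block
  end

  def method_missing(name, *args, &block)
    return super if name.to_s[-1, 1] == "=" # let OpenStruct handle setters
    child = StructureSketch.new(&block)     # a block means a nested structure
    send("#{name}=", child)                 # OpenStruct defines the accessor
    child
  end
end

tree = StructureSketch.new { |r| r.title; r.comment_list { |c| c.comment_author } }
p tree.comment_list.comment_author          # => a StructureSketch node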

On a side note, my initial implementation used instance_eval to allow you
to do:

doc_tree = Ariel::StructureNode.new do
  title
  timestamp
  author
  post_body
  comment_list do
    comment_author
    comment_title
    comment_body
    comment_date
  end
end

Kind of like Markaby. Using instance_eval makes the implementation somewhat
messier and harder to debug, but this form does seem a little prettier. On
the other hand, defining a context such as r.comment_list do |c| might make
it more likely people will write what they mean when describing a
structure? Perhaps not.
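
Relative to the StructureSketch above, the instance_eval flavour only
changes how the block is run (again purely illustrative, not the real
implementation):

# Only the block invocation changes: the block is evaluated with the new
# node as self, so bare calls (title, comment_list, ...) hit
# method_missing directly.
def initialize(&block)
  super()
  instance_eval(&block) if block
end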

Alex