= Ariel release 0.0.1
gem install ariel (if it has not yet propagated, either wait or grab the gem
from my RubyForge page and install that).
This is the first public release of Ariel - A Ruby Information Extraction
Library. See my previous post, ruby-talk:20014, for
background information. This release supports defining a tree document
structure and learning rules to extract each node of this tree. Handling
list extraction and learning is not yet implemented, and is the next
immediate priority. See the examples directory included in this release, and
see below for discussion of the included examples. Rule learning is implemented
and appears to work well, but many refinements are possible. Look out for
more updates and a new release shortly.
== About Ariel
Ariel intends to assist in extracting information from semi-structured
documents including (but not in any way limited to) web pages. Although you
may use libraries such as Hpricot or Rubyful Soup, or even plain regular
expressions, to achieve the same goal, Ariel approaches the problem very
differently. Ariel relies on the user labeling examples of the data they wish
to extract, and then finds patterns across several such labeled examples in
order to produce a set of general rules for extracting this information from
any similar document. It uses the MIT license.
This release includes two examples in the examples directory (which should
be in the directory to which RubyGems installed ariel). The first is the
google_calculator directory (inspired by Justin B.'s post to my
progress report). The structure is very simple: a calculation is extracted
from the page, and then the actual result is extracted from that calculation.
3 labeled examples are included. Ariel reads each of these, tokenizes it,
and extracts each label. 4 sets of rules are learnt:
- Rules to locate the start of the calculation in the original document.
- Rules to locate the end of the calculation in the original document
(applied from the end of the document).
- Rules to locate the start of the result of the calculation from the
calculation.
- Rules to locate the end of the result of the calculation from the
calculation (applied from the end of the calculation).
Take note of the third and fourth sets: this is the advantage of treating a
document as a tree in this way. Deeply nested elements can be located by
generating a series of simple rules, rather than generating a rule with
complexity that increases with each level. Sets of rules are generated because
it may not be possible to generate a single rule that will catch all cases. A
rule is found that matches as many of the examples as possible (and fails on
the rest), those examples are then removed, and a rule is found that will match
as many of the remaining examples as possible, and so on. When it comes to
applying these learnt rules, the rules are applied in order until there is a
rule that matches.
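The covering loop described above can be sketched roughly as follows. This is only an illustration, not Ariel's actual code: the names learn_rule_set and first_matching_rule are made up, a "rule" is assumed to be any object that responds to call(example) and returns true on a match, and the candidates are given up front rather than generated and refined as Ariel does.

```ruby
# Hypothetical sketch of the covering loop; not Ariel's implementation.

# Learn an ordered list of rules that together cover the examples.
def learn_rule_set(examples, candidates)
  remaining = examples.dup
  rules = []
  until remaining.empty?
    # Pick the candidate that matches the most remaining examples.
    best = candidates.max_by { |rule| remaining.count { |ex| rule.call(ex) } }
    break if best.nil? # no candidates at all
    covered = remaining.select { |ex| best.call(ex) }
    break if covered.empty? # nothing left that any candidate can match
    rules << best
    remaining -= covered # drop the covered examples and go again
  end
  rules
end

# Apply the rules in order and return the first one that matches (or nil).
def first_matching_rule(rules, example)
  rules.find { |rule| rule.call(example) }
end

# Made-up examples and candidate rules, purely for illustration.
examples = ["<b>1 + 1</b>", "<b>2 * 3</b>", "<i>5 / 2</i>"]
candidates = [
  ->(text) { text.start_with?("<b>") },
  ->(text) { text.start_with?("<i>") }
]

rules = learn_rule_set(examples, candidates)
puts rules.length
```

Here the first candidate covers two examples and is chosen first; the second candidate is then learnt for the one example left over, so the rule set ends up holding both, in that order.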
To see this example for yourself just execute structure.rb in the
examples/google_calculator directory to create a locally writable
structure.yaml. Then do:
  ariel -D -m learn -s structure.yaml -d /examplepath/labeled
You’ll have to wait a while (see my note about performance below). At the end
the learnt rules will be printed in YAML format, and structure.yaml will be
updated to include these rules. Apply these learnt rules to some unlabeled
documents by doing:
  ariel -D -m extract -s structure.yaml -d /examplepath/unlabeled
You should see the results of a successful extraction printed to your
terminal, such as this one:
  Results for unlabeled/2:
  calculation: 3.5 U.S. dollars = 1.8486241 British pounds
  result: 1.8486241 British pounds
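As a rough illustration of how a nested result like this one can come from simple per-level rules, the sketch below extracts the calculation from a whole page and then extracts the result from the calculation alone. Everything in it is an assumption made for the example: the helper names, the idea of a rule as a single landmark token, and the made-up page markup. Ariel's learnt rules are not this simple.

```ruby
# Hypothetical sketch; names and rule shapes are illustrative, not Ariel's API.

# Start rule: scan forward, return the index just after the landmark token.
def start_index(tokens, landmark)
  i = tokens.index(landmark)
  i && i + 1
end

# End rule: scan backward from the end, return the index of the landmark.
def end_index(tokens, landmark)
  tokens.rindex(landmark)
end

# Extract the tokens lying between a start landmark and an end landmark.
def extract(tokens, start_landmark, end_landmark)
  from = start_index(tokens, start_landmark)
  to = end_index(tokens, end_landmark)
  return nil if from.nil? || to.nil? || to < from
  tokens[from...to]
end

# A made-up, already tokenized page.
page = %w[<html> <b> 3.5 U.S. dollars = 1.8486241 British pounds </b> </html>]

# Level 1: the calculation is located within the whole page.
calculation = extract(page, "<b>", "</b>")

# Level 2: the result rule only ever sees the calculation, so it can stay simple.
result = calculation.drop(start_index(calculation, "="))

puts "calculation: #{calculation.join(' ')}"
puts "result: #{result.join(' ')}"
```

The point of the second step is that the rule for the result never has to describe how to find the calculation in the page; it only has to work inside its parent node.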
The second example (raa) learns rules using just 2 labeled examples. This is
probably fewer than I’d recommend in most cases, but as it works… This
example consists of project entries in the Ruby Application Archive. The
structure of the page is very flat, so all rules are applied to the full
page. Rules are learnt and applied as shown above. The structure.yaml files
included in the examples directories already include rules generated by
Ariel; use these if you just want to see extraction working.
Note: The interface demonstrated by ariel above is not very flexible or
friendly; it just serves as a demonstration for the moment.
Generating rules takes quite a long time. It is always going to be an
intensive operation, but there are some very simple and obvious improvements
in efficiency that can be made. For a start, the rule candidate refining
process currently re-applies the same rules over and over every time the
remaining rule candidates are ranked. This is where most time is spent, so
caching these should make a big difference. This will definitely be
implemented. Other performance enhancements are bound to be there, but the
focus at this time is to get something that works.
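One way the caching mentioned above might look is sketched here. This is not how Ariel does or will do it; the CachedMatcher class and its methods are hypothetical, and it assumes a rule is a callable and that rules and examples can be used as hash keys.

```ruby
# Hypothetical sketch of caching rule applications; not Ariel's implementation.
class CachedMatcher
  attr_reader :applications

  def initialize
    @cache = {}
    @applications = 0 # counts how often a rule is really applied
  end

  # Return the remembered result for this (rule, example) pair, applying the
  # rule only the first time the pair is seen.
  def match?(rule, example)
    key = [rule, example]
    return @cache[key] if @cache.key?(key)
    @applications += 1
    @cache[key] = rule.call(example)
  end
end

matcher = CachedMatcher.new
rule = ->(text) { text.include?("=") }
examples = ["1 + 1 = 2", "no result here"]

# Rank the same rule three times, as repeated ranking passes would.
3.times { examples.count { |ex| matcher.match?(rule, ex) } }
puts matcher.applications
```

The rule is really applied only once per example, however many times the candidates are re-ranked. The cost is memory for the cache, which grows with the number of rule and example pairs seen.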
Ariel is developed by Alex Bradbury as a Google Summer of Code project under
the mentoring of Austin Z…