How to extract data from a web site?

Hi,

I would like to use Ruby to read the content of a web site, and then
extract certain data from it. The site is machine generated so the
format doesn’t change, but unfortunately it is far from being valid
XHTML or similar.

What would be the easiest way to get there? I guess I need some kind of
HTML parser, or? How do I read a web site into Ruby in the first place?

Ingo

On Mon, Apr 17, 2006 at 06:12:00AM +0900, Ingo W. wrote:

I would like to use Ruby to read the content of a web site, and then
extract certain data from it. The site is machine generated so the
format doesn’t change, but unfortunately it is far from being valid
XHTML or similar.

What would be the easiest way to get there? I guess I need some kind of
HTML parser, or? How do I read a web site into Ruby in the first place?
Hi,

Well, for starters you could use your human interface to connect to the
Ruby-talk archives and look there for the answer… :)
No offence, but I’ve been on this list for only about two months, and
this question has already been asked about a million times… :)

Anyway, Rubyful Soup should make you happy:

All the best,
Alex

Ingo W. wrote:

Hi,

I would like to use Ruby to read the content of a web site, and then
extract certain data from it. The site is machine generated so the
format doesn’t change, but unfortunately it is far from being valid
XHTML or similar.
In order to parse the page, first you need to push it through some kind
of tidy-up engine, so you can turn invalid HTML into XML. I recommend
this one:

http://tidy.rubyforge.org/

After this step you have reduced the problem of arbitrary (possibly
invalid) HTML parsing to XML parsing which is definitely easier, e.g.
with REXML.
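
A minimal sketch of that pipeline (untested; it assumes the tidy gem
from the URL above with the native libtidy installed at
/usr/lib/libtidy.so, and the URL and XPath are made up, so replace them
with your own):

require 'open-uri'
require 'tidy'
require 'rexml/document'

# Tell the tidy gem where the native libtidy library lives (path varies).
Tidy.path = '/usr/lib/libtidy.so'

# Read the (possibly invalid) HTML into a string.
html = open('http://example.com/report.html') { |f| f.read }

# Clean it up into well-formed XML.
xml = Tidy.open(:show_warnings => false) do |tidy|
  tidy.options.output_xml = true
  tidy.clean(html)
end

# From here on it is ordinary XML, so REXML and XPath work.
doc = REXML::Document.new(xml)
REXML::XPath.each(doc, '//td') do |cell|
  puts cell.text
end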

What would be the easiest way to get there? I guess I need some kind of
HTML parser, or? How do I read a web site into Ruby in the first place?
Another possibility would be Rubyful Soup:

You do not need pre-tidying here, just ‘use it’. Examples:

soup = BeautifulSoup.new(page)

find all <p>’s:

soup.find_all('p')

find all tags that have an attribute align="left":

soup.find_all { |tag| tag['align'] == 'left' }

You get the idea.
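
Putting the pieces together, a minimal Rubyful Soup sketch might look
like this (untested; the URL and the tag names are only placeholders for
whatever your page actually contains):

require 'open-uri'
require 'rubyful_soup'

# Fetch the raw HTML; no pre-tidying is needed for Rubyful Soup.
html = open('http://example.com/listing.html') { |f| f.read }
soup = BeautifulSoup.new(html)

# Text of every table cell...
soup.find_all('td').each { |td| puts td.string }

# ...and the target of every link that has one.
soup.find_all('a').each { |a| puts a['href'] if a['href'] }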

Another possibility (I have never tried it, but it looks good):

http://rubyforge.org/projects/ruby-htmltools/

There are certainly many more ways to do this, but I think these should
be enough to get you started.

HTH,
Peter

Ingo W. wrote:

Hi,

I would like to use Ruby to read the content of a web site, and then
extract certain data from it. The site is machine generated so the
format doesn’t change, but unfortunately it is far from being valid
XHTML or similar.

What would be the easiest way to get there? I guess I need some kind of
HTML parser, or? How do I read a web site into Ruby in the first place?

BTW the cleanest method I have ever seen (and have used, and am still
using, both at my full-time job and for my PhD thesis software) is
unfortunately not available in Ruby. For this reason I have to use Java
(which I do not like at all, especially compared to Ruby) in both
places. There I am using the following method:

JavaXPCOM is a Java wrapper for Mozilla’s native XPCOM, written by
Javier Pedemonte.
W3CConnector (a piece of software I made) generates a wrapper around
JavaXPCOM, implementing the W3C interfaces.

This way, using my W3CConnector, you have full access to the Mozilla
DOM through wrapper classes implementing the W3C interfaces. (The
W3CConnector wrapping is optional; it just lets me use Xerces and other
packages that talk to the W3C DOM interfaces.)

Access to the full Mozilla DOM tree means you can write full XPath
queries to extract nodes, and XPath is a really powerful language for
such purposes.

So, the missing layer in Ruby is RbXPCOM. Although it exists, it is
quite unfinished and has been abandoned since 2001. Does anybody know
anything about it?

My PhD project is a next-generation web extraction engine, which relies
heavily on the Mozilla DOM, and (AFAIK) that cannot be acquired in any
way other than the one I have described here (for example, I am using
the coordinates of rendered elements, etc.).

So does anyone have any info whether RbXPCOM will be finished in the
future?

Cheers,
Peter

On Apr 16, 2006, at 23:12, Ingo W. wrote:

I would like to use Ruby to read the content of a web site, and then
extract certain data from it. The site is machine generated so the
format doesn’t change, but unfortunately it is far from being valid
XHTML or similar.

Some people have already suggested HTML parsers. I have done a lot of
crawling and wanted to add that if you need to extract data from a
single page that is machine generated, a simple regex is often enough.
You need to make a choice depending on the actual page and the kind of
stuff you want to extract.
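
For example, if the generated page always wraps the value you want in
the same markup, something as small as this may do (the URL, the markup
and the pattern are made up here; adapt them to the real page):

require 'open-uri'

html = open('http://example.com/status.html') { |f| f.read }

# Suppose the generator always emits e.g. <td class="temp">23.4</td>;
# then one pattern pulls the value out:
if html =~ /<td class="temp">([\d.]+)<\/td>/
  puts "temperature: #{$1}"
end

# Or collect every occurrence at once:
temps = html.scan(/<td class="temp">([\d.]+)<\/td>/).flatten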

– fxn

Peter S. wrote:

of tidy-up engine, so you can turn invalid HTML into XML.
That depends on what data you are after, and where you want to look for
it.

If, for example, you just want to get a list of css files referenced in
a page, then regexen would likely be simpler and faster than the
tidy-up approach.
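
Something along these lines, say (a rough, untested pattern; real link
tags vary in attribute order and quoting, so treat it as a starting
point):

html = File.read('page.html')
# Pull the href out of every stylesheet <link> tag.
css_files = html.scan(/<link[^>]+rel=["']stylesheet["'][^>]*href=["']([^"']+)["']/i).flatten
puts css_files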

I recommend
this one:

http://tidy.rubyforge.org/

After this step you have reduced the problem of arbitrary (possibly
invalid) HTML parsing to XML parsing which is definitely easier, e.g.
with REXML.

Sort of. I’ve seen tidy make some odd assumptions about what the
“correct” output should be, based on surreal HTML input. And this can
throw off the XML manipulation code.

soup = BeautifulSoup.new(page)

I’ve just been trying out BeautifulSoup to parse some nasty del.icio.us
markup (it has an XHTML DOCTYPE, but is painfully broken).

I had been using some simple regex iteration over the source, but they
changed that page layout, my app broke, and I thought perhaps I’d give
BeautifulSoup another shot. But I realized why I stopped using it in
the first place: it’s way too slow. (Or at least way slower than my
hand rolled hacks.)

I’ve tried a number of ways, over various applications, to extract stuff
from HTML. If I can get predictable XML right off, then that’s a big
help; I can pass it into a stream parser, or use a DOM if the file isn’t
too large.

When handed broken markup, I’ve found that many times the problem is in
only one or two places, most often the header (with malformed empty
elements). Much time can be saved by grabbing a subset of the raw HTML
(with some simple stateful line-by-line iteration) and cleaning up what
I actually need (and often that extracted subset is proper XML all by
itself).
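
A rough sketch of that kind of stateful, line-by-line extraction
(illustrative only; the URL and the start/end markers are made up and
depend entirely on the page being scraped):

require 'open-uri'
require 'rexml/document'

inside  = false
snippet = ''

open('http://example.com/data.html') do |f|
  f.each_line do |line|
    inside = true if line.include?('<table id="results"')
    snippet << line if inside
    inside = false if line.include?('</table>')
  end
end

# If the generator emits that table as well-formed markup,
# the extracted fragment parses directly.
doc = REXML::Document.new(snippet)
doc.elements.each('//tr') do |row|
  puts row.elements.to_a.map { |cell| cell.text }.join(' | ')
end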

There is a real cost to making the parsing/cleaning code highly robust,
and if you can make certain assumptions about the source text (and live
with the risks that things can change), you can often make the app
faster/simpler.


James B.

Judge a man by his questions, rather than his answers.

  -- Voltaire

That depends on what data you are after, and where you want to look for it.

If, for example, you just want to get a list of css files referenced in
a page, then regexen would likely be simpler and faster than the
tidy-up approach.
Sure. But the original poster mentioned some HTML parsing, so I thought
regexps would not be enough.

Sort of. I’ve seen tidy make some odd assumptions about what the
“correct” output should be, based on surreal HTML input. And this can
throw off the XML manipulation code.

Of course. tidy is just ‘better than nothing’. I did not mean it will
work everywhere (it certainly won’t), but at least it can get you closer
to your goal (in some cases).

There is a real cost to making the parsing/cleaning code highly robust,
and if you can make certain assumptions about the source text (and live
with the risks that things can change), you can often make the app
faster/simpler.
Well, for something really robust, take a look at my earlier mail in
this thread. You cannot get anywhere near that with any other
tool/technique (if you think you can, let me know).
I am in the web extraction business; we usually extract data from
hundreds of thousands of pages on a daily basis, so I have some
experience with this stuff. Our wrapper generator solutions are usable
for, well, most of the pages out there (say 95%), utilizing adaptive
techniques for when a page changes, and other things for robustness
etc. But this software took a medium-sized team 5 years to develop, and
now that it is finished we could almost rewrite it from scratch,
because it is nearly unusable on some of the Web 2.0 pages… (due to
AJAX etc.)

cheers,
Peter

Thanks so much for all your replies!

I ended up using a simple regex, and so far it works just fine.

Ingo W.