Simple screen scraper using scrAPI

I’m a Ruby novice. Does anyone have an example of a simple screen
scraper in Ruby that uses scrAPI (and works on Mac OS X)?

All I need it to do is:

  1. Go to a specified web page
  2. Use a CSS selector to grab and print out any section of the page

It does not need to find links on the page or crawl.

I tried the eBay example at
http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/
and have tried the recommended “require” and Tidy.path statements,
but couldn’t find a combination that works.
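
For reference, the sort of thing I’ve been trying looks roughly like this (the URL, the CSS selector, and the Tidy library path are just placeholders, and the commented-out Tidy lines are exactly the part I can’t get right):

require 'rubygems'
require 'scrapi'
require 'uri'

# the blog post says Tidy has to be pointed at the library first;
# the path below is only a guess for OS X
# require 'tidy'
# Tidy.path = '/usr/lib/libtidy.dylib'

scraper = Scraper.define do
  # grab the text of whatever matches the CSS selector
  process "div#content", :section => :text
  result :section
end

puts scraper.scrape(URI.parse("http://example.com/"))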

-Doug

doog wrote:

http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/

and have tried the recommended “require” and Tidy.path statements,
but couldn’t find a combination that works.

Please tell me something. Do you want to:

  1. Parse a Web page using scrAPI, or

  2. Parse a Web page.

If you are more concerned with parsing content from a Web page than using scrAPI, then I can help you.

Thanks so much. Parsing a web page is sufficient, and would
be a great starting point.

-Doug

On 11/28/06, doog [email protected] wrote:

Thanks so much. Parsing a web page is sufficient, and would
be a great starting point.

If parsing a web page is sufficient, I definitely recommend Hpricot.
It’s simple, easy, and does the job very well.

http://code.whytheluckystiff.net/hpricot/
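
A minimal sketch of the kind of script this amounts to (untested as written; the URL and the CSS selector are just placeholders):

require 'rubygems'
require 'open-uri'
require 'hpricot'

doc = Hpricot(open("http://example.com/"))

# grab every element matching a CSS selector and print its text
(doc / "div.content p").each do |element|
  puts element.inner_text
end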

Cheers,
Alvim.

doog wrote:

Thanks so much. Parsing a web page is sufficient, and would
be a great starting point.

Okay, here is a simple parser in ordinary Ruby; it will give you some ideas about what is involved in parsing.

There are many libraries that do much more than this script does; some of them have steep learning curves, and many offer exotic ways to acquire particular kinds of content.

This is a simple parser that returns an array containing all the table content in the target Web page. I wrote it earlier today for someone who wanted to scrape a yahoo.com financial page, which explains the target page (something easy to change):


#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
resp, page = http.get('/q?s=IBM', nil)

# BEGIN processing HTML

# collect the contents of every <tag> ... </tag> pair found in "data"
def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

# build a nested array: tables -> rows -> cells
out_tables = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_rows = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    out_cells = parse_html(row, "td")
    out_cells.each do |cell|
      cell.gsub!(%r{<.*?>}, "") # strip any markup left inside the cell
    end
    out_rows << out_cells
  end
  out_tables << out_rows
end

# END processing HTML

# examine the result

# print the nested array as an indexed, indented listing
def parse_nested_array(array, tab = 0)
  n = 0
  array.each do |item|
    if(item.size > 0)
      puts "#{"\t" * tab}[#{n}] {"
      if(item.class == Array)
        parse_nested_array(item, tab + 1)
      else
        puts "#{"\t" * (tab + 1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(out_tables)


This program emits an indexed, indented listing of the table content that it extracted, so you can then customize it by acquiring particular table cells through use of the provided index numbers.
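
For example, once the listing shows which indexes hold the data you care about, a particular cell can be read directly (the index numbers below are made up):

# table 2, row 1, cell 1: purely hypothetical indexes
puts out_tables[2][1][1]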

It should work with any Web page that has the interesting content embedded in tables, and whose syntax is reliable.

The primary value of this program is to show you how easy it is to scrape pages using Ruby, and to give you a starting point you can customize to meet your own requirements.

doog wrote:

I’m a Ruby novice. Does anyone have an example of a simple screen
scraper in Ruby that uses scrAPI (and works on Mac OS X)?

Though I don’t seem to understand the intensity of the holy war Paul is leading against anything that is not hand-coded on the fly, this time I will have to agree with him: the request ‘I would like to write a screen scraper in scrAPI (or Hpricot, or xxx)’ is not always the right way. Screen scraping can be very tedious and complex, and it really depends on the input page, the type of actions you would like to perform (is fetching the page trivial? do you need to navigate, i.e. fill forms, click links? how complex is the parsing?), the quality you would like to achieve, robustness (i.e. if the underlying page changes, the scraper should still perform well), and another 10k things. Some time ago I wrote a small article on this:

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

It is a bit outdated now (I am planning to beef it up with FireWatir,
Hpricot and other sections) but it can help you as a starting point.

Conclusion: what should be used, and how, depends on the page and the task at hand. I suggest that if you have a concrete problem, you drop us a mail and we will figure out something.

Cheers,
Peter

--
http://www.rubyrailways.com

Hi Paul! Instead of talking about scrAPI, would you tell us about the magic that was inside GraForth???

As a teen, I was a fan of the bass player of the metal band Iron Maiden, and of YOU :wink:
Kidding a little, but not that much…

Sorry for the offtopism !

Zouplaz wrote:

Hi Paul! Instead of talking about scrAPI, would you tell us about the magic that was inside GraForth???

Ha! A reference to a different era, a distant voice. :slight_smile:

For the other readers, GraForth was a Forth embodiment I cooked up about 25 years ago, at a time when most things were written in assembly. It supported a kind of graphics that would be embarrassingly crude by modern standards.

It was basically a way to get around the fact that there were almost no
high-level languages, and none that mere mortals could either afford or
support with the small HDD and RAM sizes of the era.

As a teen, I was a fan of the bass player of the metal band Iron Maiden, and of YOU :wink:

I’m glad to see you had your priorities straight. :slight_smile:

On 29/11/2006 18:01, Paul L. wrote:

Ha! A reference to a different era, a distant voice. :slight_smile:

We should not forget those times (and I didn’t live through the 70’s - that was certainly something else), even if there’s no VW van anywhere.

Art is a performance and performance comes from constraints…

For the other readers, GraForth was a Forth embodiment I cooked up about 25 years ago, at a time when most things were written in assembly. It supported a kind of graphics that would be embarrassingly crude by modern standards.

Hey! I remember a demo of GraForth showing a 3D rotating cube (maybe color-filled) - not that bad for the single MHz of my Apple IIc (no, I didn’t have that wonderful II+, I was a little late).

It was basically a way to get around the fact that there were almost no
high-level languages, and none that mere mortals could either afford or
support with the small HDD and RAM sizes of the era.

Did you use any cross-compilation systems to code GraForth or AppleWriter? (You know, the kind of systems most early-80s teen geeks dreamed of having access to.)

As a teen, I was a fan of the bass player of the metal band Iron Maiden, and of YOU :wink:

I’m glad to see you had your priorities straight. :slight_smile:

:-))

Thanks to this group for helping me get various screen scrapers
up and running. I had made a couple of silly typos that held me
back. It will take me a weekend or so of spare time to digest what
I have and actually write the code I want.

Thanks Peter, Paul and Alvim :slight_smile:

Thanks Alvim, for the pointer to Hpricot – I’ve got the demo script working and will study it.

Thanks Peter for your article – I had read it before, but re-reading
it at this point helps quite a bit.

Thanks Paul for the code below. I know regex, so it will just be
a matter of me learning the flow/expression syntax.

-Doug

Peter S. wrote:

doog wrote:

I’m a Ruby novice. Does anyone have an example of a simple screen
scraper in Ruby that uses scrAPI (and works on Mac OS X)?

Though I don’t seem to understand the intensity of the holy war Paul is
leading against anything that is not hand-coded on the fly,

For purposes of clarification, I simply want newbies to see how easy it is to write these things in ordinary Ruby code.

And, lest there be any confusion on this point, I always say that, or
something like it – seemingly to no effect.

Zouplaz wrote:

/ …

For the other readers, GraForth was a Forth embodiment I cooked up about
25 years ago, at a time when most things were written in assembly. It
supported a kind of graphics that would be embarrassingly crude by modern
standards.

Hey ! I remember a demo of GraForth showing a 3D rotating cube (maybe
color filled)

How ironic that you should mention that. I was recently deposed by a group of lawyers defending all the big game-software players (Microsoft, Nintendo, et al.) against a patent lawsuit whose plaintiffs claimed they had patented the idea of using a joystick or keyboard to control an onscreen 3D graphics display. If they had prevailed in their claim, it would have been a gold mine.

Millions of dollars of royalties were at stake. Then a researcher discovered I had written GraForth and an earlier program called Apple World that did what the patent claimed, before the date of the patent. Basically my testimony took the wind out of their sails.

Not that bad for the single MHz of my Apple IIc (no, I didn’t have that wonderful II+, I was a little late).

It was basically a way to get around the fact that there were almost no
high-level languages, and none that mere mortals could either afford or
support with the small HDD and RAM sizes of the era.

Did you use any cross-compilation systems to code GraForth or AppleWriter? (You know, the kind of systems most early-80s teen geeks dreamed of having access to.)

GraForth, yes, Apple Writer, no. I ported GraForth over to the early IBM PC, but Apple Writer was mired in assembly language. I had to completely rewrite Apple Writer for the PC (under a different name, of course) because it was plain assembly, no abstractions. GraForth was, after all, Forth, so it was more portable.

From: “Paul L.” [email protected]

def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

Are ‘<’ and ‘>’ characters legal inside quoted attribute values?

E.g. <a title="a>b">

Also, is the closing tag allowed to have whitespace between the
tag name and the ending bracket?

E.g. </td >

The latter would be trivial to accommodate with a \s* obviously; but the former would be a shade trickier (though certainly still possible with a regexp).
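
For the whitespace case, something like this ought to do it (untested, just the original regexp with one addition):

%r{<#{tag}\s*.*?>(.*?)</#{tag}\s*>}im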

There’s a lot of foul, cruel, and bad-tempered HTML out there
in the wild. Depending on the needs of the Original Poster,
death could await a simplistic HTML lexer with nasty big pointy
teeth.

TIM: I warned you! But did you listen to me? Oh, no, you knew it
all, didn’t you? Oh, it’s just a harmless little markup language,
isn’t it? Well, it’s always the same, I always–
ARTHUR: Oh, shut up!
TIM: --But do they listen to me?–
ARTHUR: Right!
TIM: -Oh, no–
KNIGHTS: Charge!

All in fun,

Bill

Bill K. wrote:

From: “Paul L.” [email protected]

def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

Are ‘<’ and ‘>’ characters legal inside quoted attribute values?

I don’t think so. I think they have to be escaped, like most things in HTML syntax.

E.g. <a title="a>b">

Also, is the closing tag allowed to have whitespace between the
tag name and the ending bracket?

E.g. </td >

Not syntactically correct, but the question might be “will it happen?” In which case the answer is “probably”.

The latter would be trivial to accommodate with a \s* obviously;

Yep.

but the former would be a shade trickier (though certainly still
possible with a regexp.)

I don’t think that one needs to be addressed. It isn’t syntactically correct, as well as being strange. I know when I create relatively free-form attributes like the content of “title,” I always escape the HTML tag delimiters. I am reasonably sure it is a requirement.
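
I.e. an attribute value like the “a>b” example above would normally be written with the delimiter escaped:

title="a&gt;b"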

If we allowed bare “<” and “>” between quotes in attributes, we would have to scan the tags character by character to be sure to have a valid parse. In nearly all cases involving delimiters like quotes and any relaxed, permissive syntax, you end up scanning with a state machine.
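
Here is a toy sketch of that idea, just to show the shape of it (the split_tag name is made up, and this is not meant as a real parser; it only finds the true end of a single tag while honoring quoted attribute values):

# walk the text character by character; '>' only ends the tag
# when we are not inside a quoted attribute value
def split_tag(html, start)
  quote = nil
  pos = start
  while pos < html.length
    ch = html[pos, 1]
    if quote
      quote = nil if ch == quote   # closing quote
    elsif ch == '"' || ch == "'"
      quote = ch                   # opening quote
    elsif ch == '>'
      return html[start..pos]      # real end of the tag
    end
    pos += 1
  end
  nil                              # tag never closed
end

puts split_tag(%q{<a title="a>b" href="x">link</a>}, 0)
# prints: <a title="a>b" href="x">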

There’s a lot of foul, cruel, and bad-tempered HTML out there
in the wild.

Yeah, and I wrote some of it personally, or it was written with my editor Arachnophilia.

Depending on the needs of the Original Poster,
death could await a simplistic HTML lexer with nasty big pointy
teeth.

Yes, as I have said. :slight_smile:

TIM: I warned you! But did you listen to me? Oh, no, you knew it
all, didn’t you? Oh, it’s just a harmless little markup language,
isn’t it? Well, it’s always the same, I always–
ARTHUR: Oh, shut up!
TIM: --But do they listen to me?–
ARTHUR: Right!
TIM: -Oh, no–
KNIGHTS: Charge!

Not at all fair to a helpless attack-rabbit. :slight_smile:

Hi –

On Fri, 1 Dec 2006, Paul L. wrote:

Also, is the closing tag allowed to have whitespace between the
tag name and the ending bracket?

E.g. </td >

Not syntactically correct, but the question might be “will it happen?” In
which case the answer is “probably”.

I believe it is actually legal. In the XML 1.1 spec, an end-tag is:

ETag ::= '</' Name S? '>'

(where S is any non-zero amount of whitespace, and ? indicates zero or one occurrence of it, i.e. the whitespace is optional), and if I’m reading the ISO 8879 roadmap in Neil Bradley’s “Concise SGML Companion” correctly, it’s legal in SGML generally.

David