Parsing HTML / following links etc

thechak · January 23, 2008, 12:36pm

Hello all,

I was pushed towards ruby by a friend. I’m used to the usual
shell scripting and was told this would be much more powerful /
graceful / easier.

It does look all very exciting but I’m having a problem looking for the
easiest way to implement something quite simple.

I’m used to using wget to crawl some of my sites to a certain layer.

wget http://www.digg.com -r -l 2 will digg down two layers from the
front page (follow the links).

I can’t find an easy way of doing this. Open-uri doesn’t seem to
support recursive following. I’ve looked at pulling down the HTML,
parsing it and feeding the links back to open-uri, but there doesn’t
seem to be an easy way of doing this.

Another thing I would like to do is pull down other elements from the
html such as images, so I explored html-parsing libraries, but they all
seemed to be geared towards manipulation rather than downloading
information for manipulation later.

Thanks for your help if you can

Dan

thechak · January 23, 2008, 12:42pm

What I meant by other elements is the --page-requisites in wget
(GNU Wget 1.21.1-dirty Manual)

Thanks

thechak · January 23, 2008, 12:50pm

Mmmm thanks for your advice. One thought - is it possible to get ruby to
run wget externally and pull into a directory and then set ruby off to
do its magic once this is done?

Florian Ebeling wrote:

you might want to try the hpricot (!) gem, but that covers only
html parsing. then you use the standard http client and pull
the documents you fancy.

a regular http client library does not typically include getting
all referenced objects, this is rather a higher-level ‘application’
feature.

thechak · January 23, 2008, 12:47pm

you might want to try the hpricot (!) gem, but that covers only
html parsing. then you use the standard http client and pull
the documents you fancy.

a regular http client library does not typically include getting
all referenced objects, this is rather a higher-level ‘application’
feature.
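
The "parse the HTML, then fetch what you fancy" approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library instead of hpricot, pulls links out with a crude regular expression (a real HTML parser would be more robust), and the helper names extract_links and crawl are made up for this example. It also assumes a reasonably recent Ruby where open-uri provides URI.open.

```ruby
require 'uri'
require 'open-uri'

# Hypothetical helper: collect absolute http(s) links from <a href="...">
# tags. The regex is deliberately crude; it is an assumption of this
# sketch, not how hpricot or wget do it.
def extract_links(html, base)
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i).flatten
  links = hrefs.map do |href|
    begin
      uri = URI.join(base, href)
      next unless uri.is_a?(URI::HTTP) # skips mailto:, javascript:, etc.
      uri.fragment = nil               # "#top" points at the same page
      uri.to_s
    rescue URI::Error
      nil                              # ignore hrefs that cannot be parsed
    end
  end
  links.compact.uniq
end

# Hypothetical helper: fetch the start page and follow links up to
# `depth` levels below it, loosely like `wget -r -l depth`.
# `fetch` is any callable returning the HTML for a URL; the default
# uses open-uri.
def crawl(start, depth, fetch = ->(url) { URI.open(url).read })
  pages = {}
  frontier = [start]
  (0..depth).each do |level|
    next_frontier = []
    frontier.each do |url|
      next if pages.key?(url)          # never fetch the same page twice
      pages[url] = fetch.call(url)
      next_frontier.concat(extract_links(pages[url], url)) if level < depth
    end
    frontier = next_frontier
  end
  pages                                # maps each URL to its HTML
end
```

Calling crawl('http://www.digg.com', 2) would then return a hash of URL to HTML for the front page and two layers of links. Unlike wget, this sketch does not restrict itself to one host, does not save files to disk and does not fetch images or other page requisites.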

thechak · January 23, 2008, 1:04pm

Er… Mechanize anybody?

Mechanize[1] is easy to use, and Hpricot is integrated to make it easier
to parse HTML. It allows you to easily click links too.

[1] => http://mechanize.rubyforge.org/mechanize/

Hope that Helps

Regards,
Lee

thechak · January 23, 2008, 3:23pm

Thanks for the advice guys.

It’s a shame there isn’t an easy way to use the -r -p switches from wget
but I will try to learn how to use these other gems to get the job done.

thechak · January 23, 2008, 1:17pm

On Wednesday 23 January 2008, Dan Cuddeford wrote:

Mmmm thanks for your advice. One thought - is it possible to get ruby to
run wget externally and pull into a directory and then set ruby off to
do its magic once this is done?

In ruby, you can execute a command in a subshell using the system method
or the backticks (`) operator. This way, you can create a ruby script
which downloads what you need using wget, then goes on doing whatever
you want with those files. To run the command in your original post from
ruby, you’d use

system 'wget http://www.digg.com -r -l 2'

or

`wget http://www.digg.com -r -l 2`

The difference between the two methods is that system returns true or
false depending on whether the command exited correctly or with an error
status (or nil if the command could not be run at all), while `cmd`
returns the standard output of cmd (which is not displayed on screen).

I hope this helps

Stefano
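
The difference between system and backticks described above can be seen in a small sketch. To keep it runnable without wget or a network connection, it uses echo and false as stand-ins, and it assumes a Unix-like shell; the variable names and the deliberately nonexistent command name are invented for the example.

```ruby
# system runs the command and lets its output go straight to the terminal.
ok = system('echo downloaded')        # true: the command exited with status 0

# `false` is a standard Unix command that always exits with status 1.
failed = system('false')              # false: non-zero exit status

# A command that cannot be started at all yields nil rather than false.
# (The name below is assumed not to exist on your machine.)
missing = system('no-such-command-for-this-example')

# Backticks capture standard output instead of displaying it.
output = `echo downloaded`            # the text, including the newline

# After either form, $? holds the Process::Status of the last command.
last_succeeded = $?.success?
```

Swapping the stand-ins for 'wget http://www.digg.com -r -l 2' gives the calls from the post: system if you only care whether the download worked, backticks if you want to inspect what wget wrote to standard output.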

thechak · January 23, 2008, 4:49pm

It’s a shame there isn’t an easy way to use the -r -p switches from wget

But there is.

system 'wget -r -p http://blabla.whatever/lalala/yooo.avi'

You can mimic wget’s behaviour in pure ruby too, but wget is quite big,
so it is not a quick job to implement all of its features in ruby. I
myself only use open-uri to download a single file, but if anyone writes
a happy wget-ruby class that includes recursive downloads I’d happily
switch to use that too.

thechak · January 23, 2008, 4:10pm

On Wednesday 23 January 2008, Dan Cuddeford wrote:

Thanks for the advice guys.

It’s a shame there isn’t an easy way to use the -r -p switches from wget
but I will try to learn how to use these other gems to get the job done.

As I explained in my previous post, you can run any command in a shell
from ruby using `cmd` or system('cmd'). This includes calling wget with
any options. If I understood you correctly, this is what you’re looking
for. Am I missing something?

Stefano

thechak · January 25, 2008, 5:05pm

Stefano C. wrote:

In ruby, you can execute a command in a subshell using the system method
or the backticks (`) operator. This way, you can create a ruby script
which downloads what you need using wget, then goes on doing whatever
you want with those files.

`wget http://www.digg.com -r -l 2`

How do I whack a variable in here? It doesn’t seem to want a string

thechak · January 26, 2008, 1:40am

Dan Cuddeford wrote:

`wget http://www.digg.com -r -l 2`

How do I whack a variable in here? It doesn’t seem to want a string

The backticks operator – like most other operators in Ruby – isn’t
actually an operator, it’s a method. In this case, it’s a method named
‘`’ that lives on the Kernel module and takes a String as its
parameter.

So, `wget http://www.digg.com -r -l 2` is actually just syntactic
sugar for `("wget http://www.digg.com -r -l 2"). Well, except that’s
not valid Ruby syntax. But this is:

send(:`, "wget http://www.digg.com -r -l 2")

So, now that you know that the argument is, in fact, a String, you
can probably guess what you can do with that String: String
Interpolation!

`wget #{uri} -r -l 2`

jwm
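
As a small follow-up to the interpolation trick above, here is a sketch of how the variable ends up in the command, plus one precaution that is not mentioned in the thread: if the value can contain spaces or shell metacharacters, escaping it with the standard Shellwords module avoids surprises. The uri value comes from the thread; the tricky value and the use of echo in place of wget (so the sketch runs offline) are assumptions for illustration.

```ruby
require 'shellwords'

uri = 'http://www.digg.com'           # example value from the thread

# Interpolation works the same inside backticks as inside a
# double-quoted string, so this is the command the shell would see:
command = "wget #{uri} -r -l 2"

# A value with a space, "?" and "&" would be mangled by the shell
# unless it is escaped first.
tricky = 'http://example.test/a b?x=1&y=2'
safe_command = "wget #{Shellwords.escape(tricky)} -r -l 2"

# Demonstrated with echo rather than wget: the escaped value reaches
# the command as one intact argument.
echoed = `echo #{Shellwords.escape(tricky)}`

# Another option is to bypass the shell by giving system each argument
# separately, e.g. system('wget', uri, '-r', '-l', '2') (not run here).
```

For a fixed, trusted URL like the one in the thread, plain interpolation is all you need; the escaping only matters once the value comes from somewhere you do not control.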

thechak · January 26, 2008, 3:57am

jwm,

That was an absolute joy to read. Well said!