How to extract links of a particular class type


#1

I have a web page which has a number of links.
The only way I can differentiate the links is by their class attribute.
I need to extract the set of links of a particular class, together with
their titles.

I tried using the scRUBYt! extractor, but I have no idea where to specify
the class.

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/'
  fill_textfield 'q', 'ruby'
  submit
  link "Ruby programming language" do
    url "href", :type => :attribute
  end
end
junk = google_data.to_xml

Also, how do I get the output in text/string format?


#2

On 2008.11.17., at 19:17, Sita Rami R. wrote:

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/'
  fill_textfield 'q', 'ruby'
  submit
  link "Ruby programming language" do
    url "href", :type => :attribute
  end
end
junk = google_data.to_xml

Also, how do I get the output in text/string format?

By the way, you should get the newest scRUBYt!, 0.4.05, which does not
depend on RubyInline, Ruby2Ruby, ParseTree, etc.

What would you like to do exactly?

  1. class: use an XPath like this: stuff "//td[@class='red']"
  2. text/string: use to_hash instead of to_xml.

HTH,
Peter


http://www.rubyrailways.com
http://scrubyt.org
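
For readers without scRUBYt! installed, the two suggestions above (select by class with an XPath, then work with hashes rather than XML) can be tried out with plain Ruby. This is only a rough sketch using REXML from the Ruby distribution, not scRUBYt! itself. The sample markup, the example.com URLs, and the helper name links_with_class are made up for illustration, and REXML needs well-formed markup, which real pages often are not.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup standing in for a fetched page.
SAMPLE_HTML = <<~HTML
  <html><body>
    <a class="l" href="http://example.com/first">First result</a>
    <a class="nav" href="http://example.com/about">About</a>
    <a class="l" href="http://example.com/second">Second result</a>
  </body></html>
HTML

# Return one hash per <a> element whose class attribute equals css_class.
# (links_with_class is an illustrative helper, not part of any library.)
def links_with_class(markup, css_class)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//a[@class='#{css_class}']").map do |a|
    { :title => a.text.strip, :url => a.attributes['href'] }
  end
end

# Print each title and its link on a single line.
links_with_class(SAMPLE_HTML, 'l').each do |r|
  puts "#{r[:title]} - #{r[:url]}"
end
```

Note that the predicate [@class='l'] only matches when the class attribute is exactly "l"; an element carrying several classes would need a different test.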


#3

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/search?hl=en&q=gap+inc'

  # Select result links by their class attribute
  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
  next_page "Next", :limit => 3
end

# Write one "title - url" line per result
File.open("google_results.txt", 'w') do |f|
  google_data.to_hash.each do |result|
    f.puts "#{result[:link_title]} - #{result[:link_url]}"
  end
end

produces:

Shop clothes for women, men, maternity, baby, and kids at gap.com

HTH,
Peter




#4

My program needs to do the following:
navigate to the Google site, enter "ruby" as the search text, and click
the search button.
The results page then shows the first 10 results.

I would like to collect those 10 links and their titles and log them in
an output file.
Using the scRUBYt! extractor I got partway: all 10 links are captured,
but I am unable to get the titles.
Also, I only know how to extract in XML format,

but I need each title and its link on a single line.

My script goes here:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  # Perform the action(s)
  fetch 'http://www.google.com/'
  fill_textfield 'q', 'Gap Inc'
  submit
  # Construct the wrapper
  link "gap" do
    url "href", :type => :attribute
  end
  next_page "Next", :limit => 10
end
junk = google_data.to_xml
puts junk

Please help me out.
Please suggest another way if this doesn't work.

Thanks,
Sita.


#5

Thank you very much, Peter... it served my purpose.

That's great to hear :) If you have any scRUBYt!/scraping related
questions, don't hesitate to ask.

Cheers,
Peter




#6

Thank you very much, Peter... it served my purpose.


#7

Peter,
Where can I find some good material on scRUBYt!/Ruby? Are there any
sites you would recommend?

Thanks,
Sita.


#8

http://scrubyt.org - check out the older posts dealing with creating
scrapers for different pages.
Also check out the examples:
http://rubyforge.org/frs/download.php/46812/scrubyt-examples-0.4.05.tgz

More is on the way...

Cheers,
Peter




#9

See my other post…

Cheers,
Peter




#10

Hi Peter,

I need to fetch some information from http://www.ebay.in.
The fields I need are: the name of the product, its image, its price,
and the link to the product.

I am able to get the data using this method:
require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.ebay.in'
  fill_textfield 'satitle', 'ipod shuffle'
  submit

  # One record per row of the results table
  record "/html/body/div[2]/div[4]/div[2]/div/div/div[2]/div[2]/div/div/div[3]/div/div/table/tr" do
    name "/td[2]/div/a"
    price "/td[5]"
    image "/td/a/img" do
      url "src", :type => :attribute
    end
    link "/td[2]/div/a" do
      url "href", :type => :attribute
    end
  end
end

google_data.to_xml.write($stdout, 1)

My problem is that for some products it does not work properly (the div
structure may differ). Is there a better solution for this?

Thanks in advance,
Vipin
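
One possible direction for the brittle-path problem above: an absolute path such as /html/body/div[2]/... breaks as soon as any wrapper div is added or removed, whereas an XPath anchored on an attribute (the style suggested in post #2) does not depend on how deeply the table is nested. The sketch below illustrates the idea with plain Ruby and REXML rather than scRUBYt!. The class names "results" and "item", the column layout, and the helper name extract_records are invented for the example, since the real eBay markup is not shown in this thread; the page source would have to be inspected to find a stable attribute.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample page. The class names "results" and
# "item" are assumptions; they are not taken from the real site.
SAMPLE_PAGE = <<~HTML
  <html><body>
    <div><div>
      <table class="results">
        <tr><td>Image</td><td>Name</td><td>Price</td></tr>
        <tr class="item">
          <td><a href="/item/1"><img src="/img/1.jpg"/></a></td>
          <td><div><a href="/item/1">Sample product A</a></div></td>
          <td>100.00</td>
        </tr>
        <tr class="item">
          <td><a href="/item/2"><img src="/img/2.jpg"/></a></td>
          <td><div><a href="/item/2">Sample product B</a></div></td>
          <td>250.00</td>
        </tr>
      </table>
    </div></div>
  </body></html>
HTML

# Collect one hash per result row. The row XPath is anchored on class
# attributes, so it keeps matching if the surrounding divs change.
# (extract_records is an illustrative helper, not a library call.)
def extract_records(markup)
  doc = REXML::Document.new(markup)
  rows = REXML::XPath.match(doc, "//table[@class='results']/tr[@class='item']")
  rows.map do |row|
    link  = REXML::XPath.first(row, "td[2]/div/a")
    image = REXML::XPath.first(row, "td[1]/a/img")
    price = REXML::XPath.first(row, "td[3]")
    {
      :name  => link.text.strip,
      :url   => link.attributes['href'],
      :image => image.attributes['src'],
      :price => price.text.strip
    }
  end
end

extract_records(SAMPLE_PAGE).each do |r|
  puts "#{r[:name]} | #{r[:price]} | #{r[:url]} | #{r[:image]}"
end
```

The paths inside each row are relative to that row, so only the short, local part of the structure has to stay stable.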


#11

I also want to store the position of each result on the Google results
page. Example:
rank 1 - Title - url

How can I change the code to do this?

Regards,
Remco
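
A rough sketch of one way to add the rank, assuming the results come back as an array of hashes in page order with :link_title and :link_url keys, as in the to_hash loop in post #3. The sample data and the helper name ranked_lines are made up for illustration, and whether the order holds across several result pages has not been checked here.

```ruby
# Hypothetical sample shaped like the to_hash output used in post #3.
SAMPLE_RESULTS = [
  { :link_title => 'First result',  :link_url => 'http://example.com/first' },
  { :link_title => 'Second result', :link_url => 'http://example.com/second' },
  { :link_title => 'Third result',  :link_url => 'http://example.com/third' }
]

# Build one "rank N - title - url" line per result, counting from 1.
# (ranked_lines is an illustrative helper, not part of scRUBYt!.)
def ranked_lines(results)
  results.each_with_index.map do |result, index|
    "rank #{index + 1} - #{result[:link_title]} - #{result[:link_url]}"
  end
end

puts ranked_lines(SAMPLE_RESULTS)
```

In the script from post #3, the same each_with_index call could replace the plain each inside the file-writing block, with f.puts writing each ranked line.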