How to extract links of a particular class type

Sita_Rami_R · November 17, 2008, 7:21pm

I have a web page with a number of links.
The only way I can tell the links apart is by their class attribute.
I need to extract the links of a particular class, together with their titles.

I tried using the scRUBYt! extractor, but I have no idea where to specify the class.

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/'
  fill_textfield 'q', 'ruby'
  submit
  link "Ruby programming language" do
    url "href", :type => :attribute
  end
end
junk = google_data.to_xml

Also, how do I get the output in text/string format?

Sita_Rami_R · November 17, 2008, 8:03pm

By the way, you should get the newest scRUBYt!, 0.4.05, which does not
depend on RubyInline, Ruby2Ruby, ParseTree, etc.

What exactly would you like to do?

class: use an XPath like this: stuff "//td[@class='red']"
text/string: use to_hash instead of to_xml.
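
As a rough illustration of the class-based XPath idea, here is a minimal sketch that uses Ruby's REXML library instead of scRUBYt! itself, so it runs without fetching a page. The sample markup, the class names "l" and "nav", and the variable names are assumptions made up for this example; only the //tag[@class='...'] pattern comes from the advice above.

```ruby
require 'rexml/document'

# Hypothetical sample markup standing in for a fetched page.
html = <<~HTML
  <html><body>
    <a class="l" href="http://www.ruby-lang.org/">Ruby Programming Language</a>
    <a class="nav" href="/images">Images</a>
    <a class="l" href="http://rubyonrails.org/">Ruby on Rails</a>
  </body></html>
HTML

doc = REXML::Document.new(html)

# Keep only the anchors whose class attribute is exactly "l".
links = REXML::XPath.match(doc, "//a[@class='l']").map do |a|
  { title: a.text, url: a.attributes['href'] }
end

# Plain-text output: one "title - url" line per link.
lines = links.map { |link| "#{link[:title]} - #{link[:url]}" }
puts lines
```

The anchor with class "nav" is skipped because the predicate compares the whole class attribute value.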

HTH,
Peter

http://www.rubyrailways.com
http://scrubyt.org

Sita_Rami_R · November 17, 2008, 8:59pm

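require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  # URL of the Google results page for "gap inc" goes here
  fetch 'gap inc - Google Search'

  # Title of each result link (anchors with class "l") and, nested, its URL
  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
  # Follow the "Next" link, with a limit of 3
  next_page "Next", :limit => 3
end

# Write one "title - url" line per result
open("google_results.txt", 'w') do |f|
  google_data.to_hash.each do |result|
    f.puts "#{result[:link_title]} - #{result[:link_url]}"
  end
end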

produces:

Shop clothes for women, men, maternity, baby, and kids at gap.com … - http://www.gap.com/
Gap Inc. - http://www.gapinc.com/
Gap Inc. - Careers - http://www.gapinc.com/public/Careers/careers.shtml
The Gap Inc. News - The New York Times -
Gap Inc. - The New York Times
Gap (clothing retailer) - Wikipedia, the free encyclopedia -
Gap Inc. - Wikipedia
GPS: Summary for GAP INC - Yahoo! Finance -
The Gap, Inc. (GPS) Stock Price, News, Quote & History - Yahoo Finance
GPS - BloggingStocks - http://gps.bloggingstocks.com/
…
…

HTH,
Peter

Sita_Rami_R · November 17, 2008, 8:35pm

My program needs to do the following:
navigate to the Google site, enter "ruby" as the search text and click the
search button.
This gives the results page showing the first 10 results.

I would like to collect those 10 links and their titles and log them in
an output file.
Using the scRUBYt! extractor I got part of the way: all 10 links are
captured, but I am unable to get the titles.
I also only know how to extract the data in XML format,

but I need each title and its link on a single line.

My script goes here:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  # Perform the action(s)
  fetch 'http://www.google.com/'
  fill_textfield 'q', 'Gap Inc'
  submit
  # Construct the wrapper
  link "gap" do
    url "href", :type => :attribute
  end
  next_page "Next", :limit => 10
end
junk = google_data.to_xml
puts junk

Please help me out.
Please suggest another way if this doesn't work.

Thanks,
Sita.

Sita_Rami_R · November 17, 2008, 9:48pm

Thank you very much, Peter… it served my purpose.

That's great to hear. If you have any scRUBYt!/scraping related
questions, don't hesitate to ask.

Cheers,
Peter

Sita_Rami_R · November 17, 2008, 9:43pm

Thank you very much, Peter… it served my purpose.

Sita_Rami_R · November 18, 2008, 12:22am

Peter,
Where can I find some good material on scRUBYt!/Ruby? Any preferred
sites?

Thanks,
Sita.

Sita_Rami_R · November 18, 2008, 12:58am

http://scrubyt.org - check out the older posts dealing with creating
scrapers for different pages.
Also check out the examples:
http://rubyforge.org/frs/download.php/46812/scrubyt-examples-0.4.05.tgz

More is on the way…

Cheers,
Peter

Sita_Rami_R · December 5, 2008, 10:13am

See my other post…

Cheers,
Peter

Sita_Rami_R · December 5, 2008, 7:55am

Hi Peter,

I need to fetch some information from http://www.ebay.in.
The fields I need are: name of the product, image, price and the link
to that product.

I am able to get the data using this method:
require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.ebay.in'
  fill_textfield 'satitle', 'ipod shuffle'
  submit

  # One record per row of the results table
  record "/html/body/div[2]/div[4]/div[2]/div/div/div[2]/div[2]/div/div/div[3]/div/div/table/tr" do
    name "/td[2]/div/a"
    price "/td[5]"
    image "/td/a/img" do
      url "src", :type => :attribute
    end
    link "/td[2]/div/a" do
      url "href", :type => :attribute
    end
  end
end

google_data.to_xml.write($stdout, 1)

My problem is that for some products it does not work properly (the div
structure may be different). Is there a better solution for this?

Thanks in advance,
Vipin
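
One possible direction, sketched here with Ruby's REXML library on made-up markup rather than with scRUBYt! against the real site: anchor the XPath on an attribute of the results table instead of spelling out every div from /html/body down, so that changes in the surrounding div nesting do not break the pattern. The class name "results", the sample rows and the column positions are assumptions for illustration only; the actual eBay markup would need to be inspected.

```ruby
require 'rexml/document'

# Hypothetical sample markup; the real page structure will differ.
html = <<~HTML
  <html><body>
    <div><div>
      <table class="results">
        <tr>
          <td><a href="/item/1"><img src="/img/1.jpg"/></a></td>
          <td><div><a href="/item/1">iPod shuffle 1GB</a></div></td>
          <td>Rs. 2,500</td>
        </tr>
        <tr>
          <td><a href="/item/2"><img src="/img/2.jpg"/></a></td>
          <td><div><a href="/item/2">iPod shuffle 2GB</a></div></td>
          <td>Rs. 3,200</td>
        </tr>
      </table>
    </div></div>
  </body></html>
HTML

doc = REXML::Document.new(html)

# Locate rows through the table's class attribute, not an absolute path.
rows = REXML::XPath.match(doc, "//table[@class='results']/tr")

# Inside each row, use short paths relative to the row element.
records = rows.map do |row|
  link = REXML::XPath.first(row, "td[2]/div/a")
  {
    name:  link.text,
    link:  link.attributes['href'],
    image: REXML::XPath.first(row, "td[1]/a/img").attributes['src'],
    price: REXML::XPath.first(row, "td[3]").text
  }
end

records.each { |r| puts "#{r[:name]} - #{r[:price]} - #{r[:link]} - #{r[:image]}" }
```

Because the row lookup starts with //, wrapping the table in more or fewer div elements does not change the result, which is the property the absolute path lacks.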

Sita_Rami_R · February 5, 2009, 10:32pm

I also want to store the position of each result on the Google results page. Example:
rank 1 - Title - url

How can I change the code to do this?

Greetings,
remco
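
A minimal sketch of one way to add the rank, assuming the results arrive as an array of hashes with :link_title and :link_url keys (the shape the earlier script in this thread reads from google_data.to_hash) and that they are already in the order shown on the results pages. The sample data below is hypothetical and stands in for the real extractor output.

```ruby
# Hypothetical sample data in place of google_data.to_hash.
results = [
  { link_title: 'Gap Inc.', link_url: 'http://www.gapinc.com/' },
  { link_title: 'GPS - BloggingStocks', link_url: 'http://gps.bloggingstocks.com/' }
]

# with_index(1) numbers the results starting at 1 instead of 0.
ranked = results.each.with_index(1).map do |result, rank|
  "rank #{rank} - #{result[:link_title]} - #{result[:link_url]}"
end

puts ranked
```

In the earlier script, the same with_index(1) call could replace the plain each inside the open(...) block, with f.puts writing the ranked line.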