I’m considering a side-project where I’d have to scrape a couple of
websites and aggregate the results. So far I’ve done a couple of
experiments with html-parser, but I’m not really happy with it. I’d
hate to just throw regexps at the HTML, so I’m really looking for an
elegant way to select the correct data from the page.
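# The require lines are assumed; the original post did not show them.
require 'rubygems'     # load RubyGems first so the gem can be found (Ruby 1.8)
require 'rubyful_soup'
require 'net/http'

def read
  Net::HTTP.start("news.bbc.co.uk", 80) do |h|
    response = h.get("/sport1/hi/football/eng_prem/fixtures/default.stm")
    # p response
    s = BeautifulSoup.new(response.body) # <- fails
    p s.find_all('div', :attrs => { 'class' => 'mvb' })
  end
end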
I’m not sure what I’m doing wrong, and none of the documentation
covers gem usage.
Sorry if this is me being thick at the end of a Friday…
I know how to use Watir. It’s great, but it’s not the right approach
for this application: I want to request a page from a remote source and
extract data from it. Rubyful Soup seems like the way to go, but for
some reason I’m having difficulty with the code. Watir is good for
driving a browser, which I’m not interested in for this application.
Look at SWExplorerAutomation (www.webunittesting.com)
The program creates an automation API for any Web application which
uses HTML and DHTML and works with Microsoft Internet Explorer. The Web
application becomes programmatically accessible from any .NET language.
SWEA API provides access to Web application controls and content. The
API is generated using SWEA Visual Designer. SWEA Visual Designer helps
create programmable objects from Web page content.
Perhaps you’ve misunderstood my intentions. I want to scrape a website
(BBC News for example) and extract some data from the HTML returned. I
want to use Ruby to do this and I also want to avoid using regular
expressions to manually parse the HTML myself.
Forgive me if I’m wrong, but your response seems to be an advert for an
automation product for .NET.
Someone else has already suggested RubyfulSoup, which I’ve had some
success with, and I’m moving ahead with this for now.
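For anyone following along, here is a minimal sketch of the kind of selection I mean: picking elements by tag and class attribute without regular expressions. It uses REXML and XPath rather than Rubyful Soup, so it is not the Rubyful Soup API. The sample markup and the helper name div_texts are made up for illustration; only the 'mvb' class name comes from my earlier snippet. REXML only accepts well-formed markup, so a real page may need tidying first or a more forgiving parser.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a fetched page. The real BBC markup
# is not reproduced here; the team names are placeholders.
sample_html = <<-HTML
<html>
  <body>
    <div class="mvb">Team A v Team B</div>
    <div class="other">Something else</div>
    <div class="mvb">Team C v Team D</div>
  </body>
</html>
HTML

# Return the stripped text of every <div> whose class attribute equals
# class_name. The class name is interpolated into the XPath expression, so
# this sketch assumes it contains no quote characters.
def div_texts(html, class_name)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//div[@class='#{class_name}']").map do |div|
    div.text.to_s.strip
  end
end

p div_texts(sample_html, 'mvb')
```

The same helper could be handed response.body from Net::HTTP, provided the page parses as XML.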
You don’t have to use regular expressions to extract data. SWEA
works with XML and has XpathDataExtractor and TableDataExtractor to
simplify the data extraction.
You can visually define the extraction rules using them.
You can use Ruby.Net for automation scripts, and I like .NET.
SWEA supports frames, JavaScript, popup windows, Windows and HTML
dialog boxes, file and image downloads with cookies, etc. SWEA can
also run under a Windows service account.
SWEA has been used in many data scraping solutions with great
success. Look at SWJobSearch. I wrote it in a few days. Try to
write it using RubyfulSoup.
Good luck with RubyfulSoup!