Mechanize and XPath


#1

Is there a way to select links in a scraped mechanize page using XPath
selectors?

For example, all links in the second TABLE on the page.

I know it is possible with hpricot, but I need the links to be usable by
mechanize.


#2

On 2008.10.15., at 19:08, Ruby N. wrote:

Is there a way to select links in a scraped mechanize page using XPath
selectors?

For example, all links in the second TABLE on the page.

I know it is possible with hpricot, but I need the links to be usable by
mechanize.

From the Mechanize guide
(http://mechanize.rubyforge.org/mechanize/files/GUIDE_txt.html):

Mechanize uses hpricot to parse html. What does this mean for you? You
can treat a mechanize page like an hpricot object. After you have used
Mechanize to navigate to the page that you need to scrape, then scrape
it using hpricot methods:
agent.get('http://someurl.com/').search("//p[@class='posted']")

HTH,
Peter


#3

Wait a minute, it says the total opposite on the Mechanize page. But it
definitely explains why it’s not being friendly with nokogiri…

http://mechanize.rubyforge.org/mechanize/

Mechanize uses nokogiri to parse html. What does this mean for you? You
can treat a mechanize page like an nokogiri object. After you have used
Mechanize to navigate to the page that you need to scrape, then scrape
it using nokogiri methods:

agent.get('http://someurl.com/').search(".//p[@class='posted']")


#4

.search("//a")