Parse both string and url using Nokogiri xpath

soujiro0725 · May 12, 2013, 1:37am

ruby 1.9.3
nokogiri 1.5.5

Say, a web page has a link,

<a href="http://example.com">reference</a>

I would like to get both the url and text, “http://example.com” and
“reference”.

First, open the page that contains this link.

doc = Nokogiri::HTML(open(url))

then,

name = doc.xpath('//div…/a').text
url = doc.xpath('//div…/a/@href').text

It works, but the problem is that this queries the document twice, once
for the text and once for the URL.
If you want to apply the same procedure to many links on a
single page, that seems inefficient.

Is there any way to get both the URL and the text in a single pass? Something like

def parse_link_and_text(xpath)
…
end

p parse_link_and_text('//div…')

gives a hash

=> {'reference' => 'http://example.com'}

?

soujiro0725 · May 12, 2013, 10:46am

On Sun, May 12, 2013 at 1:37 AM, Soichi I. wrote:

Just search for the <a> element and go from there.

$ irb -r nokogiri
irb(main):001:0> dom = Nokogiri.HTML('<x><a href="link">text</a></x>')
=> #<Nokogiri::HTML::Document:0x434197c name="document"
children=[#<Nokogiri::XML::DTD:0x43411d4 name="html">,
#<Nokogiri::XML::Element:0x433df20 name="html"
children=[#<Nokogiri::XML::Element:0x433daac name="body"
children=[#<Nokogiri::XML::Element:0x433d48a name="x"
children=[#<Nokogiri::XML::Element:0x433cfee name="a"
attributes=[#<Nokogiri::XML::Attr:0x433b086 name="href" value="link">]
children=[#<Nokogiri::XML::Text:0x433be5a "text">]>]>]>]>]>
irb(main):002:0> node = dom.at_xpath '//a'
=> #<Nokogiri::XML::Element:0x433cfee name="a"
attributes=[#<Nokogiri::XML::Attr:0x433b086 name="href" value="link">]
children=[#<Nokogiri::XML::Text:0x433be5a "text">]>
irb(main):003:0> node[:href]
=> "link"
irb(main):004:0> node.text
=> "text"
irb(main):005:0>

Now, what is so difficult about that? You can easily find out more via
documentation.

Cheers

robert

soujiro0725 · May 12, 2013, 11:14am

Maybe using a temp variable?

links = doc.xpath('//div/a[@href]')
links.map { |x| [x.text, x['href']] }  # => [["reference", "http://example.com"]]

soujiro0725 · May 16, 2013, 7:51am

Thanks, both replies are helpful!