Hpricot html parsing

dhanasekara · December 13, 2006, 2:04pm

Hi all,
I have the following HTML fragment. I want to get the inner HTML of one <p> tag, but not the content of the other <p> tag. In the example below I want the result "this is fun"; I don't want the result to include "NO FUN". How can I do this with Hpricot?

example html fragment:

this is fun

NO FUN

thanks in advance,
dhanasekaran

dhanasekara · December 13, 2006, 2:12pm

Dhanasekaran V. wrote:

this is fun

NO FUN

I did not quite get you. You want the text of the first <p> because it
has an image? Or what is the exact criterion to accept/reject <p>'s?

Peter

__
http://www.rubyrailways.com

dhanasekara · December 13, 2006, 2:41pm

Yes, I want the text of the first <p> because it has an image, and I
want to reject a <p> if it has no image.
thanks,
Dhanasekaran

dhanasekara · December 13, 2006, 3:40pm

You can try something like this:

if p.search("img").length > 0
  puts p.inner_html
end

dhanasekara · December 13, 2006, 3:00pm

Dhanasekaran V. wrote:

yes, I want the text of the first <p> because it has an image,
and reject a <p> if it has no image.
thanks,

I see. Try this:
===============================================
require 'rubygems'
require 'hpricot'

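# Sample markup reconstructed from the output below; image names are placeholders.
doc = Hpricot %q{
  <p><img src="fun.png" />this is fun</p>
  <p>NO FUN</p>
  <p><img src="again.png" />fun again!</p>
  <p>NO FUN AT ALL!</p>
}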

paragraphs = doc/'p'

good_elems = paragraphs.map.reject { |elem| (elem/"img").empty? }
good_elems.each { |elem| puts elem.inner_text.strip }

output:

this is fun
fun again!

You will need hpricot 0.4.84 because of inner_text - if you don’t want
to install it (I did not experience any difficulties, so I can recommend
it) then you have to roll your own inner_text, but I guess this is not a
big problem.

Cheers,
Peter

dhanasekara · December 14, 2006, 12:30am

Peter S. wrote:

paragraphs = doc/'p'

good_elems = paragraphs.map.reject { |elem| (elem/"img").empty? }

Which once again makes me wish paragraphs = doc/'//p[img]/text()'
worked. This could be doable if you asked Hpricot to provide you with
the REXML document (it's probably out of scope for the intentionally
simple XPath engine Hpricot uses natively), but unfortunately I can't
for the life of me figure out how to make REXML accept the final
/text(), even though the parser claims to support XPath 1.0 with a few
exceptions, and this is not listed among them.
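
For comparison, here is a minimal sketch of the //p[img]/text() query run through REXML directly. It assumes a recent REXML release (where the text() node test appears to be accepted) and well-formed input; the sample markup and image names are made up for illustration and are not from the thread.

```ruby
require 'rexml/document'

# Hypothetical sample input; REXML needs well-formed XML, hence the wrapping <div>.
xml = <<~XML
  <div>
    <p><img src="fun.png"/>this is fun</p>
    <p>NO FUN</p>
    <p><img src="again.png"/>fun again!</p>
    <p>NO FUN AT ALL!</p>
  </div>
XML

doc = REXML::Document.new(xml)

# Select the text-node children of every <p> that has an <img> child.
texts = REXML::XPath.match(doc, '//p[img]/text()').map { |node| node.value.strip }

puts texts
```

The predicate [img] does the filtering that the map/reject pair does in the Hpricot version, so no Ruby-side loop over the paragraphs is needed.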

David V.

dhanasekara · December 13, 2006, 6:20pm

Dhanasekaran V. wrote:

yes, I want the text of the first

because it
has an image. and reject if

has no image.

Hpricot might be able to do this, but you can also do it on your own,
and know why the solution works.

#!/usr/bin/ruby -w

data = File.read("test.html")

# Capture the text of each <p> in which the text is followed by an <img .../> before </p>.
array = data.scan(%r{<p>([^<]+?)<img .*?/></p>})

p array

Input text:

don't want this text

want this text

don't want this text either

want this text too

Output:

[["want this text"], ["want this text too"]]
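
A self-contained sketch of the same regex idea, with the input held in a string instead of test.html. The sample markup is an assumption (the tags of the input shown above were lost), and the added \s* before </p> is only a small tolerance for whitespace, not part of the original pattern.

```ruby
# Hypothetical sample input: the text comes first, the image after it.
html = <<~HTML
  <p>don't want this text</p>
  <p>want this text<img src="a.png" /></p>
  <p>don't want this text either</p>
  <p>want this text too<img src="b.png" /></p>
HTML

# [^<]+? cannot cross a tag boundary, so a <p> without an <img> never matches.
pattern = %r{<p>([^<]+?)<img .*?/>\s*</p>}

wanted = html.scan(pattern)
p wanted
```

This only works while the markup keeps exactly this shape (text first, then a self-closing img, no nested tags); anything less regular is where a parser such as Hpricot earns its keep.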

dhanasekara · December 15, 2006, 7:55pm

Ask:

http://code.whytheluckystiff.net/hpricot/ticket/32

text in xpath should return a text node if present. For example:
(doc/"/html/body/div[1]//table[0]/tr[0]//b[9]/text")

Currently I am using the search and next_node:

doc.search("/html/body/div[1]/*/table[0]/tr[0]/td/b") { |x|
  @movie_plot = x.next_node.to_s.strip if x.inner_html == "Plot Outline:"
}

And received:

Author:
why
Message:

* lib/hpricot/elements.rb: added support for selecting text nodes with text(): //p/text(), //p[a]//text(), etc.
* lib/hpricot/traverse.rb: ditto.
* lib/hpricot/tag.rb: the pathname method reports the path fragment needed to get to this node.
* lib/hpricot/parse.rb: handle possible empty processing instruction.
http://code.whytheluckystiff.net/hpricot/changeset/87

dhanasekara · December 16, 2006, 9:16pm

On 12/16/06, David V. [email protected] wrote:

something that doesn’t cause a mental namespace clash?

Sorry, I have been archiving ruby talk at [email protected] since
10/14/04.

Stephen B. IV

dhanasekara · December 18, 2006, 8:48am

Thanks Peter,
Your solution worked. I just wanted to know where I can find the
syntax for Hpricot, like the one you used in your solution?

thanks,
dhanasekaran

dhanasekara · December 16, 2006, 6:19pm

ruby talk wrote:

Ask:

http://code.whytheluckystiff.net/hpricot/ticket/32

text in xpath should return a text node if present. For example:
(doc/"/html/body/div[1]//table[0]/tr[0]//b[9]/text")

Well, it’s ‘text()’ not ‘text’. Luckily _why noticed.

   * lib/hpricot/elements.rb: added support for selecting text
nodes with text(): //p/text(), //p[a]//text(), etc.

W00t

Thanks for pointing this out.

David V.

PS: Your email address name confuses the heck out of me. Please use
something that doesn’t cause a mental namespace clash?

dhanasekara · December 18, 2006, 10:22am

Dhanasekaran V. wrote:

Thanks Peter,
Your solution worked. I just wanted to know where I can find the
syntax for Hpricot, like the one you used in your solution?

Hmm, apart from what can be found on the Hpricot page, I am using:

1. rdoc, ri
2. p SomeHpricotClass.methods.sort
3. my kind-of-decent XPath knowledge
4. source code browsing (you don't have to be a pro (I am a newbie
   myself) and you can get a surprising amount from there)
5. common sense
6. ruby mailing list

Roughly in this order… A cheatsheet or something would be handy…
maybe there is already one somewhere?
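
The introspection trick from the list can be sketched like this. Array is used purely as a stand-in so the snippet runs without Hpricot installed; with the gem loaded one would presumably inspect a class such as Hpricot::Elements instead. It also uses instance_methods rather than methods, on the assumption that the methods available on parsed objects are what one is usually after.

```ruby
# Array is only a stand-in; with Hpricot loaded you might inspect Hpricot::Elements instead.
klass = Array

# instance_methods(false) lists the methods defined on the class itself, not inherited ones.
own_methods = klass.instance_methods(false).sort

# Filter the list by a keyword to narrow it down.
puts own_methods.grep(/each/)
```

From there, ri or the source file for a promising method name fills in the details.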

Cheers,
Peter

__
http://www.rubyrailways.com

dhanasekara · December 18, 2006, 12:30pm

Quoting Peter S. [email protected]:

Roughly in this order… A cheatsheet or something would be handy…
maybe there is already one somewhere?

http://code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions ?

Also, I'd take that in preference to point 3: using an XPath-ish sort
of query and then hitting a syntax element the implementation happens
not to understand is rather infuriating. (Argh, REXML not supporting
text() in a POLS way, if at all.)

David V.
