Need help parsing HTML with Hpricot

I’m having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I’m trying to parse HTML of the following nature:

This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it’s just more of the same<br />

I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. In particular, I don’t know how to single out the lines of
text in between the “br” tags. This is important because I need to know
where the line breaks are in the text, as well as where new paragraphs
start. Does anyone know how to do this with Hpricot?
Thank you…

You can try each_child.

I will use each_child_with_index to show you what I mean:

Put your raw HTML text into @text

@parsed_html = Hpricot(@text)
@parsed_html.each_child_with_index do |c, i|
  puts "Line #{i}: #{c.to_s.strip}"
end

Produces:

Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it’s just more of the same
Line 11: <br />
Line 12:
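
To see why the listing alternates the way it does, here is a rough plain-Ruby emulation of the child list. It does not use Hpricot at all; it only splits the markup on br tags and labels each piece. The sample markup and the helper names (sample_html, label_children) are assumptions made for illustration, not the original poster's input or part of any library.

```ruby
# Assumed sample markup, based on the question's description.
def sample_html
  <<~HTML
    This is one line of text<br />
    This is another line of text<br />
    It keeps going on like this<br />
    <br />
    Until a new paragraph is started<br />
    Otherwise, it's just more of the same<br />
  HTML
end

# Split markup into [kind, content] pairs, where kind is :text or :break.
def label_children(html)
  # A capture group in the pattern makes String#split keep the separators.
  html.split(/(<br\s*\/?>)/i).map do |token|
    kind = token.match?(/\A<br\s*\/?>\z/i) ? :break : :text
    [kind, token.strip]
  end
end

label_children(sample_html).each_with_index do |(kind, content), i|
  puts "Line #{i}: #{kind} #{content.inspect}"
end
```

Every odd entry is a break, and an empty text entry sitting between two breaks is the only sign that a new paragraph starts there.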

Hope that helps.

Mikel

2007/10/25, Just Another Victim of the Ambient M.:

I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I’m trying to parse HTML of the following nature:

Try http://code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
for examples and some better documentation. It helped me a lot to
solve my problems.

Of course… you could also do:

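require 'rubygems'
require 'hpricot'

# Sample input; the <br /> tags are assumed from the question.
text = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

class String
  # True when the string is nothing but a br tag.
  def not_needed?
    !!(strip =~ /\A<br\s*\/?>\z/i)
  end
end

@parsed_html = Hpricot(text)
@paragraphs = Array.new
@parsed_html.each_child_with_index do |c, i|
  line = c.to_s.strip
  if line == ""
    # An empty text node between two br tags ends the paragraph:
    # print what was collected, then a blank line.
    puts @paragraphs.join.strip
    puts
    @paragraphs.clear
  else
    @paragraphs << "#{line} " unless line.not_needed?
  end
end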

Which produces:

This is one line of text This is another line of text It keeps going on like this

Until a new paragraph is started Otherwise, it's just more of the same

Now… don’t pick on my favorite HTML parser again! :) Just ask nicely :)

Mikel
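
If you need to keep both the line breaks and the paragraph boundaries, rather than flattening each paragraph into one string, a minimal sketch along the same lines is to collect an array of paragraphs, each holding its own lines. This version is plain Ruby with no Hpricot involved, and it assumes that a blank entry between two br tags marks a new paragraph. The method name paragraphs_from and the sample markup are made up for illustration.

```ruby
# Group br-separated text into paragraphs, each a list of lines.
# Assumption: an empty chunk between two br tags is a paragraph boundary.
def paragraphs_from(html)
  paragraphs = [[]]
  html.split(/<br\s*\/?>/i).each do |chunk|
    line = chunk.strip
    if line.empty?
      # Start a new paragraph, but never stack up empty ones.
      paragraphs << [] unless paragraphs.last.empty?
    else
      paragraphs.last << line
    end
  end
  paragraphs.reject(&:empty?)
end

# Assumed sample markup, based on the question's description.
html = <<~HTML
  This is one line of text<br />
  This is another line of text<br />
  It keeps going on like this<br />
  <br />
  Until a new paragraph is started<br />
  Otherwise, it's just more of the same<br />
HTML

paragraphs_from(html).each_with_index do |lines, i|
  puts "Paragraph #{i + 1}:"
  lines.each { |line| puts "  #{line}" }
end
```

A regex split like this is only reasonable for markup this flat; for anything nested, walking the parsed children as shown earlier in the thread is the safer route.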