Need help parsing HTML with Hpricot

maz · October 25, 2007, 9:00am

I’m having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I’m trying to parse HTML of the following nature:

This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />

I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don’t know how to single out the lines of
text in between the “br” tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
Does anyone know how to do this with Hpricot?
Thank you…

maz · October 25, 2007, 9:37am

You can try each_child.

I will use each_child_with_index to show you what I mean:

Put your raw HTML text into @text

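require 'rubygems'
require 'hpricot'

@parsed_html = Hpricot(@text)
# Walk the top-level nodes; runs of text and br elements alternate
@parsed_html.each_child_with_index do |c, i|
  puts "Line #{i}: #{c.to_s.strip}"
end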

Produces:

Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it's just more of the same
Line 11: <br />
Line 12:
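
The alternation in that listing (a run of text, then a break tag, with an empty run wherever two breaks sit back to back) can be reproduced without Hpricot, which may help when reasoning about the indexes. This is only a rough plain-Ruby sketch: it uses String#split with a capturing regex instead of a real parser, and it assumes the breaks are written as <br /> or <br> and that a doubled break starts a new paragraph. SAMPLE_HTML, BR_TAG and split_on_breaks are made-up names for illustration.

```ruby
# Plain-Ruby stand-in for the child walk above; no Hpricot involved.
# Assumption: line breaks in the markup are written as <br /> (or <br>),
# and a doubled break marks the start of a new paragraph.
SAMPLE_HTML = <<~HTML
  This is one line of text<br />
  This is another line of text<br />
  It keeps going on like this<br />
  <br />
  Until a new paragraph is started<br />
  Otherwise, it's just more of the same<br />
HTML

# The capture group makes split keep the tags in its result.
BR_TAG = /(<br\s*\/?>)/i

# Returns the stripped pieces, alternating between text and break tags.
def split_on_breaks(html)
  html.split(BR_TAG).map(&:strip)
end

split_on_breaks(SAMPLE_HTML).each_with_index do |piece, i|
  puts "Line #{i}: #{piece}"
end
```

With this sample, entries 6 and 12 come back as empty strings, and those are the spots a paragraph check can key on.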

Hope that helps.

Mikel


maz · October 25, 2007, 9:48am

2007/10/25, Just Another Victim of the Ambient M.:

> I'm having trouble understanding Hpricot (thanks to an abominable lack
> of documentation). I’m trying to parse HTML of the following nature:

Try http://code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
for examples and some better documentation. It helped me a lot to
solve my problems.

maz · October 25, 2007, 9:50am

Of course… you could also do:

require 'rubygems'
require 'hpricot'

text = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

class String
  # True when the node is nothing but a line-break tag
  def not_needed?
    self.strip == "<br />"
  end
end

@parsed_html = Hpricot(text)
@paragraphs = Array.new
@parsed_html.each_child_with_index do |c, i|
  line = c.to_s.strip
  if line == ""
    # An empty text node between two breaks ends the current paragraph
    puts "<p>#{@paragraphs.join}</p>"
    @paragraphs.clear
  else
    @paragraphs << "#{line} " unless line.not_needed?
  end
end

Which produces:

<p>This is one line of text This is another line of text It keeps going on like this </p>
<p>Until a new paragraph is started Otherwise, it's just more of the same </p>
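
If you need to keep the individual lines as well as the paragraphs, one option is to collect each paragraph as an array of lines rather than one joined string. What follows is a minimal sketch in plain Ruby, not Hpricot: it splits on the break tags with a regex, assumes breaks are written as <br /> or <br>, and treats an empty stretch between two breaks as a paragraph boundary. paragraphs_from is a hypothetical helper name and the sample string is made up.

```ruby
# Groups the text between break tags into paragraphs, one array of lines each.
# Assumptions: breaks are written <br /> (or <br>), and an empty stretch
# between two breaks marks the end of a paragraph.
def paragraphs_from(html)
  paragraphs = []
  current = []
  html.split(/<br\s*\/?>/i).each do |piece|
    line = piece.strip
    if line.empty?
      # Paragraph boundary: store what we have and start over
      paragraphs << current unless current.empty?
      current = []
    else
      current << line
    end
  end
  # Flush any text that was not followed by a boundary
  paragraphs << current unless current.empty?
  paragraphs
end

# Made-up sample input, shorter than the one in the post
sample = "one<br />\ntwo<br />\n<br />\nthree<br />\n"
p paragraphs_from(sample) # prints [["one", "two"], ["three"]]
```

Each inner array holds the lines of one paragraph, so both the line breaks and the paragraph breaks survive.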

Now… don’t pick on my favorite HTML parser again! Just ask nicely

Mikel