Html parsing using regular expressions

shinkaku · October 25, 2006, 4:57am

I’m new to Ruby and trying to use regular expressions to parse an html
file. The page is a large table with no spaces in the html code. I want
to count the number of times <tr> or <tr ‘anything’> occurs. I’m stuck
on trying to match every variety of <tr>.

I’ve tried

op_file = File.read(htmlfile)
if op_file =~ /(<tr(.*?)>)+/

but it catches the first <tr and matches all the way to the end of the
file. Anyone have any advice on matching and counting?

-Shinkaku

shinkaku · October 25, 2006, 7:03am

On 10/24/06, Anthony W. wrote:

I’m new to Ruby and trying to use regular expressions to parse an html
file.

Don’t. Use Hpricot instead. Your brain will thank you for it.

I haven’t used Hpricot, but I’ve heard great things about it; I’ve
tried to do HTML parsing with regexen, and it’s a mook’s game.

-austin

shinkaku · October 25, 2006, 7:21am

Anthony W. wrote:

but it catches the first <tr and matches all the way to the end of the
file. Anyone have any advice on matching and counting?

You need to tell us whether you have read the replies you received to
this same question when you asked it eight hours ago. I answered your
question, and so did several others, but you have not given any
indication that you saw the replies.

Here is one answer:

#!/usr/bin/ruby -w

path = "path-to-HTML-page" # plain ASCII quotes; curly quotes are a syntax error in Ruby

data = File.read(path)

array = data.scan(%r{<tr.*?>})

puts array.size # gives a count of occurrences

puts array # shows the matches
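
One caveat with the pattern above: <tr.*?> would also match tags such as <track>, and . does not cross newlines unless the m flag is set. A slightly stricter variant might look like the sketch below. The helper name count_tr_tags and the sample string are made up for illustration, and a regex is still only an approximation of real HTML parsing.

```ruby
# Sketch only: count_tr_tags is a hypothetical helper, not part of the answer above.
def count_tr_tags(html)
  # \b keeps tags like <track> from matching; [^>]* stays inside a single tag
  # (and, unlike ., also matches newlines); the i flag accepts <TR> as well.
  html.scan(/<tr\b[^>]*>/i).size
end

# Made-up sample input: two opening row tags, two closing tags, one <track> tag.
sample = '<table><tr><td>a</td></tr><TR class="odd"><td>b</td></tr><track src="x"></table>'
puts count_tr_tags(sample)
```

String#scan returns every non-overlapping match, so taking its size gives the count directly, whereas =~ only reports the position of the first match.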

Please read replies before posting again.