Parsing HTML code with regex

I’m trying to parse some HTML and count the number of times a match
occurs. The file is a large table with a ton of <tr> and
<tr ‘something’> tags. There are no spaces in the file. I’m trying to
count and print each <tr> and <tr ‘something’>.

I haven’t even gotten to counting my matches. I’m still working on
matching <tr> or <tr ‘anything’>.

I’ve done:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

but it catches everything on from the first <tr to the end of the line.
Any ideas?

-Shinkaku

Anthony W. wrote:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

You want:

if op_file =~ /<tr.*?>/

But see below.

but it catches everything on from the first <tr to the end of the line.

Also, try scanning for matches, like this:

#!/usr/bin/ruby -w

path = "path-to-HTML-page"

data = File.read(path)

array = data.scan(%r{<tr.*?>})

puts array
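For the counting part of the question, here is a minimal sketch that builds on the scan idea above. The sample string is made up for illustration, and the pattern uses \b and [^>]* instead of .*? so that it does not pick up other tags beginning with "tr" and never runs past the closing bracket. Treat it as one possible approach, not the only one.

```ruby
# Hypothetical sample standing in for the real file contents.
html = '<table><tr><td>a</td></tr><tr class="odd"><td>b</td></tr><tr><td>c</td></tr></table>'

# \b keeps the pattern from matching tags that merely start with "tr"
# (for example <track>); [^>]* stops at the first closing bracket.
tags = html.scan(/<tr\b[^>]*>/)

puts tags.size   # total number of opening <tr ...> tags

# Tally how often each distinct tag text occurs.
counts = Hash.new(0)
tags.each { |tag| counts[tag] += 1 }
counts.each { |tag, n| puts "#{tag}: #{n}" }
```

With the sample above, this prints 3, then one line per distinct tag with its count.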

Anthony W. wrote:

I’m trying to parse through some html code and count the number of times
a match happens.

If the code is not yet XHTML, use Tidy to upgrade it.

Then parse it with XPath, looking for your match.

(Tip: All HTML that you control should be XHTML, of the highest quality.
Don’t rely on sloppy HTML and “browser forgiveness”!)
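As a rough sketch of the XPath route, assuming the page has already been cleaned into well-formed XHTML (for example by Tidy), something like the following could count the rows. It uses REXML, which ships with Ruby (as a bundled gem in recent versions). The XHTML fragment is invented for illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed fragment standing in for the tidied page.
xhtml = <<~XHTML
  <table>
    <tr><td>a</td></tr>
    <tr class="odd"><td>b</td></tr>
  </table>
XHTML

doc = REXML::Document.new(xhtml)

# //tr selects every tr element at any depth, whatever its attributes.
rows = REXML::XPath.match(doc, '//tr')

puts rows.size
# Print each row's class attribute (nil when the row has none).
rows.each { |row| p row.attributes['class'] }
```

Because the parser works on elements and not on raw text, attributes and line breaks inside a tag do not affect the count.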

Hi –

On Wed, 25 Oct 2006, Michael P. wrote:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

You are always parsing only one line.
Perhaps you mean a regular expression like

/(<tr([^>]*?)>)+/m

The /m doesn’t make any difference there, because you’re not using the
wildcard dot. /m just adds \n to the dot class.

David
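A small check of the point about /m, using a made-up tag that is split across two lines. The flag only matters for patterns that use the dot. A negated character class such as [^>] matches newlines either way.

```ruby
# Hypothetical input: one tag broken across two lines.
text = "<tr\nclass='odd'>"

dot_plain   = text.match?(/<tr.*?>/)     # dot stops at "\n"
dot_multi   = text.match?(/<tr.*?>/m)    # /m lets dot match "\n"
class_plain = text.match?(/<tr[^>]*>/)   # [^>] already matches "\n"
class_multi = text.match?(/<tr[^>]*>/m)  # so /m changes nothing here

p dot_plain    # false
p dot_multi    # true
p class_plain  # true
p class_multi  # true
```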

Also, try scanning for matches, like this:

#!/usr/bin/ruby -w

path = "path-to-HTML-page"

data = File.read(path)

array = data.scan(%r{<tr.*?>})

puts array

Thanks, this worked.

Anthony W. wrote:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

You are always parsing only one line.
Perhaps you mean a regular expression like

/(<tr([^>]*?)>)+/m

Anyway, I am not sure the if… is the right construct. Don’t you want
the return value of the match (String#match rather than =~, which only
returns the match position)? It gives you a MatchData object, from
which you can get the results as an array or similar.

MP
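To illustrate the MatchData suggestion, here is a short sketch with an invented one-row table. Note that String#match only reports the first match, so for counting every tag, scan is still the better fit.

```ruby
# Hypothetical sample input.
html = '<table><tr class="odd"><td>a</td></tr></table>'

# String#match returns a MatchData object, or nil when nothing matches.
m = html.match(/<tr([^>]*)>/)

if m
  puts m[0]         # whole match: <tr class="odd">
  puts m[1]         # first capture group: the text after "<tr"
  p m.to_a          # whole match plus captures, as an array
  puts m.pre_match  # text before the match: <table>
end
```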
