Parsing HTML using regexes and arrays

soldier.coder · November 7, 2008, 10:10pm

I have a nice little regex to pull the information-rich guts from a
table…

%r{</thead.*?>(.*?)</table>}m =~ html

$1 now contains all the rows of the table as one long string.

I’d like to turn that into an array of rows, but I am not exactly sure
how.

Additionally, I’d like to process the rows so that I can get data from
between the nth <td></td> pair.

Any help?

soldier.coder · November 7, 2008, 11:36pm

On Fri, Nov 7, 2008 at 3:08 PM, soldier.coder wrote:

between the nth <td></td> pair.

Any help?

If you have a string with a repeating pattern that you want an array
of, String#scan is your man.

irb(main):001:0> html = "<td>foo</td><td>bar</td>"
=> "<td>foo</td><td>bar</td>"
irb(main):002:0> a = html.scan(/<td>(.+?)<\/td>/)
=> [["foo"], ["bar"]]

Hmmm, that’s sort of ugly.

irb(main):003:0> a = html.scan(/<td>(.+?)<\/td>/).flatten
=> ["foo", "bar"]

Much better.
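
To get from there to an array of rows, and then to the nth cell of each row, the same scan trick can be applied twice: once for the <tr> rows and once for the <td> cells inside each row. This is only a rough sketch. The sample table and the nth_cell helper are made up for illustration, and it assumes simple, well-formed markup with no nested tables.

```ruby
# Hypothetical sample input; the table markup is an assumption for illustration.
html = <<~HTML
  <table>
    <thead><tr><th>Name</th><th>Qty</th></tr></thead>
    <tbody>
      <tr><td>foo</td><td>1</td></tr>
      <tr><td>bar</td><td>2</td></tr>
    </tbody>
  </table>
HTML

# Grab everything between the end of the header and the end of the table.
body = html[%r{</thead.*?>(.*?)</table>}m, 1]

# One array element per row; each row is an array of its cell contents.
rows = body.scan(%r{<tr.*?>(.*?)</tr>}m).flatten.map do |row|
  row.scan(%r{<td.*?>(.*?)</td>}m).flatten
end

# nth_cell is a hypothetical helper: the nth (1-based) cell of every row.
def nth_cell(rows, n)
  rows.map { |cells| cells[n - 1] }
end

p rows               # => [["foo", "1"], ["bar", "2"]]
p nth_cell(rows, 2)  # => ["1", "2"]
```

Using %r{...} instead of /.../ avoids having to escape the slash in the closing tags. The m flag lets . match newlines, which matters as soon as a row spans more than one line.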

Ad hoc regexes are fine for quick-n-dirty scripting. But if you’re
serious about parsing HTML you might want to look into Hpricot or
Nokogiri.

-Michael L.