Forum: Ruby Parsing HTML using regexes and arrays.

soldier.coder (Guest)
on 2008-11-07 23:10
(Received via mailing list)
I have a nice little regex to pull the information-rich guts from a

%r{</thead.*?>(.*?)</table>}m =~ html
# $1 now contains all the rows of the table as one long string.
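
To make it concrete, here is a rough sketch of what that match captures. The sample html string is made up for illustration (the real page's markup isn't shown here), and rows_string is just a name I picked for the capture.

```ruby
# Hypothetical sample input; an assumption, not the real page.
html = <<~HTML
  <table>
    <thead><tr><th>Name</th><th>Qty</th></tr></thead>
    <tr><td>foo</td><td>1</td></tr>
    <tr><td>bar</td><td>2</td></tr>
  </table>
HTML

# Same regex as above; the m flag lets . match newlines.
rows_string = nil
if %r{</thead.*?>(.*?)</table>}m =~ html
  rows_string = $1  # everything between </thead> and </table>
end
```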

I'd like to turn that into an array of rows, but I am not exactly sure

Additionally, I'd like to process the rows so that I can get data from
between the nth <td></td> pair.

Any help?
Michael L. (Guest)
on 2008-11-08 00:36
(Received via mailing list)
On Fri, Nov 7, 2008 at 3:08 PM, soldier.coder
<removed_email_address@domain.invalid> wrote:
> between the nth <td></td> pair.
> Any help?

If you have a string with a repeating pattern that you want an array
of, String#scan is your man.

irb(main):001:0> html = "<td>foo</td><td>bar</td>"
=> "<td>foo</td><td>bar</td>"
irb(main):002:0> a = html.scan(/<td>(.+?)<\/td>/)
=> [["foo"], ["bar"]]

Hmmm, that's sort of ugly.

irb(main):003:0> a = html.scan(/<td>(.+?)<\/td>/).flatten
=> ["foo", "bar"]

Much better.
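
The same trick could be applied in two passes to get an array of rows and then the nth cell of each row. This is only a sketch: rows_string, the sample markup, and n are assumptions for illustration, and it will fall over on nested tables or otherwise unusual markup.

```ruby
# Hypothetical stand-in for the string captured in $1 by the table regex.
rows_string = "<tr><td>foo</td><td>1</td></tr>\n<tr><td>bar</td><td>2</td></tr>"

# One array element per row; the m flag lets . span newlines inside a row.
rows = rows_string.scan(%r{<tr.*?>(.*?)</tr>}m).flatten

# For each row, collect the contents of every <td>...</td> pair.
cells = rows.map { |row| row.scan(%r{<td.*?>(.*?)</td>}m).flatten }

# Zero-based index of the wanted column (an assumed value for this example).
n = 1
nth = cells.map { |row_cells| row_cells[n] }
```

Here cells is an array of arrays, one inner array per row, so cells[0][1] would be the second cell of the first row.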

Ad hoc regexes are fine for quick-n-dirty scripting. But if you're
serious about parsing HTML you might want to look into Hpricot or

-Michael L.