Scraping <table> from a website

cskilbeck November 19, 2007, 7:45pm 1

Hi,

I need to extract everything between <table> and </table> on a website
(there’s only one table on the page). So far I have:

require 'open-uri'
page = open('http://xxx.html').read
page.gsub!(/\n/, "")
page.gsub!(/\r/, "")
inner = page.scan(%r{.*<table.*>(.*)</table>.*}m)
print inner

but inner is empty - any ideas?

If I substitute line 2 with

page = '123<table>456</table>789'

I get inner = 456, which is correct.

cskilbeck November 19, 2007, 7:56pm 2

On Nov 19, 2007 1:45 PM, cskilbeck wrote:

Hi,

I need to extract everything between <table> and </table> on a website
(there’s only one table on the page). So far I have:

require 'open-uri'
page = open('http://xxx.html').read
page.gsub!(/\n/, "")
page.gsub!(/\r/, "")
inner = page.scan(%r{.*<table.*>(.*)</table>.*}m)

Untested, but try:

inner = page.scan(%r{.*<table[^>]*>(.*)</table>.*}m)

print inner

but inner is empty - any ideas?

If I substitute line 2 with

page = '123<table>456</table>789'

I get inner = 456, which is correct.

If you try a page where the table contains nested tags around the 456
(rows and cells, as a real table would), it will fail again.

You only want to match up to the next closing angle bracket. What’s
happening is that the second .* is matching the contents of the entire
table, up to the closing angle bracket of the last tag (probably </tr>)
right before the </table>, and inner gets only the leftover whitespace in
between. So inside the opening tag, match only characters that are NOT a
closing angle bracket.
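
The behaviour described above can be reproduced without fetching a page. This is a minimal sketch, assuming a made-up sample string with one row and one cell inside the table. The names sample, greedy and bounded are illustrative and not from the original post.

```ruby
# Hypothetical sample page: a table with nested row and cell tags.
sample = '123<table><tr><td>456</td></tr></table>789'

# Greedy ".*" after "<table" runs on to the last ">" that still lets
# "</table>" match, so the capture group is left with nothing.
greedy = sample.scan(%r{.*<table.*>(.*)</table>.*}m)

# "[^>]*" cannot cross a ">", so it stops at the end of the opening tag
# and the capture group receives the whole table body.
bounded = sample.scan(%r{.*<table[^>]*>(.*)</table>.*}m)

p greedy   # [[""]]
p bounded  # [["<tr><td>456</td></tr>"]]
```

Because the pattern contains a group, scan returns an array of arrays, one inner array per match.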

-Alex

cskilbeck November 19, 2007, 8:01pm 3

On Nov 19, 2007, at 3:45 PM, cskilbeck wrote:

print inner

but inner is empty - any ideas?

If I substitute line 2 with

page = '123<table>456</table>789'

I get inner = 456, which is correct.

Use the right tool for the job:

require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://xxx.html'))
table = doc.at('table')   # first <table> element on the page
puts table.inner_html

(not tested)
regards,

cskilbeck November 19, 2007, 8:17pm 4

On Nov 19, 12:41 pm, cskilbeck wrote:

print inner

but inner is empty - any ideas?

If I substitute line 2 with

page = '123<table>456</table>789'

I get inner = 456, which is correct.

inner = page[ %r{<table.*?>(.*?)</table>}mi, 1]

cskilbeck November 19, 2007, 10:01pm 5

On Nov 19, 7:14 pm, William J. wrote:

page = open('http://xxx.html').read

I get inner = 456, which is correct.

inner = page[ %r{<table.*?>(.*?)</table>}mi, 1]

Thanks all for your help. Non-greedy matching is the key.
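
For reference, here is a small sketch of the non-greedy form run on an inline string rather than a live page. The sample markup and the names html, inner and missing are assumptions for illustration. String#[] with a regex and a group index returns that capture, or nil when nothing matches.

```ruby
# Hypothetical markup; attributes and mixed case are included on purpose.
html = "<p>intro</p><TABLE border=\"1\">\n<tr><td>456</td></tr>\n</table><p>outro</p>"

# ".*?" takes as little as possible: the first one stops at the first ">",
# the second stops at the first "</table>". /m lets "." match newlines,
# /i ignores the case of the tag names.
inner = html[%r{<table.*?>(.*?)</table>}mi, 1]

p inner  # "\n<tr><td>456</td></tr>\n"

# No table on the page: the lookup yields nil instead of raising.
missing = "<p>no table here</p>"[%r{<table.*?>(.*?)</table>}mi, 1]
p missing  # nil
```

This holds for a page with a single, non-nested table, which is the case described in the original question. With nested tables the first </table> would end the match early.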

cskilbeck November 20, 2007, 8:15am 6

On Tue, 20 Nov 2007 04:00:35 +0900, Rolando A. wrote:

require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://xxx.html'))
table = doc.at('table')
puts table.inner_html

Amazing, I thought that the above would be a massive project, not what
appears to be pseudo-code! Not everything in Ruby is magically easy, but
the above is pretty good.

-Thufir
