Bil_K
December 17, 2006, 4:46pm
1
OK, so I haven’t done this in years.
What’s the “modern” way of grabbing the data off
a webpage, e.g.,
http://yorkcountyschools.org/mves/arlist/3-3.4.htm
My initial attempt has been focused on Hpricot:
require 'rubygems'
require 'open-uri'
require 'hpricot'
doc = Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))
and I can find doc/"th" and doc/"tr", but what’s
the best way to cram them into an array of structs
or something?
Thanks,
Bil_K
December 17, 2006, 5:24pm
2
On 12/17/06, Bil K. wrote:
require 'open-uri'
require 'hpricot'
doc = Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))
and I can find doc/"th" and doc/"tr", but what’s
the best way to cram them into an array of structs
or something?
I’ve actually been needing to do something like this for work and
haven’t gotten around to it, so I’ll take a stab at it.
require "ruport"
column_names = (doc/"th")[1...-1].map { |r| (r/"p").text }
rows = (doc/"tr")[3...-1]
# inject needs an empty array as its starting accumulator
parsed_rows = rows.inject([]) { |s,a|
s << (a/"td").map { |r| (r/"td").text }
}
table = parsed_rows.to_table(column_names)
Now, I’ve pastied some of the things you can do from here, because
they won’t translate to email well.
http://pastie.caboo.se/28169
Note that my Hpricot code is sort of hackish, so cleaning it up would be
worthwhile, but Ruport[0] might still be a good fit for representing
the data.
Hope this helps!
-greg
[0] http://ruport.infogami.com
Bil_K
December 17, 2006, 5:26pm
3
On 12/17/06, Gregory B. wrote:
require 'rubygems'
Yuck, seems to have made a mess of the text output.
Here it is better formatted:
http://pastie.caboo.se/28170/text
Bil_K
December 17, 2006, 6:06pm
4
Hi Bill,
How about:
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'enumerator'
Record = Struct.new("Record", :id, :title, :author, :book_level,
:points)
records = []
cells =
Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))/"/html/body/table/tbody/tr//td"
cells.map { |elem| elem.inner_html }.each_slice(5) do |slice|
records << Record.new(*slice)
end
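The slicing step can be tried without Hpricot or a network connection. The following is a minimal sketch, assuming the cell contents have already been extracted into a flat array of strings; the `Book` struct name and the sample values are made up for illustration and do not come from the real page.

```ruby
# Sketch only: `cells` stands in for the inner_html of each <td>;
# the values are invented sample data, not taken from the real page.
Book = Struct.new(:id, :title, :author, :book_level, :points)

cells = [
  "101", "Sample Title A", "Author A", "3.1", "0.5",
  "102", "Sample Title B", "Author B", "3.4", "1.0"
]

# each_slice(5) hands back one five-cell row at a time;
# the splat spreads those five cells over the struct's fields.
records = cells.each_slice(5).map { |slice| Book.new(*slice) }

records.each { |r| puts "#{r.id}: #{r.title} by #{r.author}" }
```

This only works cleanly when every row has exactly five cells; a short final slice would leave the trailing fields nil.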
HTH,
Peter
__
http://www.rubyrailways.com
Bil_K
December 17, 2006, 7:12pm
5
On 12/17/06, Peter S. wrote:
records = []
cells =
Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))/"/html/body/table/tbody/tr//td"
cells.map { |elem| elem.inner_html }.each_slice(5) do |slice|
records << Record.new(*slice)
end
Clever solution, Peter.
If you wanted to adapt this to use Ruport instead of a Struct (and get
the features I showed), try:
records = [].to_table([:id, :title, :author, :book_level, :points])
and then replace the appending code with
records << slice
This would allow struct-like, hash-like, and array-like access as well
as access to Ruport’s data manipulation and formatting tools.
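For comparison, a plain Ruby Struct already offers all three access styles, just without Ruport's manipulation and formatting tools. This is a small sketch using only the standard library, not Ruport's API; the `Entry` name and the values are hypothetical.

```ruby
# Sketch only: plain Struct, no Ruport involved; sample values are invented.
Entry = Struct.new(:id, :title, :author, :book_level, :points)
entry = Entry.new("101", "Sample Title A", "Author A", "3.1", "0.5")

by_method = entry.title    # struct-like access
by_key    = entry[:title]  # hash-like access by field name
by_index  = entry[1]       # array-like access by position

puts [by_method, by_key, by_index].inspect
puts entry.to_a.inspect    # all five fields as an array
```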
Bil_K
December 17, 2006, 6:06pm
6
It’s slow and messy, but I did it in 5 minutes:
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'uri'
require 'pp'
module Hpricot
module Traverse
# Returns the node neighboring this node to the south: just below it.
# This method includes text nodes and comments and such.
def next_node(loop=1)
sib = parent.children
sib[sib.index(self) + loop] if parent
end
end
end
class HTMLpage
#We save some things for a single load of the page
#and just because
def initialize()
#html of the whole page. only get this once
@page_html=nil
load_page
end
##Complete html of page
def page_html
load_page.to_html
end
def row(location=0)
doc=load_page
return doc.search("tbody").collect{|x| x.search("tr")[location] }.compact
end
def to_struct()
doc=load_page
struct=[]
doc.search("tbody").each{|x|
arr=[]
x.search("td").each{|xx|
arr.push(xx.inner_html)
}
(0...arr.size/5).each{|index|
struct.push(Thing.new(arr[(index*5)],arr[(index*5)+1],arr[(index*5)+2],arr[(index*5)+3],arr[(index*5)+4]))
}
}
return struct
end
private
#loads the page data
def load_page
#check if we have page html if so return
if @page_html
doc=Hpricot(@page_html)
else
doc=Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))
@page_html=doc.to_html
end
return doc
end
end
class Thing
attr_reader :quiz_id,:title,:author,:booklevel,:points
def initialize(quizID,title,author,bookLevel,points)
@quiz_id=quizID
@title=title
@author=author
@booklevel=bookLevel
@points=points
end
end
page=HTMLpage.new
stuff=page.to_struct
pp stuff[0].title
pp stuff[0].author
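The "only get this once" idea in load_page can also be written with `||=`, which keeps the fetched result around instead of re-parsing on every call. This is a minimal sketch of that memoization pattern, not a rewrite of the class above; `CachedPage` and the block passed to it are hypothetical stand-ins for the real download.

```ruby
# Sketch only: the block stands in for the real fetch (e.g. open(url).read).
class CachedPage
  attr_reader :fetch_count

  def initialize(&fetcher)
    @fetcher = fetcher
    @fetch_count = 0
  end

  # Runs the fetcher on the first call, then returns the stored result.
  def body
    @body ||= begin
      @fetch_count += 1
      @fetcher.call
    end
  end
end

page = CachedPage.new { "<html><body>sample</body></html>" }
page.body
page.body
puts page.fetch_count
```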
Bil_K
December 17, 2006, 7:20pm
7
This would allow struct-like, hash-like, and array-like access as well
as access to Ruport’s data manipulation and formatting tools.
Thx for the pointer Gregory, I did not know about Ruport yet - seems
very interesting, I will definitely check it out.
Cheers,
Peter
__
http://www.rubyrailways.com
Bil_K
December 17, 2006, 7:28pm
8
On 12/17/06, Peter S. wrote:
This would allow struct-like, hash-like, and array-like access as well
as access to Ruport’s data manipulation and formatting tools.
Thx for the pointer Gregory, I did not know about Ruport yet - seems
very interesting, I will definitely check it out.
It might be overkill if all you needed was struct like access to your
data, but it would sure come in handy if you had some more complex
needs…
Bil_K
December 17, 2006, 9:41pm
9
Bil K. wrote:
require 'open-uri'
http://funit.rubyforge.org
require 'net/http'
http = Net::HTTP.new( "yorkcountyschools.org" )
resp, data = http.get( "/mves/arlist/3-3.4.htm", nil )
table = data.scan( %r{<tr.*?>(.*?)</tr}im ).flatten.
map{|s| s.scan( %r{<td.*?>(.*?)</td>}i ).flatten }.
reject{|ary| ary.size != 5}
p table
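The regex approach can be checked offline against a small inline document. The following is a sketch under stated assumptions: the HTML is a made-up sample rather than the real page, and the patterns are one plausible way to write the row and cell matches, not necessarily the ones in the original message.

```ruby
# Sketch only: made-up HTML with a header row, a spacer row, and two data rows.
html = <<~HTML
  <table>
    <tr><th>Quiz</th><th>Title</th></tr>
    <tr>
      <td>101</td><td>Sample Title A</td><td>Author A</td><td>3.1</td><td>0.5</td>
    </tr>
    <tr><td colspan="5">spacer</td></tr>
    <tr><td>102</td><td>Sample Title B</td><td>Author B</td><td>3.4</td><td>1.0</td></tr>
  </table>
HTML

# First pull out the contents of each row (the m flag lets . span newlines),
# then pull the cell contents out of each row.
rows  = html.scan(%r{<tr[^>]*>(.*?)</tr>}im).flatten
table = rows.map { |row| row.scan(%r{<td[^>]*>(.*?)</td>}im).flatten }
            .select { |cells| cells.size == 5 }

p table
```

Keeping only rows with exactly five cells drops the header and spacer rows. Regexes like these are brittle against nested tables or unusual markup, which is the usual argument for a real parser.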
Bil_K
December 18, 2006, 1:47am
10
On 12/17/06, Gregory B. wrote:
require "open-uri"
body = open("yorkcountyschools.org/mves/arlist/3-3.4.htm").read
whoops… need the http://
Bil_K
December 18, 2006, 1:47am
11
On 12/17/06, William J. wrote:
require 'net/http'
http = Net::HTTP.new( "yorkcountyschools.org" )
resp, data = http.get( "/mves/arlist/3-3.4.htm", nil )
require "open-uri"
body = open("yorkcountyschools.org/mves/arlist/3-3.4.htm").read