Hpricot and regexp?

pood · May 14, 2008, 7:43am

I’m trying to grab the “cache date” from a Google search result.

I’m using Mechanize (and its built-in Hpricot):

require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.forms.with.name("f").first
search_form.q = "Hello"
search_results = agent.submit(search_form)
cache_date = agent.click search_results.links.text('Cached')

date = cache_date.search('table table > td').inner_html

How do I grab the date, like the one on this page:
http://209.85.173.104/search?q=cache%3Ashacknews.com&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

I mean the part right after “as retrieved on” (the date).
Is there a built-in Hpricot method that can search by regexp,
or will I have to use something like gsub?

pood · May 14, 2008, 7:46am

Oops, I meant grep, not gsub.

Oh, I got it down to this:

date = cache_date.search('table table > td').inner_text.grep(/retrieved on (.+)./)

which outputs:

["This is G o o g l e's cache of http://www.hello.com/ as retrieved on May 11, 2008 01:09:29 GMT.\n"]

How do I get rid of everything before the date?
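
A minimal sketch of why the whole sentence comes back: grep selects the lines that match the pattern and ignores the capture group, so the parentheses have no effect on the result. Indexing each matching line with the same regexp and a capture number returns only the captured text. The sample string is copied from the output above, the variable names are illustrative, and the final dot is escaped here on the assumption that a literal period was intended. The snippet calls grep on text.lines because calling grep directly on a String, as in the post, relies on Ruby 1.8, where String was Enumerable.

```ruby
# Sample text copied from the output quoted above; a real run would take it
# from the cached page instead.
text = "This is G o o g l e's cache of http://www.hello.com/ as retrieved on May 11, 2008 01:09:29 GMT.\n"

# Assumption: the trailing dot is meant as a literal period, so it is escaped.
pattern = /retrieved on (.+)\./

# Without a block, grep returns every matching line in full.
whole_lines = text.lines.grep(pattern)

# With a block, each matching line is mapped; String#[] with a capture
# index returns only the text of group 1.
dates = text.lines.grep(pattern) { |line| line[pattern, 1] }

puts dates.first
```

Here whole_lines still holds the complete sentence, while dates holds only the part after "retrieved on", without the trailing period.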

pood · May 14, 2008, 8:01am

Now I have this:

date = cache_date.search('table table > td').inner_text.grep(/retrieved on (.+)./).to_s.gsub(/.+as retrieved on /, "").gsub(/.\n/, "")

which gives me exactly what I need. Is there a better way to do this?

pood · May 14, 2008, 8:54am

Google cached pages are laid out so that the first top-level table
contains the boilerplate cache text, and a copy of the page comes
after it.

This is what I would use to clip out the date:

require 'hpricot'
require 'open-uri'

url = "http://64.233.167.104/search?q=cache:hydO8fs-rmQJ:en.wikipedia.org/wiki/Court-martial+court+martial&hl=en&ct=clnk&cd=1&gl=us&client=firefox-a"
doc = Hpricot(open(url))
a = doc.search("/table").inner_text
# String#[] with a regexp and a capture index returns only group 1
a[/retrieved on (.*?) GMT/, 1]
# => "May 13, 2008 11:37:34"
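
If the date is needed as a value rather than as text, the captured string can probably be handed to the standard library's Time.parse. This is only a sketch: the captured variable stands in for the result of the regexp lookup above, and appending " GMT" assumes the timestamp is always given in GMT, as it is in the examples in this thread.

```ruby
require 'time'

# Stand-in for the string captured by the regexp above.
captured = "May 13, 2008 11:37:34"

# The pattern stops before "GMT", so the zone is added back before parsing
# (assumption: the cache banner always reports GMT).
retrieved_at = Time.parse("#{captured} GMT").utc

puts retrieved_at.iso8601
```

Once it is a Time object, the value can be compared, sorted, or reformatted with strftime.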

pood · May 14, 2008, 6:08pm

Dan D.,

wow, much shorter, thanks!