I want to scrape something off a webpage. It has a massive amount of content and I
want to find the thing that comes after the first occurrence of “a
href=” after the occurrence of “id=xxx”. So:
if page.body =~ /id=xxx.+?<a href=['"]?([^'"\s>]*)/m
  capture = $1  # href value of the first link after id=xxx
end
If this seems somewhat perlish, it’s because a perlmonger taught me this
line.
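To see how the non-greedy match behaves, here is a minimal sketch that runs a variant of the pattern above against a made-up page. It assumes the marker shows up as an ordinary attribute such as `<div id=xxx>`; the sample HTML, the URLs, and the helper name `first_href_after_id` are all hypothetical and only there for illustration.

```ruby
# Hypothetical sample page; the id value "xxx" and the URLs are placeholders.
html = <<~HTML
  <html><body>
    <a href="/before.html">ignored, appears before the id</a>
    <div id=xxx>
      <p>Some text</p>
      <a href="/first.html">first link after the id</a>
      <a href='/second.html'>second link</a>
    </div>
  </body></html>
HTML

# /m lets "." match newlines; ".+?" is non-greedy, so the match stops at
# the first "<a href=" that follows "id=xxx".
LINK_AFTER_ID = /id=xxx.+?<a href=['"]?([^'"\s>]*)/m

# Hypothetical helper: returns the captured href, or nil if nothing matches.
def first_href_after_id(body)
  match = LINK_AFTER_ID.match(body)
  match && match[1]
end

puts first_href_after_id(html)  # prints /first.html
```

The link that appears before the id is skipped because the match can only start at “id=xxx”, and the second link is skipped because `.+?` takes as little text as possible.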
I’m a recovering Perlmonger, and that’s exactly what I’d do in that
situation.
–
Giles B.
I’m running a time management experiment: I’m only checking e-mail
twice per day, at 11am and 5pm. If you need to get in touch quicker
than that, call me on my cell.
If you're going to scrape anything more complex than what can be handled by a
few regular expressions, then you might want to take a look at
whytheluckystiff’s Hpricot library: http://code.whytheluckystiff.net/hpricot/
It incorporates Hpricot and gives you both a higher-level approach and
a way to drop down to Hpricot if needed. I think it’s also going to
incorporate FireWatir in the nearish future, or use it somehow (forgot
details).
–
Giles B.