Forum: Ruby Regex problem, probably simple

Jim K. (Guest)
on 2007-05-14 22:23
Hello!

I'm trying to do the following:

I want to scrape something off a webpage. It has a lot of content, and I
want to find whatever comes after the first occurrence of "a
href=" that follows the occurrence of "id=xxx". So:

...
<id=xxx>
...
<a href=??????>
...

How can I do this?

Thank you!
Axel E. (Guest)
on 2007-05-14 23:47
(Received via mailing list)
Dear Jim,

assuming that you have the following in webpage "a0.html":

...
<id=xxx>
...
<a href=??????>
...

,

you can run the following script:

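my_page = File.read("a0.html")    # read the whole page into one string
r1 = /<id=xxx>/                   # the marker to search after
r2 = /(?=<a href=[^>]+>)/         # lookahead: split before a link, keeping it
r3 = /<a href=([^>]+)>/           # captures the href value
text = my_page.split(r1)
text2 = text[1..-1].join.split(r2)[1]
ref = r3.match(text2)
puts 'the first link was : ' + ref[1]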

I read the entire page into a string, my_page, and
split it into an Array at each occurrence of regexp r1.
I then join everything after the first occurrence back into a string
and split that into an Array using regexp r2, which keeps the
delimiter (of the form <a href=[^>]+>; that's what the (?= .. ) syntax is
for) rather than dropping it, as the first split does.
If there is text before the first occurrence of r3,
you'll find it in the first element of the split string:

 text[1..-1].join.split(r2)[0],

and the first occurrence of r3 is in the second element

text2.
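
To make the difference between the two kinds of split concrete, here is a minimal sketch. The sample page, the constant SAMPLE_PAGE, and the helper pieces_after_marker are made up for illustration and are not part of the script above; the sketch also assumes the <id=xxx> marker is present in the page.

```ruby
# Hypothetical sample page; the markup mirrors the example in the question.
SAMPLE_PAGE = <<~HTML
  <p>intro</p>
  <id=xxx>
  <p>some text</p>
  <a href=first.html>one</a>
  <a href=second.html>two</a>
HTML

# Hypothetical helper: returns the pieces of the page after the first
# <id=xxx>, split so that each link tag stays at the start of its piece.
# Assumes the marker occurs at least once.
def pieces_after_marker(page)
  # The limit of 2 makes split cut only at the first marker.
  after_id = page.split(/<id=xxx>/, 2)[1]
  # (?=...) is a lookahead: it matches the position just before a link
  # tag without consuming the tag, so split keeps the tag in the output.
  after_id.split(/(?=<a href=[^>]+>)/)
end

pieces = pieces_after_marker(SAMPLE_PAGE)
puts pieces[1]                          # the piece that starts with the first link
puts pieces[1][/<a href=([^>]+)>/, 1]   # just the href value
```

Splitting on /<a href=[^>]+>/ without the lookahead would return the text between the links and throw the link tags themselves away, which is why the lookahead form is used here.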

If you want more information about Regexps, you might
find this helpful:

http://www.regular-expressions.info/ruby.html

Best regards,

Axel
Dan Z. (Guest)
on 2007-05-15 00:04
(Received via mailing list)
Jim Kronhamn wrote:
> ...
> <a href=??????>
> ...
>
> How can I do this?
>
> Thank you!
>

Jim,

Is the "???" the thing you want to capture? If so, the following should
do the trick:

if page.body =~ /<id=xxx>.+?<a href=['"]?([^'"\s>]*)/m
     capture = $1
end

If this seems somewhat perlish, it's because a perlmonger taught me this
line.
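
For anyone who wants to try this without a fetched page, here is a self-contained sketch of the same pattern. The helper name first_href_after_id and the sample strings are invented for illustration; body simply stands in for page.body.

```ruby
# Hypothetical helper; `body` stands in for page.body from the snippet above.
def first_href_after_id(body)
  # <id=xxx>     the literal marker
  # .+?          as little text as possible (lazy), so the match stops at
  #              the first link after the marker
  # /m           lets . match newlines, since the marker and the link
  #              usually sit on different lines
  # ['"]?        an optional opening quote
  # ([^'"\s>]*)  the href value: everything up to a quote, whitespace or >
  if body =~ /<id=xxx>.+?<a href=['"]?([^'"\s>]*)/m
    $1
  end
end

# Made-up sample: one link before the marker, two after it.
sample = %(<a href="before.html">\n<id=xxx>\n<p>text</p>\n<a href="first.html">\n<a href=second.html>\n)

puts first_href_after_id(sample)
p first_href_after_id("no marker here")   # no match, so the helper returns nil
```

Note that .+? needs at least one character between the marker and the link, so a link that follows <id=xxx> with nothing at all in between would be skipped; changing it to .*? is one way to allow that case.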

Dan
Jim K. (Guest)
on 2007-05-15 00:23
Thank you for the answers! Worked like a charm.
Giles B. (Guest)
on 2007-05-15 23:57
(Received via mailing list)
> if page.body =~ /<id=xxx>.+?<a href=['"]?([^'"\s>]*)/m
>      capture = $1
> end
>
> If this seems somewhat perlish, it's because a perlmonger taught me this
> line.

I'm a recovering Perlmonger, and that's exactly what I'd do in that
situation.

--
Giles B.

I'm running a time management experiment: I'm only checking e-mail
twice per day, at 11am and 5pm. If you need to get in touch quicker
than that, call me on my cell.

Blog: http://gilesbowkett.blogspot.com
Portfolio: http://www.gilesgoatboy.org
Christian Theil H. (Guest)
on 2007-05-16 01:10
(Received via mailing list)
If you're going to scrape anything more complex than what can be handled by
a few regular expressions, then you might want to take a look at
whytheluckystiff's Hpricot library:
http://code.whytheluckystiff.net/hpricot/

It's excellent for scraping web pages.

Best regards,
Christian

blog: http://inferencing.blogspot.com
Giles B. (Guest)
on 2007-05-17 03:41
(Received via mailing list)
On 5/15/07, Christian Theil H. <removed_email_address@domain.invalid> wrote:
> If you're going to scrape anything more complex than what can be handled by a
> few regular expressions, then you might want to take a look at
> whytheluckystiff's Hpricot library:
> http://code.whytheluckystiff.net/hpricot/
>
> It's excellent for scraping web pages.

There's also scrubyt:

http://scrubyt.org/

It incorporates Hpricot and gives you both a higher-level approach and
a way to drop down to Hpricot if needed. I think it's also going to
incorporate FireWatir in the nearish future, or use it somehow (forgot
details).

--
Giles B.