Forum: Ruby str.scan

Colin Summers (Guest)
on 2007-06-15 08:02
(Received via mailing list)
I have a page of html, the usual thing. It has an ordered list. So it
has
  <ol>
     <li>item</li>
     <li>item</li>
     <li>item</li>
     <li>item</li>
     </ol>

Well, I am going through this my usual way, which is just brute force
string manipulation. It's still my first day with Ruby. Then I see

str.scan
Both forms iterate through str, matching the pattern (which may be a
Regexp or a String). For each match, a result is generated and either
added to the result array or passed to the block. If the pattern
contains no groups, each individual result consists of the matched
string, $&. If the pattern contains groups, each individual result is
itself an array containing one entry per group.
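
To make the quoted description concrete, here is a small sketch of both forms of scan, with and without groups. The sample string and the variable names are made up for illustration; only String#scan itself comes from the docs above.

```ruby
# Illustrative sample text (not taken from the page being discussed).
text = "a1 b2 c3"

# No groups: each result is the matched string itself.
plain = text.scan(/[a-z]\d/)          # ["a1", "b2", "c3"]

# With groups: each result is an array with one entry per group.
grouped = text.scan(/([a-z])(\d)/)    # [["a", "1"], ["b", "2"], ["c", "3"]]

# Block form: each result is passed to the block instead of being collected.
letters = []
text.scan(/([a-z])\d/) { |(letter)| letters << letter }

# A String pattern is matched literally, not as a regular expression.
literal = "x.*y xzy".scan("x.*y")     # ["x.*y"]
```

The last line matters for the question that follows: a quoted pattern such as '<li>*</li>' is searched for character by character, so wildcards only take effect inside a /.../ regexp literal.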



And I think, oooh, I bet that would be cool to use here. But my regexp
is rusty and I'm not sure how I would set it up
   items = page.scan('<li>*</li>')
something like that? Then items would be an array of the text in the
items?

Looked cool, anyway. I love how terse it can be.

There's probably also an html/xml parsing library, but I don't have
THAT much of this stuff to do, so I think a little manual work is
probably simpler/easier to learn.

--Colin
Peter Szinek (Guest)
on 2007-06-15 08:14
(Received via mailing list)
Colin,

> But my regexp
> is rusty and I'm not sure how I would set it up
>   items = page.scan('<li>*</li>')
> something like that? Then items would be an array of the text in the items?

Yes, they will be.

However, first things first:

1) items = page.scan('<li>*</li>')

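What I believe you want instead is

items = page.scan(/<li>.*<\/li>/)

(or maybe items = page.scan(/<li>.+<\/li>/) if you are not interested in
empty <li>s). Note the /.../ regexp literal: a quoted String pattern is
matched literally, so the wildcards would have no effect.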

2) What I really believe you want is

items = page.scan(/<li>.*?<\/li>/)

? adds greediness to your regexp - so instead of matching the first
<li>, then matching as much as possible of anything, then matching the
*last* </li>, 2) will match as little as possible.

Let's try:

stuff = <<HTML
<li>aaa</li>
<li>bbb</li>
HTML

 >> stuff.scan(/<li>.*?<\/li>/)
=> ["<li>aaa</li>", "<li>bbb</li>"]

3) Maybe you want even this:

 >> stuff.scan(/<li>(.*?)<\/li>/)
=> [["aaa"], ["bbb"]]

or, even more friendly:

 >> stuff.scan(/<li>(.*?)<\/li>/).flatten
=> ["aaa", "bbb"]
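
One caveat, offered as a hedged aside: by default . does not match a newline, so an item whose text wraps onto a second line is silently skipped. A minimal sketch, assuming a made-up page with one wrapped item, shows the difference the /m flag makes:

```ruby
# Hypothetical page: the second item wraps onto a second line.
page = <<HTML
<ol>
  <li>one</li>
  <li>two
  lines</li>
</ol>
HTML

# Without /m, "." stops at the newline, so the wrapped item is missed.
single = page.scan(/<li>(.*?)<\/li>/).flatten     # ["one"]

# With /m, "." also matches newlines, so both items are found.
multi = page.scan(/<li>(.*?)<\/li>/m).flatten     # ["one", "two\n  lines"]

# Optional tidy-up: collapse runs of whitespace inside each item.
tidy = multi.map { |item| item.split.join(" ") }  # ["one", "two lines"]
```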

HTH,
Peter
_
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.
Michael Hollins (Guest)
on 2007-06-15 10:20
(Received via mailing list)
Colin Summers wrote:
> There's probably also an html/xml parsing library,

There are a number of them.

> but I don't have
> THAT much of this stuff to do, so I think a little manual work is
> probably simpler/easier to learn.

I doubt it. The html/xml parsing libraries that are available are very
easy to learn. Here are a few you could google for:

    hpricot, mechanize, rexml, rubyful_soup

You can literally be up and running within minutes with one of those.
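
As a rough illustration of how little code that takes, here is a minimal sketch using rexml, one of the libraries listed above. It assumes the markup is well-formed XML (true for the small list in the question, often not for real-world HTML) and that the rexml library is available, which it is in a standard Ruby install. The sample markup is made up.

```ruby
require "rexml/document"

# Hypothetical well-formed snippet standing in for the real page.
html = <<HTML
<ol>
  <li>item one</li>
  <li>item two</li>
</ol>
HTML

doc = REXML::Document.new(html)

# Select every <li> element with XPath and take its text.
items = REXML::XPath.match(doc, "//li").map(&:text)  # ["item one", "item two"]
```

For messy HTML that is not valid XML, one of the more forgiving parsers in the list is likely a better fit than rexml.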

cheers,
mick
Peter Szinek (Guest)
on 2007-06-15 10:37
(Received via mailing list)
>    hpricot, mechanize, rexml, rubyful_soup

and if you decide you need something advanced, you could check out
scRUBYt! as well.

Cheers,
Peter
Gavin Kistner (phrogz)
on 2007-06-15 17:06
(Received via mailing list)
On Jun 15, 12:13 am, Peter Szinek <p...@rubyrailways.com> wrote:
> 2) What I really believe you want is
>
> items = page.scan(/<li>.*?<\/li>/)
>
> ? adds greediness to your regexp - so instead of matching the first
> <li>, then matching as much as possible of anything, then matching the
> *last* </li>, 2) will match as little as possible.

Minor pedantic correction: .* is greedy (it grabs as much as it can).
The question mark makes it non-greedy (stop as soon as you've found a
match).
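
A small sketch of the difference, assuming a made-up string with two items on one line (in the earlier example each item sat on its own line, and since . does not cross newlines by default, the greedy form would have given the same answer there):

```ruby
# Two items on a single line, so a greedy match can run across both.
line = "<li>aaa</li><li>bbb</li>"

# Greedy: .* grabs as much as it can, up to the *last* </li>.
greedy = line.scan(/<li>(.*)<\/li>/).flatten   # ["aaa</li><li>bbb"]

# Non-greedy: .*? stops at the first </li> that completes a match.
lazy = line.scan(/<li>(.*?)<\/li>/).flatten    # ["aaa", "bbb"]
```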
Colin Summers (Guest)
on 2007-06-15 17:46
(Received via mailing list)
stuff.scan(/<li>(.*?)<\/li>/).flatten

is exactly what I was hoping for. I peeked at scRUBYt and I know that
I am duplicating work in there, but I am trying to do a bunch of things
at once and one is learning Ruby. scRUBYt is doing so much work for me
that I wouldn't learn very much.

The tcl code that stuff.scan(/<li>(.*?)<\/li>/).flatten replaces is so long.
That's great.

Day 2: Have my pickaxe. Bought Pine's book because it was fun to read
on the web and I like having books. Bought another copy of Lenz' Rails
book because a friend liked it so much he took it. 115 lines and I am
ahead of where the professional consultant was with the .NET
application (after a month of programming).

Thanks,
--Colin