Str.scan

I have a page of html, the usual thing. It has an ordered list. So it
has

  1. item
  2. item
  3. item
  4. item

Well, I am going through this my usual way, which is just brute force
string manipulation. It’s still my first day with Ruby. Then I see

str.scan
Both forms iterate through str, matching the pattern (which may be a
Regexp or a String). For each match, a result is generated and either
added to the result array or passed to the block. If the pattern
contains no groups, each individual result consists of the matched
string, $&. If the pattern contains groups, each individual result is
itself an array containing one entry per group.
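
The behaviour described above can be sketched in a few lines. This is only an illustration: the sample string and variable names are made up here and are not from the original page.

```ruby
# Hypothetical sample input, chosen only to illustrate the rules above.
text = "a=1, b=22, c=333"

# No groups: each result is the matched string itself.
digits = text.scan(/\d+/)

# With groups: each result is an array with one entry per group.
pairs = text.scan(/(\w)=(\d+)/)

# A String pattern is matched literally, not as a regular expression.
commas = text.scan(", ")

# Block form: each result is passed to the block instead of being collected.
total = 0
text.scan(/\d+/) { |n| total += n.to_i }
```

Here digits is ["1", "22", "333"], pairs is [["a", "1"], ["b", "22"], ["c", "333"]], commas is [", ", ", "], and total ends up as 356.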

And I think, oooh, I bet that would be cool to use here. But my regexp
is rusty and I’m not sure how I would set it up
items = page.scan('<li>*</li>')

Something like that? Then items would be an array of the text in the
items?

Looked cool, anyway. I love how terse it can be.

There’s probably also an html/xml parsing library, but I don’t have
THAT much of this stuff to do, so I think a little manual work is
probably simpler/easier to learn.

–Colin

    Colin,

    But my regexp
    is rusty and I’m not sure how I would set it up
    items = page.scan('<li>*</li>')
    something like that? Then items would be an array of the text in the items?

    Yes, they will be.

    However, first things first:

    1) items = page.scan('<li>*</li>')

    I believe what you want instead is

    items = page.scan(/<li>.*<\/li>/)

    (or maybe items = page.scan(/<li>.+<\/li>/) if you are not interested in
    empty <li>s)

    2) What I really believe you want is

    items = page.scan(/<li>.*?<\/li>/)

    ? adds greediness to your regexp - so instead of matching the first
    <li>, then matching as much as possible of anything, then matching the
    *last* </li>, 2) will match as little as possible.

    Let’s try:

    stuff = <<HTML
    <li>aaa</li>
    <li>bbb</li>
    HTML

    stuff.scan(/<li>.*?<\/li>/)
    => ["<li>aaa</li>", "<li>bbb</li>"]

    3) Maybe you want even this:

    stuff.scan(/<li>(.*?)<\/li>/)
    => [["aaa"], ["bbb"]]

    4) or, even more friendly:

    stuff.scan(/<li>(.*?)<\/li>/).flatten
    => ["aaa", "bbb"]

    HTH,
    Peter
    _
    http://www.rubyrailways.com :: Ruby and Web2.0 blog
    http://scrubyt.org :: Ruby web scraping framework
    http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.

    hpricot, mechanize, rexml, rubyful_soup

    and if you decide you need something advanced, you could check out
    scRUBYt! as well.

    Cheers,
    Peter

    Colin S. wrote:

    There’s probably also an html/xml parsing library,

    There are a number of them.

    but I don’t have
    THAT much of this stuff to do, so I think a little manual work is
    probably simpler/easier to learn.

    I doubt it. The html/xml parsing libraries that are available are very
    easy to learn. Here are a few you could google for:

    hpricot, mechanize, rexml, rubyful_soup

    You can literally be up and running within minutes with one of those.

    cheers,
    mick
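
For comparison with the regexp approach, here is a minimal sketch using rexml, one of the libraries named above. It assumes the input is a well-formed fragment (REXML is an XML parser, so sloppy real-world HTML may make it raise); the sample markup and variable names are hypothetical.

```ruby
require "rexml/document"

# Hypothetical well-formed fragment standing in for the real page.
html = "<ol><li>aaa</li><li>bbb</li></ol>"

# Parse the fragment, select every <li> element with XPath,
# and collect the text of each one.
doc = REXML::Document.new(html)
items = REXML::XPath.match(doc, "//li").map(&:text)
```

With this input, items is ["aaa", "bbb"], the same result as the scan-and-flatten version.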

    On Jun 15, 12:13 am, Peter S. wrote:

    2) What I really believe you want is

    items = page.scan(/<li>.*?<\/li>/)

    ? adds greediness to your regexp - so instead of matching the first
    <li>, then matching as much as possible of anything, then matching the
    *last* </li>, 2) will match as little as possible.

    Minor pedantic correction: .* is greedy (it grabs as much as it can).
    The question mark makes it non-greedy (stop as soon as you’ve found a
    match).
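
The difference is easiest to see when both elements sit on one line. A small sketch, with a made-up input string and variable names:

```ruby
# Hypothetical one-line input, so both <li> elements share a line.
line = "<li>aaa</li><li>bbb</li>"

# Greedy: .* runs to the LAST </li>, swallowing the markup in between.
greedy = line.scan(/<li>(.*)<\/li>/).flatten

# Non-greedy: .*? stops at the FIRST </li> it can reach.
lazy = line.scan(/<li>(.*?)<\/li>/).flatten
```

Here greedy is ["aaa</li><li>bbb"] while lazy is ["aaa", "bbb"]. When each element is on its own line the two patterns give the same result, because . does not match a newline by default.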

    stuff.scan(/<li>(.*?)<\/li>/).flatten

    is exactly what I was hoping for. I peeked at scRUBYt and I know that
    I am duplicating work in there, but I am trying to do a bunch of things
    at once and one is learning Ruby. scRUBYt is doing so much work for me
    that I wouldn’t learn very much.

    The tcl code that stuff.scan(/<li>(.*?)<\/li>/).flatten replaces is so long.
    That’s great.

    Day 2: Have my pickaxe. Bought Pine’s book because it was fun to read
    on the web and I like having books. Bought another copy of Lenz’ Rails
    book because a friend liked it so much he took it. 115 lines and I am
    ahead of where the professional consultant was with the .NET
    application (after a month of programming).

    Thanks,
    –Colin