String.scan failure when match works

luislavena · March 26, 2012, 4:43am

Good Evening All,

I’m very new to coding of any kind, and I’m trying to learn Ruby by
working through some projects that are relevant to me. I’m working on a
small project to pull the headlines out of a news website for me. I’m
able to get match working just fine. I’ve tried switching to string.scan
to be able to iterate and nothing is returned. I can’t figure out why
not, and would appreciate suggestions.

I realize the code is very ugly, and I apologize!

Thank you in advance,
CJ

page_string='junk'
open("http://www.scatoday.net/") {|f|
  page_string = f.read
}
#THIS one works, one time
#rezstring = /\/node\/[0-9]+.*title=".*">/.match(page_string)
#THESE Don't
#page_string.scan(/\/node\/[0-9]+.*title=".*">/)
#page_string.gsub(/\/node\/[0-9]+.*title=".*">/)

cj_m · March 26, 2012, 7:22am

On Mon, 26 Mar 2012 11:43:51 +0900, CJ M. wrote:

page_string='junk'
open("http://www.scatoday.net/") {|f|

why not this?

thufir@dur:~/ruby/html$ nl html.rb
     1  require 'rubygems'
     2  require 'htmlentities'
     3  require 'net/http'

     4  uri = URI('http://www.scatoday.net/')
     5  page = Net::HTTP.get(uri)

     6  puts page.scan(/\/node\/[0-9]+.*title=".*">/)
thufir@dur:~/ruby/html$

however, I’m sure there’s an existing gem for that.

HTH,

Thufir

cj_m · March 26, 2012, 9:43am

CJ M. wrote in post #1053266:

I’m very new to coding of any kind, and I’m trying to learn Ruby by

Welcome to the world of Ruby - and programming!

working through some projects that are relevant to me. I’m working on a
small project to pull the headlines out of a news website for me. I’m
able to get match working just fine. I’ve tried switching to string.scan
to be able to iterate and nothing is returned. I can’t figure out why
not, and would appreciate suggestions.

page_string='junk'
open("http://www.scatoday.net/") {|f|
  page_string = f.read
}

You could simply do

page_string = open("http://www.scatoday.net/") {|f| f.read}

or even this if you want to get fancy

page_string = open("http://www.scatoday.net/", &:read)
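
A note for readers on newer Rubies: since Ruby 3.0, open-uri no longer lets a bare open() fetch URLs, so the one-liners above need URI.open instead. A minimal sketch, assuming open-uri from the standard library; the helper name fetch_page is made up for illustration:

```ruby
require 'open-uri'

# Hypothetical helper name; returns the body of the page as a String.
def fetch_page(url)
  # URI.open yields an IO-like object to the block; &:read calls #read
  # on it, and the block's value becomes the return value.
  URI.open(url, &:read)
end

# Example call (needs network access, so it is left commented out):
# page_string = fetch_page('http://www.scatoday.net/')
```
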

#THIS one works, one time
#rezstring = /\/node\/[0-9]+.*title=".*">/.match(page_string)
#THESE Don't
#page_string.scan(/\/node\/[0-9]+.*title=".*">/)

What does “does not work” mean? I get 80 matches.

#page_string.gsub(/\/node\/[0-9]+.*title=".*">/)

This cannot work since you either need a second argument or a block.
Please post error messages or a more clear description of what you think
does not work.
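
To illustrate the point about gsub, here is a small sketch of the three call forms. The sample text and the '[link]' replacement are made up for the example. Note that on Ruby 1.8.7 and later, gsub with only a pattern does not raise; it quietly returns an Enumerator and substitutes nothing, which could look like "nothing happened".

```ruby
# Made-up sample text; the pattern is a shortened form of the one in the thread.
text = 'see /node/12 and /node/34'
pattern = /\/node\/[0-9]+/

# Form 1: pattern plus replacement string.
with_string = text.gsub(pattern, '[link]')

# Form 2: pattern plus block; the block's result replaces each match.
with_block = text.gsub(pattern) { |m| m.upcase }

# Form 3: pattern only. No substitution happens; an Enumerator over
# the matches is returned instead.
bare = text.gsub(pattern)

puts with_string
puts with_block
p bare.to_a
```
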

Btw, it seems for what you do Mechanize could be a good tool.
http://mechanize.rubyforge.org/

Kind regards

robert

cj_m · April 1, 2012, 5:57pm

Good morning!

I am not a Ruby programmer (nothing against it; I just happen to use
other languages right now). But I am the publisher of the web site
scatoday.net to which you refer.

My suggestion is that an easier way to retrieve our headlines is to use
our RSS feed, which is XML format and therefore easier to machine-parse.

The URL for the RSS feed: http://www.scatoday.net/node/feed
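
For the Ruby side of this suggestion, a rough sketch of reading headlines from an RSS 2.0 feed with the rss library that ships with Ruby (a bundled gem on 3.0 and later). The sample feed below is invented for illustration, since the real feed's contents are not shown in this thread, and headlines_from_rss is a hypothetical helper name.

```ruby
require 'rss'

# Invented sample feed; the structure of the real feed is an assumption.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example News</title>
      <link>http://example.com/</link>
      <description>Sample feed for illustration</description>
      <item>
        <title>First headline</title>
        <link>http://example.com/node/1</link>
      </item>
      <item>
        <title>Second headline</title>
        <link>http://example.com/node/2</link>
      </item>
    </channel>
  </rss>
XML

# Hypothetical helper: returns the item titles of an RSS document.
def headlines_from_rss(xml)
  # The second argument (false) skips strict validation of the feed.
  feed = RSS::Parser.parse(xml, false)
  feed.items.map(&:title)
end

puts headlines_from_rss(SAMPLE_FEED)
```

To use it against the live feed, the XML string would first have to be downloaded, for example with Net::HTTP.get as in the earlier reply.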

Feel free to email me at publisher at scatoday dot net if you run into
further trouble, and good luck with your project!

cj_m · March 28, 2012, 5:55am

Good Evening,

I will try out both of your suggestions shortly. To clarify, when I say
“does not work” I mean that nothing is returned: no error messages and no
output at all. I had included a puts with gibberish for testing purposes,
so I know the script ran.
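
One possible explanation for this symptom, offered only as a guess since the full script is not shown: in a script (unlike irb), the return value of scan is discarded unless it is printed or stored. The sketch below uses a made-up HTML fragment in place of the downloaded page.

```ruby
# Made-up HTML fragment standing in for the downloaded page.
page_string = <<~HTML
  <a href="/node/101" title="First headline">First headline</a>
  <a href="/node/102" title="Second headline">Second headline</a>
HTML

pattern = /\/node\/[0-9]+.*title=".*">/

# match returns a MatchData for the first hit only (or nil).
first = pattern.match(page_string)

# scan returns an Array with every matching substring.
all = page_string.scan(pattern)

# Neither call prints anything by itself; output needs puts or p.
puts first[0]
puts all
```
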

Thank you both for your time!

CJ