Populate an array based on object

Hi,

I have been trying for some time to crack a problem which I am sure if
very
simple! I am learning Ruby and programming in general and having lots of
fun. But help would be appreciated…

I am writing a program to scrape body text from a series of web pages so
that they can be presented in a text file.

The format for the URLs of the series of pages I am interested in is:

www.targeturl.com/episode_x?=page1


www.targeturl.com/episode_x?=page[y]

Basically, “episode_x” has y number of pages, starting at 1.

I am using Nokogiri to grab the text from the page and can quite easily
get
the text from page1, but I want to loop through page2, grab its text,
page3,
grab its text, etc, until I reach page[y] which is where the text ends,
and
to Nokogiri - this means there is no more text on that page (i.e.
body_text
== nil).

Before attempting to grab the body text and append to a text file, my
strategy is to populate an array of ‘valid’ urls, based on a test which
involves Nokogiri finding text in the body tag, starting at page1. I
want
the loop to finish when the test finds body_text == nil, leaving me with
a
collection of URLs which I know to definitely contain body text.

After a lot of playing around, I have got this far, but there is no
looping
going on. I am getting the page okay and am testing for a certain
condition
which results in “Empty!” being appended to the array (essentially, when
body_text == nil). But I can’t work out how to loop.

def get_text(base_url, page_number)

@target_url = base_url + page_number.to_s
@noko_doc = Nokogiri::HTML(open(@target_url))

@text = ''
@noko_doc.css('div.body_recap').each do |text|
   @text << text.content
   @text = @text.strip!
   return @text
 end

end

def collect_urls(base_url, page_number)
@valid_urls = []
text = get_text(base_url, page_number)
if text =~ /\A\s*Previous/
@valid_urls << “END!”
else @valid_urls << @target_url
return @valid_urls
end
end
end

Any help or comments very welcome!

Thanks all.

Matt

M R Lemon wrote in post #1023844:

But I can’t work out how to loop.

def get_text(base_url, page_number)

@target_url = base_url + page_number.to_s
@noko_doc = Nokogiri::HTML(open(@target_url))

@text = ''
@noko_doc.css('div.body_recap').each do |text|
   @text << text.content
   @text = @text.strip!
   return @text
 end

end

Aside: there is no need to use instance variables (e.g. @target_url)
here. Local variables would be fine (e.g. target_url).

Instance variables persist within the object instance even when they
return (e.g. another method could see what was assigned to @target_url
previously). This means that you’re holding on to references which will
prevent these temporary objects from being garbage-collected.

def collect_urls(base_url, page_number)
@valid_urls = []
text = get_text(base_url, page_number)
if text =~ /\A\s*Previous/
@valid_urls << “END!”
else @valid_urls << @target_url
return @valid_urls
end
end
end

Well, I see no loop in there. Also, the formatting is a bit odd, which
makes it hard to read. What you’ve actually written is this:

if text =~ /\A\s*Previous/
@valid_urls << “END!”
else
@valid_urls << @target_url
return @valid_urls
end

So I’m not sure exactly what you’re trying to achieve, and which of
these two conditions is supposed to be the loop termination, but you
could try this as a starting point:

def collect_urls(base_url, page_number)
valid_urls = []
1000.times do # prevent infinite looping
text = get_text(base_url, page_number)
if text =~ /\A\s*Previous/
valid_urls << “END!”
break
else
valid_urls << @target_url
page_number += 1
end
end
return valid_urls
end

Incidentally, you seem to be relying on the instance variable
@target_url being set after get_text being called. It would be cleaner
to return the url and the text as two values.

return target_url, text

...

target_url, text = get_text(base_url, page_number)

HTH,

Brian.

On 26 September 2011 20:22, Brian C. [email protected] wrote:

end

target_url, text = get_text(base_url, page_number)

Brian,

Many thanks for your helpful suggestions. This is working for me now.
Originally, I was trying to avoid ‘hard-coding’ the loop such as
‘1000.times do’ but I see now that this is probably the most elegant
solution. Two really good learning points there for me two on the use
of
instance variables and efficiently returning and referring to
expressions.
Thanks again!

Matt

This forum is not affiliated to the Ruby language, Ruby on Rails framework, nor any Ruby applications discussed here.

| Privacy Policy | Terms of Service | Remote Ruby Jobs