Weird error using String#[]

luislavena · January 13, 2011, 3:19am

Please check out the attached file. I am writing a script to notify me
when a few select items become available. It hits a web page then parses
the information in order to determine whether the item is available or
not.

When I parse the values out, I start seeing some really weird results
when calling String#[]. What is even weirder is that when I print
these results with something like puts "val: #{weird_val}", it also
replaces part of the literal prefix being printed, "val: ".

Example:

ret = res[spos, 90]
puts "ret: #{ret}"

^^^^^

Expected live result (works in baseline):

ret: id="ProdAvailability"><span style="font-weight: bold; color: #000;">Availability:Ou

Actual live result (missing Ou on end, r in pos 0 replaced with O):

Oet: id="ProdAvailability"><span style="font-weight: bold; color: #000;">Availability:

If I pull the contents from the web site, it doesn’t work. If I pull the
contents from a string saved in the script (denoted as baseline in the
file), it works fine.

I have been spinning my wheels for 2 days now and am pretty sure that I
am overlooking something obvious.

Anyone have any idea what is causing this?

jasonmorrison · January 13, 2011, 3:28am

This seems like an encoding issue. But are you hand-scraping an HTML
page? Shouldn’t you use something like Nokogiri or REXML or Hpricot?

-Kedar

jasonmorrison · January 13, 2011, 3:38am

I probably should. I’m still relatively new to Ruby / RoR so tend to do
things “by hand” a lot just so I can learn how they work. 9/10 times I
go back afterwards and replace it with something tried and true. I’ve
seen Nokogiri a lot and it is already on my “to research” list.

I thought encoding might be the culprit here but I haven’t gotten so far
as to figure out how to change the encoding. I tried to use the encode()
method but got a no method found error. I’d assume I’d want to change to
whatever the standard is for OS X (UTF-8?)? How would I do this?
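
For reference, here is a minimal sketch of relabeling versus converting a string's encoding, assuming Ruby 1.9 or later. String#encode and String#force_encoding do not exist on Ruby 1.8, which would be one explanation for the no method error. The byte sequences are made-up sample data, not taken from the page in question.

```ruby
# Sample bytes: "aä" encoded as UTF-8 but tagged as binary,
# as can happen when nothing declares the charset of fetched data.
raw = "a\xC3\xA4".b

# force_encoding only relabels the existing bytes; nothing is converted.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)

# encode transcodes the bytes, here from ISO-8859-1 to UTF-8.
latin1 = "a\xE4".b.force_encoding(Encoding::ISO_8859_1)
converted = latin1.encode(Encoding::UTF_8)

p raw.length            # 3 (counted in bytes)
p utf8.length           # 2 (counted in characters)
p utf8.valid_encoding?  # true
p converted == utf8     # true
```

The distinction matters for offsets: an index computed on the binary string counts bytes, while the same index applied to the UTF-8 string counts characters.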

jasonmorrison · January 13, 2011, 5:14pm

On Thu, Jan 13, 2011 at 4:35 AM, Jason M. wrote:

Nokogiri is easier… (see below)

Certainly!

I would still like to know what exactly is causing the weird behavior in
my original post though, if anyone knows. I can understand why encoding
would result in incorrect parsing, but I don’t understand why the
encoding would mess up the hard coded portion of the call to puts still.

Can you provide a small program that exhibits the effect you are
seeing? It is especially important to see how you calculate indexes.

Maybe this can help to illustrate a possible scenario:

Ruby version 1.9.2
irb(main):001:0> s = "aä"
=> "aä"
irb(main):002:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> x = s.dup
=> "aä"
irb(main):004:0> x.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> x.force_encoding "BINARY"
=> "a\xC3\xA4"
irb(main):006:0> x.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):007:0> x[1,1]
=> "\xC3"
irb(main):008:0> s[1,1]
=> "ä"
irb(main):009:0>

Kind regards

robert

jasonmorrison · January 13, 2011, 4:34am

Nokogiri is easier… (see below)

I would still like to know what exactly is causing the weird behavior in
my original post though, if anyone knows. I can understand why encoding
would result in incorrect parsing, but I don’t understand why the
encoding would mess up the hard coded portion of the call to puts still.

Working Nokogiri example:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# On Ruby 2.5+ use URI.open; Kernel#open no longer accepts URLs as of Ruby 3.0.
doc = Nokogiri::HTML(open("http://www.pennstateind.com/store/PKPARK-MAG.html"))
# puts doc
ret = doc.at("div#ProdAvailability")
puts "ret: #{ret}"

Output:

ret:

Outof Stock / Eta Mid January <a href="Shipping and Delivery"
onclick="link_popup(this,'width=500,height=600,toolbar=no,scrollbars=yes');
return false;">See Shipping Details

jasonmorrison · January 13, 2011, 6:12pm

On Thu, Jan 13, 2011 at 6:01 PM, Jason M. wrote:

Thanks, Robert. The original post has the script with both expected and
unexpected outcomes.

I thought more of a small script which does not need network
connection etc. and rather works with static text.

What you show with the encoding screwing up the
offsets makes total sense.

What I’m at a loss for is why it affects the hard coded portion of the
string passed to puts:

Example:
puts "ret: #{ret}"

Output:
Oet: [part but not all of the expected string - 2 chars too short]

Well, that’s easy:

irb(main):014:0> s = "\rA\tB"
=> "\rA\tB"
irb(main):015:0> puts "ret: #{s}"
Aet: B
=> nil
irb(main):016:0> p "ret: #{s}"
"ret: \rA\tB"
=> "ret: \rA\tB"
irb(main):017:0> s = "\rAet: B"
=> "\rAet: B"
irb(main):018:0> puts "ret: #{s}"
Aet: B
=> nil
irb(main):019:0> p "ret: #{s}"
"ret: \rAet: B"
=> "ret: \rAet: B"

To debug you should use p and not puts.

At this point I plan on using Nokogiri but I am really curious what is
causing what I describe above. This is a weirdness for how strings /
puts works that I’d like to understand and keep in mind going forward.

It’s probably rather about how your terminal works than how strings
work.

Kind regards

robert

jasonmorrison · January 13, 2011, 6:15pm

Hi there,

Normally when I see similar behaviour it's because of "hidden"
characters.

Do you have a hidden \r (0x0D, decimal 13) in the text you're reading?
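
A quick way to check for this, and to normalize the text if so, could look like the sketch below. The `res` string is a made-up stand-in for the fetched page body, and `hidden_cr?` and `normalize_newlines` are hypothetical helper names, not part of any library.

```ruby
# Made-up stand-in for a page body that uses CRLF line endings.
res = "Availability:\r\nOut of Stock\r\n"

# True if the text contains at least one carriage return.
def hidden_cr?(text)
  text.include?("\r")
end

# Turns CRLF pairs and lone CRs into plain "\n" newlines.
def normalize_newlines(text)
  text.gsub(/\r\n?/, "\n")
end

clean = normalize_newlines(res)

p hidden_cr?(res)    # true
p hidden_cr?(clean)  # false
p clean              # "Availability:\nOut of Stock\n"
```

Normalizing before computing any offsets keeps the indexes passed to String#[] in step with what is shown on screen.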

jasonmorrison · January 13, 2011, 6:54pm

Robert,

That example shows the same behavior in my console as you show above. So
it is the \r that is causing it, it seems. I suppose the console sees
the \r and tries to create a new line, can’t, and overwrites what is
there? The reason that it is one character too short is because the \r
would count as 1.

Thanks for the example!

jasonmorrison · January 13, 2011, 6:01pm

Thanks, Robert. The original post has the script with both expected and
unexpected outcomes. What you show with the encoding screwing up the
offsets makes total sense.

What I’m at a loss for is why it affects the hard coded portion of the
string passed to puts:

Example:
puts "ret: #{ret}"

Output:
Oet: [part but not all of the expected string - 2 chars too short]

At this point I plan on using Nokogiri but I am really curious what is
causing what I describe above. This is a weirdness for how strings /
puts works that I’d like to understand and keep in mind going forward.

Thanks!

jasonmorrison · January 18, 2011, 8:51am

On Thu, Jan 13, 2011 at 6:54 PM, Jason M. wrote:

That example shows the same behavior in my console as you show above. So
it is the \r that is causing it, it seems. I suppose the console sees
the \r and tries to create a new line, can’t, and overwrites what is
there?

No, \n is newline, \r is carriage return which simply positions the
cursor at the beginning of the line.
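
To make the overwrite effect concrete, here is a rough simulation of how a simple terminal might draw a single line. It is only a sketch: `render_line` is a hypothetical helper, it handles nothing but \r, and real terminals do considerably more.

```ruby
# Simulates drawing one line on screen: "\r" moves the cursor back to
# column 0, and each later character overwrites whatever is already
# displayed at the cursor position.
def render_line(text)
  screen = []
  col = 0
  text.each_char do |ch|
    if ch == "\r"
      col = 0            # carriage return: back to the start of the line
    else
      screen[col] = ch   # draw the character, replacing any old one
      col += 1
    end
  end
  screen.join
end

ret = "Availability:\rO"          # made-up value containing a hidden \r
puts render_line("ret: #{ret}")   # Oet: Availability:
p "ret: #{ret}"                   # "ret: Availability:\rO"
```

The string itself is intact; only the way it is displayed changes, which is why p shows the \r while puts appears to corrupt the prefix.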

The reason that it is one character too short is because the \r
would count as 1.

To see what’s really in the string you should use p or #inspect.

Thanks for the example!

You’re welcome!

Kind regards

robert