Regex breaks when string contains binary data

aishafenton · September 8, 2010, 10:41am

Hi,
I can’t see why the following Regex isn’t working as expected. Maybe a
bug in ruby?

puts "345HI".match(/HI|BYE/).inspect

HI
puts "\345HI".match(/HI|BYE/).inspect
nil

What I expected was for the second regex to find “HI” in the string too.
I need this because I’m printing data to stdout that will contain some
binary characters, and I’m using a regex to escape any newline
characters it may contain.
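
For reference, here is a minimal sketch of the newline escaping described above, done byte by byte so that it does not depend on any regex or character-encoding setting. The helper name escape_newlines and the sample bytes are made up for illustration; they are not from the original script.

```ruby
# Hypothetical helper (not from the thread): replace each newline byte
# with the two characters backslash + "n", leaving every other byte alone.
def escape_newlines(data)
  data.unpack("C*").map { |byte|
    byte == 10 ? "\\n" : byte.chr   # 10 is the byte value of "\n"
  }.join
end

# Sample bytes: 0xE5, "H", "I", newline, "B", "Y", "E"
sample  = [229, 72, 73, 10, 66, 89, 69].pack("C*")
escaped = escape_newlines(sample)

# Print the result as byte values so the terminal never sees raw binary.
p escaped.unpack("C*")
```

Note that this is not reversible as written: a literal backslash followed by "n" in the input looks the same as an escaped newline in the output, so backslashes would need escaping too if the data has to be decoded again.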

I’m using Ruby 1.8.7

Thanks,
Aish

aishafenton · September 8, 2010, 12:36pm

Aisha F. wrote:

I can’t see why the following Regex isn’t working as expected. Maybe a
bug in ruby?

puts "345HI".match(/HI|BYE/).inspect

HI
puts "\345HI".match(/HI|BYE/).inspect
nil

Are you typing this in irb? Are you using any dodgy irb plugins like
wirble or anything else in .irbrc?

You should get a MatchData object as the response. For me it works
exactly as expected: this is under Ubuntu Lucid with the stock 1.8.7
packages.

$ irb --simple-prompt

RUBY_VERSION
=> "1.8.7"

RUBY_DESCRIPTION
=> "ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]"

puts "345HI".match(/HI|BYE/).inspect
#<MatchData "HI">
=> nil

puts "\345HI".match(/HI|BYE/).inspect
#<MatchData "HI">
=> nil

puts "\345KI".match(/HI|BYE/).inspect
nil
=> nil

I’m using Ruby 1.8.7

What exact version of ruby 1.8.7? (Show ruby -v). And where did you get
it from? Did you compile it yourself from source, or did you install a
ready-made binary package - if so, which one?

What operating system are you running under?

What results do you get if you try the following in irb?

"\345HI".match(/HI|BYE/)
=> #<MatchData "HI">

"\345HI" =~ /HI|BYE/
=> 1

"\345HI".bytes.to_a
=> [229, 72, 73]

aishafenton · September 8, 2010, 12:54pm

Hi Brian,
The output of those irb statements for me is:

RUBY_VERSION
=> "1.8.7"

RUBY_DESCRIPTION
=> "ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]"

puts "345HI".match(/HI|BYE/).inspect
#<MatchData "HI">
=> nil

puts "\345HI".match(/HI|BYE/).inspect
nil
=> nil

puts "\345KI".match(/HI|BYE/).inspect
nil
=> nil

"\345HI" =~ /HI|BYE/
=> nil

"\345HI".bytes.to_a
=> [229, 72, 73]

And I’m not running any dodgy irb plugins, such as wirble.

aishafenton · September 8, 2010, 1:15pm

Aisha F. wrote:

Hi Brian,
The output of those irb statements for me is:
…

RUBY_DESCRIPTION
=> "ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]"
…

"\345HI" =~ /HI|BYE/
=> nil

"\345HI".bytes.to_a
=> [229, 72, 73]

Hmm. What does $KCODE show? I can reproduce your problem if I turn it
on.

$ irb --simple-prompt

"\345HI" =~ /HI|BYE/
=> 1

$KCODE
=> "NONE"

$ irb -Ku --simple-prompt

"\345HI" =~ /HI|BYE/
=> nil

"345HI" =~ /HI|BYE/
=> 3

$KCODE
=> "UTF8"

Do you have something in RUBYOPT environment variable? (echo $RUBYOPT)
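
A possible explanation for the difference, which is my reading of the symptoms rather than something confirmed in this thread: byte 0xE5 (octal 345) has the bit pattern of a UTF-8 lead byte that announces a three-byte character. With $KCODE set to UTF8 the regex engine would then treat \345, H and I as one (malformed) character, so "HI" never starts on a character boundary. The sketch below only illustrates that byte arithmetic; the helper names are made up and it does not call into Ruby's regex engine at all.

```ruby
# How many bytes a UTF-8 sequence claims, judged only by its lead byte.
# Hypothetical helper for illustration; continuation bytes are not validated.
def utf8_sequence_length(lead)
  if    lead < 0x80 then 1   # 0xxxxxxx: plain ASCII
  elsif lead < 0xC0 then 1   # 10xxxxxx: stray continuation byte
  elsif lead < 0xE0 then 2   # 110xxxxx: two-byte sequence
  elsif lead < 0xF0 then 3   # 1110xxxx: three-byte sequence
  else 4                     # 11110xxx: four-byte sequence
  end
end

# Split a byte array into "characters" the way a naive UTF-8 walk would,
# trusting each lead byte and skipping that many bytes.
def naive_utf8_chars(bytes)
  chars = []
  i = 0
  while i < bytes.size
    n = utf8_sequence_length(bytes[i])
    chars << bytes[i, n]
    i += n
  end
  chars
end

binary_bytes = [229, 72, 73]          # "\345HI", as shown with bytes.to_a
ascii_bytes  = [51, 52, 53, 72, 73]   # "345HI"

p utf8_sequence_length(binary_bytes[0])   # length claimed by 0xE5
p naive_utf8_chars(binary_bytes)          # H and I are swallowed by 0xE5
p naive_utf8_chars(ascii_bytes)           # every byte stands alone
```

Under that reading, "\345HI" is a single three-byte "character" in UTF8 mode, while "345HI" is five separate ones, which matches the results above.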

aishafenton · September 8, 2010, 1:29pm

$KCODE is set to “UTF8”, and my RUBYOPT env variable is set to
“rubygems”.

Hmmm could be the problem. I haven’t really got my head around Ruby’s
String encoding, might be time.

aishafenton · September 8, 2010, 2:02pm

Aisha F. wrote:

$KCODE is set to “UTF8”, and my RUBYOPT env variable is set to
“rubygems”.

Try writing your script like this:

#!/usr/bin/ruby -Kn
puts "\345HI".match(/HI|BYE/).inspect

or put $KCODE = "NONE" at the top.

If it’s not being set in RUBYOPT then perhaps Apple built ruby using
option
--with-default-kcode=UTF8
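
If changing $KCODE for the whole script is not an option, another route that might work is to mark just the one regex as byte-wise with the "n" flag. This is a sketch resting on the assumption that in 1.8 the n flag overrides $KCODE for that regex; I have not run it on the Apple build discussed here. The sample string is built with pack so that the bytes are the same as "\345HI" whatever the source encoding.

```ruby
# Same three bytes as "\345HI": 0xE5, "H", "I".
sample = [229, 72, 73].pack("C*")

# The trailing "n" asks for byte-wise ("none") matching for this regex only.
pattern = /HI|BYE/n

p(sample =~ pattern)   # offset of the match, counted in bytes
p sample[pattern]      # the matched text
```

The same flag can go on whatever regex does the newline escaping, so the rest of the program keeps its UTF8 setting.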

Hmmm could be the problem. I haven’t really got my head around Ruby’s
String encoding, might be time.

ruby 1.8’s is pretty straightforward; ruby 1.9’s is just boggling.

http://blog.grayproductions.net/categories/character_encodings

Regards,

Brian.

aishafenton · September 9, 2010, 11:19am

On Wed, Sep 8, 2010 at 1:29 PM, Aisha F. wrote:

$KCODE is set to “UTF8”, and my RUBYOPT env variable is set to
“rubygems”.

Hmmm could be the problem. I haven’t really got my head around Ruby’s
String encoding, might be time.

You can check the encoding of the string like this:

irb(main):006:0> s="\345HI"
=> "\xE5HI"
irb(main):007:0> s.encoding
=> #<Encoding:CP850>
irb(main):008:0> s.encoding.to_s
=> "CP850"

It’s also interesting to see what byte length and character length you
get.

irb(main):009:0> s.length
=> 3
irb(main):010:0> s.bytesize
=> 3

If you get a value < 3 for character length that would explain the
mismatch. Also, you can check contents of the subsequence that you
think should match:

irb(main):011:0> s[-2..-1]
=> "HI"
irb(main):012:0> s[1..-1]
=> "HI"

Kind regards

robert

aishafenton · September 9, 2010, 11:32am

Robert K. wrote:

You can check the encoding of the string like this:

Nope - the OP is using 1.8.7, which has no String#encoding (that
arrived in 1.9).