String.strip with UTF-8

Hi

I can’t strip the leading whitespace (or what at least looks like
whitespace) from a Ruby 1.9.2 string

ruby-1.9.2-p0 :002 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p0 :003 > d.entity.strip
=> " United Arab Emirates"
ruby-1.9.2-p0 :004 > d.entity.class
=> String
ruby-1.9.2-p0 :005 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p0 :006 >

It’s inside the Rails 3.0.3 console…

Erik

On Wednesday, January 12, 2011 03:28:38 pm Erik E. wrote:

ruby-1.9.2-p0 :004 > d.entity.class
=> String
ruby-1.9.2-p0 :005 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p0 :006 >

It’s inside the Rails 3.0.3 console…

Try this:

d.entity[0].ord

I’m not sure how useful that will be, but you can compare it to that of a
space. It seems to be Unicode-aware:

ruby-1.9.2-p136 :020 > '☃'.ord
=> 9731
ruby-1.9.2-p136 :021 > _.to_s 16
=> "2603"
ruby-1.9.2-p136 :022 > "\u2603"
=> "☃"

And for good measure:

ruby-1.9.2-p136 :023 > _.ord
=> 9731

(If you’re wondering, that underscore means “the result of the last
command I entered into IRB.” It’s fantastically useful, though it gets
annoying when you want to repeat commands using the up arrow, etc.)

So, if you get something other than:

ruby-1.9.2-p136 :024 > ' '.ord
=> 32

…then it’s not a space. At that point, maybe report a bug, but you may
also be able to work around it with a regex or something.
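
The same check can be run over a whole string instead of just the first character. A minimal sketch, assuming a hypothetical helper named `describe_chars` (not part of Ruby or Rails) and a made-up sample string; it lists every character's code point so an impostor space stands out:

```ruby
# Hypothetical helper: list each character of a string as a U+XXXX code point.
def describe_chars(str)
  str.each_char.map { |ch| format("U+%04X", ch.ord) }
end

sample = "\u00A0Un"             # assumed sample: NO-BREAK SPACE followed by "Un"
p describe_chars(sample)        # ["U+00A0", "U+0055", "U+006E"]
p sample[0].ord == " ".ord      # false: 160 is not 32
```

Anything other than U+0020 in the first slot means the leading character is not an ordinary space.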

Hi, I did a fresh install of Ruby 1.9.2-p0 (via rvm) and Rails 3.0.3,
and I cannot reproduce your problem. Maybe you could replay what I did
and see whether you can still reproduce it?

Also, to examine that first character in detail, what is the
result when you try this:

009:0> d.entity.bytes.to_a[0..5]
=> [32, 85, 110, 105, 116, 101]

I see a “regular” space (character 32 in decimal notation)
as first character.

HTH,

Peter

peterv@ASUS:~/ra/apps/trials$ rvm install 1.9.2-p0
/home/peterv/.rvm/rubies/ruby-1.9.2-p0, this may take a while depending
on your cpu(s)…

ruby-1.9.2-p0 - #fetching

Install of ruby-1.9.2-p0 - #complete

peterv@ASUS:~/ra/apps/trials$ rvm use 1.9.2-p0
Using /home/peterv/.rvm/gems/ruby-1.9.2-p0

peterv@ASUS:~/ra/apps/trials$ rvm gemset create rails3
‘rails3’ gemset created (/home/peterv/.rvm/gems/ruby-1.9.2-p0@rails3).

peterv@ASUS:~/ra/apps/trials$ rvm gemset use rails3
Now using gemset ‘rails3’

peterv@ASUS:~/ra/apps/trials$ gem install rails --no-rdoc --no-ri
Successfully installed activesupport-3.0.3
Successfully installed builder-2.1.2
Successfully installed i18n-0.5.0
Successfully installed activemodel-3.0.3
Successfully installed rack-1.2.1
Successfully installed rack-test-0.5.7
Successfully installed rack-mount-0.6.13
Successfully installed tzinfo-0.3.23
Successfully installed abstract-1.0.0
Successfully installed erubis-2.6.6
Successfully installed actionpack-3.0.3
Successfully installed arel-2.0.6
Successfully installed activerecord-3.0.3
Successfully installed activeresource-3.0.3
Successfully installed mime-types-1.16
Successfully installed polyglot-0.3.1
Successfully installed treetop-1.4.9
Successfully installed mail-2.2.14
Successfully installed actionmailer-3.0.3
Successfully installed thor-0.14.6
Successfully installed railties-3.0.3
Successfully installed bundler-1.0.7
Successfully installed rails-3.0.3
23 gems installed

peterv@ASUS:~/ra/apps/trials$ rails new issue_with_strip
create

create vendor/plugins/.gitkeep
peterv@ASUS:~/ra/apps/trials$ cd issue_with_strip/
peterv@ASUS:~/ra/apps/trials/issue_with_strip$ bundle install
Fetching source index for http://rubygems.org/
Using rake (0.8.7)

Using rails (3.0.3)
Installing sqlite3-ruby (1.3.2) with native extensions
Your bundle is complete! Use bundle show [gemname] to see where a
bundled gem is installed.

peterv@ASUS:~/ra/apps/trials/issue_with_strip$ rails g model D entity:string
invoke active_record
create db/migrate/20110112222955_create_ds.rb
create app/models/d.rb
invoke test_unit
create test/unit/d_test.rb
create test/fixtures/ds.yml

peterv@ASUS:~/ra/apps/trials/issue_with_strip$ rake db:migrate
(in /home/peterv/data/back/rails-apps/apps/trials/issue_with_strip)
== CreateDs: migrating

-- create_table(:ds)
   -> 0.0010s
== CreateDs: migrated (0.0011s)

peterv@ASUS:~/ra/apps/trials/issue_with_strip$ rails c
Loading development environment (Rails 3.0.3)
001:0> IRB.prompt_mode=:RVM # this is a local patch
=> :RVM
ruby-1.9.2-p0 :002 > d = D.create :entity => " United Arab Emirates"
=> #<D id: 1, entity: " United Arab Emirates", created_at: "2011-01-12 22:31:21", updated_at: "2011-01-12 22:31:21">
ruby-1.9.2-p0 :003 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p0 :004 > d.entity.strip
=> "United Arab Emirates"
ruby-1.9.2-p0 :005 > d.entity.class
=> String
ruby-1.9.2-p0 :006 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p0 :007 > exit

peterv@ASUS:~/ra/apps/trials/issue_with_strip$ rails c
Loading development environment (Rails 3.0.3)
001:0> d = D.find :last
=> #<D id: 1, entity: " United Arab Emirates", created_at: "2011-01-12 22:31:21", updated_at: "2011-01-12 22:31:21">
002:0> d.entity
=> " United Arab Emirates"
003:0> d.entity.strip
=> "United Arab Emirates"

Thank you for the quick replies, David & Peter. I was upgrading Ruby to
see if it made a difference, but I can see now that it’s not a space,
which explains why it didn’t strip:

Loading development environment (Rails 3.0.3)
ruby-1.9.2-p136 :001 > d = Domain.last
=> #<Domain id: 2055, classification: "Internationalized Country Code
Top Level Domain", dns_name: "xn--mgbaam7a8h", idn_name: "امارات.",
entity: " United Arab Emirates", explanation: "imārāt", notes: nil,
related_id: 1795, idn: true, dnssec: false, created_at: "2011-01-12
19:04:54", updated_at: "2011-01-12 19:04:54">
ruby-1.9.2-p136 :002 > d.entity
=> " United Arab Emirates"
ruby-1.9.2-p136 :003 > d.entity.class
=> String
ruby-1.9.2-p136 :004 > d.entity.encoding
=> #<Encoding:UTF-8>
ruby-1.9.2-p136 :005 > d.entity[0].ord
=> 160
ruby-1.9.2-p136 :006 > d.entity.bytes.to_a
=> [194, 160, 85, 110, 105, 116, 101, 100, 32, 65, 114, 97, 98, 32, 69,
109, 105, 114, 97, 116, 101, 115]
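
For anyone reading along, the leading byte pair 194, 160 is the UTF-8 encoding of code point 160, which is why `[0].ord` and the byte dump disagree about how many "things" come before the U. A small sketch that rebuilds the start of the string from the first four bytes above (the variable names are just for illustration):

```ruby
bytes = [194, 160, 85, 110]     # first four bytes from the output above
text  = bytes.pack("C*").force_encoding("UTF-8")

p text.valid_encoding?   # true
p text.length            # 3 characters, although there are 4 bytes
p text.codepoints.to_a   # [160, 85, 110]
p text.strip == text     # true: String#strip leaves code point 160 alone
```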

Cool, thanks for that! I can just gsub/gsub! it out now that I know what
it is.
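
Roughly what that could look like, as a sketch only: the constant name `NBSP` and the sample value are assumptions for illustration, not taken from the real model.

```ruby
NBSP = "\u00A0"   # NO-BREAK SPACE, code point 160

entity  = "#{NBSP}United Arab Emirates"   # assumed sample value
cleaned = entity.gsub(NBSP, " ").strip    # turn NBSP into a plain space, then strip

p cleaned   # "United Arab Emirates"
```

Replacing with a plain space before stripping, rather than deleting outright, keeps words apart if the character also shows up in the middle of a value.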

zimbatm … wrote in post #974462:

2011/1/12 Erik E.:

Thank you for quick reply David & Peter, I was upgrading Ruby to see if
it made a difference, but I can see it’s not a space now which explains
why it didn’t strip

Yeah, it’s the dreaded non-breaking space (U+00A0). Unfortunately,
somebody thought it would be nice to map Alt+Space to this character on
some keymaps (like mine, which is Swiss-French). If you’re on a Mac, see
my solution here:
http://0x2a.im/2009/04/16/terminal-unicode-problem-2.html

On Jan 12, 2011, at 16:43, Erik E. wrote:

Cool, thanks for that! I can just gsub/gsub! it out now that I know what
it is.

That will work if NO-BREAK SPACE is the only space you’ll encounter.

s.gsub(/\A[[:space:]]*(.*?)[[:space:]]*\z/) { $1 }

will remove:
Space_Separator | Line_Separator | Paragraph_Separator | 0009 | 000A |
000B | 000C | 000D | 0085

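See section 6 of the linked reference (presumably the Oniguruma regular
expression documentation; the link title now reads “Notice of service
termination”).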

PS: Note that s.gsub(/()/, '\1') may alter the encoding of the result
string.
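
A variation on the same idea, as a hedged sketch: the helper name `unicode_strip` is hypothetical, and the pattern is a swapped-in alternative to the capture-group regex above, not the same expression. It deletes only the leading and trailing runs of `[[:space:]]`, so it also copes with strings that contain newlines in the middle:

```ruby
# Hypothetical helper: strip leading/trailing Unicode whitespace, NBSP included.
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

p unicode_strip("\u00A0 United Arab Emirates\t\u2003")  # "United Arab Emirates"
p unicode_strip("no padding")                            # "no padding"
```

This relies on the string being in a Unicode encoding such as UTF-8, as `d.entity` is in this thread.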
