Forum: Ruby-core [ruby-trunk - Bug #7845][Open] Strip doesn't handle unicode space characters in ruby 1.9.2 & 1.9.3 (

Posted by timothyg56 (Timothy Garnett) (Guest)
on 2013-02-13 16:31
(Received via mailing list)
Issue #7845 has been reported by timothyg56 (Timothy Garnett).

----------------------------------------
Bug #7845: Strip doesn't handle unicode space characters in ruby 1.9.2 & 
1.9.3 (does in 1.9.1)
https://bugs.ruby-lang.org/issues/7845

Author: timothyg56 (Timothy Garnett)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-linux]


Strip and associated methods in ruby 1.9.2 and 1.9.3 do not remove 
leading/trailing unicode space characters (such as non-breaking space 
\u00A0 and ideographic space \u3000) unlike ruby 1.9.1.  I'd expect the 
1.9.1 behavior.  Looking at the underlying native lstrip! and rstrip! 
methods it looks like this is because 1.9.1 uses rb_enc_isspace() 
whereas 1.9.2+ uses rb_isspace().

1.9.1p378 :001 > "\u3000\u00a0".strip
 => ""

1.9.2p320 :001 > "\u3000\u00a0".strip
 => "  "

1.9.3p286 :001 > "\u3000\u00a0".strip
 => "  "
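
For anyone wanting to see the distinction outside of strip itself, here is a minimal sketch. It assumes a UTF-8 source file and a Ruby recent enough to have String#match? (2.4 or later); the method name classify is made up for illustration. It compares the regexp class \s, which is ASCII-only in Ruby, with the POSIX bracket [[:space:]], which is Unicode-aware for UTF-8 strings.

```ruby
# Classify one character under two regexp notions of whitespace.
# Returns [matches \s, matches [[:space:]]].
def classify(ch)
  [ch.match?(/\A\s\z/), ch.match?(/\A[[:space:]]\z/)]
end

nbsp        = "\u00A0" # NO-BREAK SPACE
ideographic = "\u3000" # IDEOGRAPHIC SPACE

[" ", nbsp, ideographic].each do |ch|
  ascii_ws, unicode_ws = classify(ch)
  printf("U+%04X  \\s=%-5s  [[:space:]]=%s\n", ch.ord, ascii_ws, unicode_ws)
end
```

The two non-ASCII spaces are only recognised by the [[:space:]] form, which is the behaviour the report asks strip to have.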
Posted by ko1 (Koichi Sasada) (Guest)
on 2013-02-17 06:13
(Received via mailing list)
Issue #7845 has been updated by ko1 (Koichi Sasada).

Subject changed from Strip doesn't handle unicode space characters in 
ruby 1.9.2 & 1.9.3 (does in 1.9.1) to Strip doesn't handle unicode 
space characters in ruby 1.9.2 & 1.9.3 (does in 1.9.1)
Category set to M17N
Assignee set to naruse (Yui NARUSE)


Posted by naruse (Yui NARUSE) (Guest)
on 2013-02-17 16:31
(Received via mailing list)
Issue #7845 has been updated by naruse (Yui NARUSE).

Subject changed from Strip doesn't handle unicode space characters 
in ruby 1.9.2 & 1.9.3 (does in 1.9.1) to Strip doesn't handle unicode 
space characters in ruby 1.9.2 & 1.9.3 (does in 1.9.1)
Status changed from Open to Rejected

The behavior is intended.
see also 
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...
Posted by timothyg56 (Timothy Garnett) (Guest)
on 2013-02-25 21:14
(Received via mailing list)
Issue #7845 has been updated by timothyg56 (Timothy Garnett).


I'm not sure how convincing the linked conversation is.  It seems to be 
about case sensitivity issues in varying locales particularly around 
identifiers, but whether a unicode space is a whitespace or not is not 
locale dependent as far as I know.  It seems like strip, which is just 
whitespace, could easily be encoding aware while upcase/downcase and the 
like were ascii only for the cited complexity reasons.

It would be nice if strip removed the equivalent of [[:space:]] as it 
used to, but I guess that's what open source is for.  If anyone 
stumbling upon this wants to patch ruby itself to restore the old 
behavior, see https://gist.github.com/tgarnett/5032660 for a patch to the 
ruby source, or you can monkey patch a fix into String:

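class String
  # Remove leading whitespace, including Unicode spaces such as \u00A0.
  # \A anchors at the start of the string (^ would match at any line start).
  def lstrip
    sub(/\A[[:space:]]+/, '')
  end
  # Remove trailing whitespace; \z anchors at the very end of the string
  # ($ would match before any newline).
  def rstrip
    sub(/[[:space:]]+\z/, '')
  end
  def strip
    lstrip.rstrip
  end
  # etc. for ! versions
end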
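
The "!" versions are left as "etc." above. One possible shape for them, written as a standalone helper rather than a patch to String, is sketched below. The module name UnicodeStrip is hypothetical, and the sketch only assumes that the helper should follow the built-in strip! contract of returning nil when nothing was removed. It uses \A and \z so that only the two ends of the whole string are touched.

```ruby
# Hypothetical helper module; not part of Ruby or of the linked gist.
module UnicodeStrip
  module_function

  # Strips [[:space:]] characters from both ends of str in place.
  # Returns str if anything was removed, otherwise nil (like String#strip!).
  def strip!(str)
    changed_left  = str.sub!(/\A[[:space:]]+/, "")
    changed_right = str.sub!(/[[:space:]]+\z/, "")
    changed_left || changed_right ? str : nil
  end
end

s = +"\u3000hello\u00A0"   # unary plus gives an unfrozen copy
UnicodeStrip.strip!(s)
puts s.inspect
```

Keeping this out of String avoids changing strip for library code that relies on the ASCII-only behaviour.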


Posted by abscond (James Darling) (Guest)
on 2013-02-28 18:43
(Received via mailing list)
Issue #7845 has been updated by abscond (James Darling).


I'm not sure I understand the rationale behind rejecting this issue 
based on locale issues. I'm in support of this ticket, and will try and 
grab someone who can respond to naruse's concerns.
Posted by threedaymonk (Paul Battley) (Guest)
on 2013-03-13 15:57
(Received via mailing list)
Issue #7845 has been updated by threedaymonk (Paul Battley).


It's true that whitespace *can* be locale-dependent, at least insofar as 
Unix locales can specify which codepoints are to be considered as 
whitespace (in addition to space, tab, etc.).

The linked conversation about capitalisation doesn't seem relevant, 
though. The problem with capitalisation is that it really is 
language-specific: uppercase i is not I in Turkish, for example. Space, 
on the other hand, isn't; it's just that the inventory of spaces used and 
their encoding depend on the locale: you won't find a double-width 
space in English, and the representation in UTF-8 and EUC-JP is 
different.
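
The Turkish case can be made concrete with a short sketch. It assumes a much later Ruby (2.4 or newer) than the versions discussed in this thread, where the case-mapping methods accept a :turkic option; on 1.9.x these calls are not available.

```ruby
# Assumes Ruby 2.4+; the :turkic option does not exist in 1.9.x.
default_upper = "i".upcase            # plain mapping: LATIN CAPITAL LETTER I
turkic_upper  = "i".upcase(:turkic)   # U+0130, capital I with dot above
turkic_lower  = "I".downcase(:turkic) # U+0131, dotless small i

[default_upper, turkic_upper, turkic_lower].each do |s|
  printf("U+%04X\n", s.ord)
end
```

The same input letter maps to different code points depending on the language, which has no counterpart when deciding whether a code point is a space.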

However, given that String#upcase/downcase are basically useless for 
non-ASCII content, I'm not sure I'd expect strip to handle Unicode 
spaces either.
Posted by marcandre (Marc-Andre Lafortune) (Guest)
on 2013-03-13 16:12
(Received via mailing list)
Issue #7845 has been updated by marcandre (Marc-Andre Lafortune).

Status changed from Rejected to Open
Target version set to current: 2.1.0

Let's reopen this issue.

Yui: could you explain why strip wouldn't remove leading and trailing 
/\p{space}/ ? I can only see upside to this, but maybe you can point out 
downside?
Posted by timothyg56 (Timothy Garnett) (Guest)
on 2013-05-04 18:52
(Received via mailing list)
Issue #7845 has been updated by timothyg56 (Timothy Garnett).


A patch for this is pretty straightforward, see 
https://gist.github.com/tgarnett/5032660 which is only a couple of 
lines.

As someone dealing with a lot of web crawling and Chinese source data, 
having strip remove non-breaking / ideographic spaces is a real boon 
(particularly given the large amount of code we originally wrote 
for 1.9.1).
Posted by naruse (Yui NARUSE) (Guest)
on 2013-05-04 19:04
(Received via mailing list)
Issue #7845 has been updated by naruse (Yui NARUSE).

Status changed from Open to Rejected

marcandre (Marc-Andre Lafortune) wrote:
> Let's reopen this issue.
>
> Yui: could you explain why strip wouldn't remove leading and trailing 
> /\p{space}/ ? I can only see upside to this, but maybe you can point out downside?

Did you read [ruby-core:19379]?
Posted by marcandre (Marc-Andre Lafortune) (Guest)
on 2013-05-06 07:28
(Received via mailing list)
Issue #7845 has been updated by marcandre (Marc-Andre Lafortune).

Status changed from Rejected to Open
Assignee deleted (naruse (Yui NARUSE))

naruse (Yui NARUSE) wrote:
> Did you read [ruby-core:19379]?

I did.

Out of respect, I will assume that you read [ruby-core:53374] which 
explains nicely that [ruby-core:19379] has absolutely nothing to do with 
what constitutes a space and how `strip` should behave.

It would have been appreciated if instead of repeating your reference 
(which some of us believe not to be relevant) you would explain why you 
still feel it is of any relevance. This would be the helpful and polite 
thing to do.

I would also appreciate if you answered my questions too. For your 
convenience:

> Yui: could you explain why strip wouldn't remove leading and trailing 
> /\p{space}/ ? I can only see upside to this, but maybe you can point out downside?

Finally, could you please tell me why you rejected this issue again? 
If some people disagree with you, including a fellow committer, 
maybe the right thing to do is to explain yourself and ask Matz for his 
decision instead, would you not agree?
Posted by matz (Yukihiro Matsumoto) (Guest)
on 2013-05-06 10:58
(Received via mailing list)
Issue #7845 has been updated by matz (Yukihiro Matsumoto).


Five years have passed since the decision in [ruby-core:19379], and 
Unicode has almost taken over the world.
Maybe it's time to relax the limitation, at least when we are using 
Unicode.

Matz.

Posted by naruse (Yui NARUSE) (Guest)
on 2013-05-06 14:25
(Received via mailing list)
Issue #7845 has been updated by naruse (Yui NARUSE).


The current string-related policy is ASCII-based.
If it is changed, the issue is how widely the change is applied; for example:
* strip
* split
* upcase/downcase
----------------------------------------
Feature #7845: Strip doesn't handle unicode space characters in ruby 
1.9.2 & 1.9.3 (does in 1.9.1)
https://bugs.ruby-lang.org/issues/7845#change-39161

Author: timothyg56 (Timothy Garnett)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:


Posted by matz (Yukihiro Matsumoto) (Guest)
on 2013-05-06 14:53
(Received via mailing list)
Issue #7845 has been updated by matz (Yukihiro Matsumoto).


Everything that can be resolved without locale/language information (in 
most cases).
Case conversion may have problems with some characters, e.g. German 
eszett or Turkish i, but most characters can be converted.
Of course it would take time to implement everything, but that will be 
the basic principle (and goal).

Matz.

Posted by Tanaka Akira (Guest)
on 2013-05-06 15:46
(Received via mailing list)
2013/5/6 matz (Yukihiro Matsumoto) <matz@ruby-lang.org>:
>
> Everything that can be resolved without locale/language information (for most of
> the cases).
> Case conversion may have problems with some characters, e.g. german eszett or
> turkish i, but can be converted mostly.
> Of course it would take time to implement everything, but the basic principle
> (and goal) will be.

Be careful about network libraries.

Text-based network protocols themselves are not Unicode-based in general.
If a primitive used in such a library is extended to Unicode, it may
cause a mismatch with the protocol.

For example, the result of "grep -r strip lib/net" should be examined to
see whether strip is used properly or not.
Posted by matz (Yukihiro Matsumoto) (Guest)
on 2013-05-06 18:13
(Received via mailing list)
Issue #7845 has been updated by matz (Yukihiro Matsumoto).


Akira, thank you for pointing that out.

But it's hard for me to imagine concrete problematic cases.
When text from a network connection is marked as Unicode, it's OK to 
process it as Unicode text;
otherwise it should be marked as 'ASCII-8BIT' so that #strip and other 
methods behave as they do now.

Matz.

Posted by naruse (Yui NARUSE) (Guest)
on 2013-05-06 18:59
(Received via mailing list)
Issue #7845 has been updated by naruse (Yui NARUSE).


matz (Yukihiro Matsumoto) wrote:
> Akira, Thank you for pointing out.
>
> But it's hard for me to imagine concrete problematic cases.
> When text from network connection is marked as Unicode, that's OK to process
> them as Unicode text,
> otherwise they should be marked as 'ASCII-8BIT' so that #strip and other methods
> should behave as
> they are now.
>
> Matz.

Modern protocols like SMTPUTF8 <http://tools.ietf.org/html/rfc6532> and 
the URL Standard <http://url.spec.whatwg.org/>
use UTF-8 as their character encoding, but they use ASCII whitespace.
Posted by matz (Yukihiro Matsumoto) (Guest)
on 2013-05-07 03:03
(Received via mailing list)
Issue #7845 has been updated by matz (Yukihiro Matsumoto).


Thank you for the valuable input.

That indicates the need for something like `str.strip(:ascii)`, or 
the opposite, `str.strip(:utf8)`.

Matz.
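
To make the idea concrete, here is a rough sketch of how such a switch could look as a plain helper. Everything in it is hypothetical: String#strip takes no such argument, and the names strip_mode and STRIP_PATTERNS, the :unicode key (used here in place of :utf8), and the exact set of ASCII whitespace characters are assumptions made for illustration.

```ruby
# Hypothetical helper; not an existing Ruby API.
STRIP_PATTERNS = {
  # ASCII whitespace only: space, tab, CR, LF, form feed, vertical tab.
  ascii:   /\A[ \t\r\n\f\v]+|[ \t\r\n\f\v]+\z/,
  # Anything the POSIX [[:space:]] class accepts (Unicode-aware for UTF-8).
  unicode: /\A[[:space:]]+|[[:space:]]+\z/
}.freeze

# Returns a copy of str with leading and trailing whitespace removed,
# where "whitespace" is chosen by mode (:ascii or :unicode).
def strip_mode(str, mode = :ascii)
  pattern = STRIP_PATTERNS.fetch(mode) do
    raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
  str.gsub(pattern, "")
end

header = "\u00A0 value \u3000"
puts strip_mode(header, :ascii).inspect
puts strip_mode(header, :unicode).inspect
```

Defaulting to :ascii in this sketch keeps protocol code on the current behaviour unless the caller opts in.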
Posted by Tanaka Akira (Guest)
on 2013-05-07 08:08
(Received via mailing list)
2013/5/7 matz (Yukihiro Matsumoto) <matz@ruby-lang.org>:
>
> But it's hard for me to imagine concrete problematic cases.
> When text from network connection is marked as Unicode, that's OK to process
> them as Unicode text,
> otherwise they should be marked as 'ASCII-8BIT' so that #strip and other methods
> should behave as
> they are now.

I see.  It seems less harmful than I expected.

However, the encoding of a string can easily change if it contains
ASCII characters only.

% ruby -e 'p(("a".force_encoding("UTF-8") +
"b".force_encoding("ASCII-8BIT")).encoding)'
#<Encoding:UTF-8>

So I think your statement is a bit too weak to preserve current network 
libraries.
An appropriate restriction is "methods applied to ASCII-only strings
should behave as they do now regardless of the string's encoding
(ASCII-8BIT, UTF-8, etc)".
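
That restriction can be written down as a small check. This is only a sketch: the sample string and the list of encodings are arbitrary choices, and it exercises strip alone rather than every string method.

```ruby
# For ASCII-only input, strip should give the same bytes whatever the
# encoding tag on the string is.
sample = " \t HELO example.org \r\n"

results = %w[UTF-8 ASCII-8BIT US-ASCII].map do |name|
  s = sample.dup.force_encoding(name)
  raise "sample must be ASCII-only" unless s.ascii_only?
  s.strip.bytes
end

same = results.uniq.size == 1
puts "strip is encoding-independent for this sample: #{same}"
```

A check of this kind could be run before and after any change to strip to confirm that ASCII-based protocol code is unaffected.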

(I am assuming ASCII-based network protocols here, not Unicode-based ones
such as those naruse-san pointed out.)

The famous example of such dangerous Unicode behavior is Turkish case
conversion, but it is locale dependent and you already said that
locale-dependent behavior is not a target of this change.

I'm not sure that there are no such Unicode behaviors (locale-independent
but affecting ASCII-only strings), though.
Any experts here?