Forum: Ruby-core [ruby-trunk - Bug #7154][Open] For whatever reason \s doesn't match \u00a0.

Posted by t0d0r (Todor Dragnev) (Guest)
on 2012-10-14 01:38
(Received via mailing list)
Issue #7154 has been reported by t0d0r (Todor Dragnev).

----------------------------------------
Bug #7154: For whatever reason \s doesn't match \u00a0.
https://bugs.ruby-lang.org/issues/7154

Author: t0d0r (Todor Dragnev)
Status: Open
Priority: Normal
Assignee:
Category: core
Target version:
ruby -v: 1.9.3p286


The problem is already explained here:

http://stackoverflow.com/questions/2588942/convert...

I just hit it today.
Posted by "Martin J. Dürst" <duerst@it.aoyama.ac.jp> (Guest)
on 2012-10-15 01:52
(Received via mailing list)
My understanding is that in Ruby, all the pre-Unicode escapes, and in
particular "\s", still refer only to characters in the ASCII range.

My understanding is that this was done on purpose, for backwards
compatibility. This can be explained as follows: Maybe somebody wrote a
script doing some processing where they wanted to match ASCII 'space'
characters. They used \s. If Ruby changed \s to suddenly match far more
than before, the meaning of that program would change. Maybe it would
change in just the right way. But maybe it would change in an
unintended way.

So the decision was to not second-guess the programmer. As a result,
this does not behave the same way as what's suggested in Unicode TR #18.
But please note that UTR #18 doesn't *require* \s to be treated as
Unicode whitespace, it just *recommends* to do so (see
http://www.unicode.org/reports/tr18/#Compatibility...).

If you want to match against Unicode whitespace, what you should do is
the following:

"\u00a0" =~ /\p{Whitespace}/u
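
As a rough illustration of the difference (a sketch only, not part of the original report): the snippet below compares \s with the Unicode-aware POSIX bracket [[:space:]] and the property \p{Space}, and then uses a helper to clean up text that contains no-break spaces. The helper name normalize_space is hypothetical and made up for this example.

```ruby
# encoding: utf-8
# NO-BREAK SPACE (U+00A0) is Unicode whitespace but not ASCII whitespace.
nbsp = "\u00a0"

p nbsp =~ /\s/            # => nil  (\s is limited to [ \t\r\n\f\v])
p nbsp =~ /[[:space:]]/   # => 0    (POSIX bracket, Unicode-aware)
p nbsp =~ /\p{Space}/     # => 0    (property form of the same class)

# Hypothetical helper (not from this thread): collapse every run of
# Unicode whitespace into a single ASCII space and trim both ends.
def normalize_space(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

p normalize_space("a\u00a0\u00a0b \u2003c\u00a0")   # => "a b c"
```

Normalizing once at the input boundary like this leaves any existing \s-based patterns in a script untouched, which seems to be in the spirit of the compatibility argument above.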

Regards,    Martin.
Posted by "duerst (Martin Dürst)" <duerst@it.aoyama.ac.jp> (Guest)
on 2012-10-15 02:00
(Received via mailing list)
Issue #7154 has been updated by duerst (Martin Dürst).

Status changed from Open to Closed

My understanding is that this is a feature. See above for explanation. I 
hope somebody can provide the feedback to 
http://stackoverflow.com/questions/2588942/convert....
Posted by "Martin J. Dürst" <duerst@it.aoyama.ac.jp> (Guest)
on 2012-10-15 02:04
(Received via mailing list)
Just forgot to mention that the pickaxe book, for "\s", says "For
Unicode, add Line_Separator codepoints.".

This is wrong because even LINE SEPARATOR itself, \u2028, doesn't match
\s. It would also be wrong in that the result would be to match ASCII
whitespace and Unicode line separators, whereas other Unicode whitespace
would be ignored.
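
A quick way to check this claim is to list which whitespace characters \s misses. The following is only a sketch; the helper name missed_by_backslash_s and the sample list are made up for illustration.

```ruby
# encoding: utf-8
# Hypothetical helper: return the characters that the Unicode-aware
# [[:space:]] class accepts but that \s rejects.
def missed_by_backslash_s(chars)
  chars.select { |c| c =~ /[[:space:]]/ && c !~ /\s/ }
end

# ASCII whitespace first, then NO-BREAK SPACE, LINE SEPARATOR,
# PARAGRAPH SEPARATOR and IDEOGRAPHIC SPACE.
samples = [" ", "\t", "\n", "\u00a0", "\u2028", "\u2029", "\u3000"]

missed = missed_by_backslash_s(samples)
puts missed.map { |c| format("U+%04X", c.ord) }.join(" ")
# prints: U+00A0 U+2028 U+2029 U+3000
```

LINE SEPARATOR shows up in the list of misses, so \s does not treat line separators specially.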

Regards,   Martin.
Posted by t0d0r (Todor Dragnev) (Guest)
on 2012-10-16 11:41
(Received via mailing list)
Issue #7154 has been updated by t0d0r (Todor Dragnev).


duerst (Martin Dürst) wrote:

> My understanding is that this is a feature. See previous post for explanation. I
> hope somebody can provide the feedback to
> http://stackoverflow.com/questions/2588942/convert....

My understanding is that:

* We are surrounded by Unicode text; most Internet pages and
documents are UTF-8. If a language doesn't adapt to its surrounding
environment, it will be replaced by a new one that provides better tools
for the real situation. Not everyone in the world uses the English
alphabet as their primary script...

* We are all humans. To me, "white space" means white space in
the text. In this case, with \u00a0, I had to open a hex editor to see
what was wrong. I like the simplicity of Ruby and writing less code. All
good and popular programming languages are designed to help humans;
complexity kills popularity. Do you know anyone near you who writes
assembler these days?

* "String".downcase produces "string", so "Стринг".downcase should
produce "стринг", but it doesn't. That is understandable for 1.8.x, which
has no multibyte support. But why do I need a separate library in 1.9.x
to get proper results? UnicodeUtils.downcase("Стринг") works fine
(thanks, Stefan Lang). Does Ruby want to become the next PHP, with 10
methods doing one thing? http://www.tnx.nl/php.html. For me (and maybe
others), downcase/upcase/\s and similar methods in 1.9.x are useless. Why
do we have multibyte support without multi-language awareness? This seems
odd to me as a human...

* Firefox has lots of features and is now going to die, because its
developers did not heed users' warnings about memory management... :)
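
To make the downcase point concrete, here is a minimal workaround sketch for a Ruby whose String#downcase only touches ASCII letters, such as the 1.9.3 in this report. The helper name downcase_cyrillic is hypothetical, and it covers only the 33 letters of the Russian alphabet; it is not full Unicode case mapping. As far as I know, Ruby 2.4 and later apply Unicode case mapping in String#downcase itself, in which case the extra tr step simply changes nothing.

```ruby
# encoding: utf-8
# Hypothetical helper: lowercase ASCII via String#downcase, then map the
# basic Russian capitals (U+0410..U+042F plus Ё) onto their lowercase forms.
def downcase_cyrillic(text)
  text.downcase.tr("А-ЯЁ", "а-яё")
end

puts downcase_cyrillic("Стринг")   # prints: стринг
puts downcase_cyrillic("String")   # prints: string
```

For anything beyond one alphabet, a library such as the UnicodeUtils gem mentioned above is the more complete option.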



