This is mostly a Ruby thing, and partly a Rails thing.

I'm expecting a validates_format_of with a regex like this

    /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/

to allow many of the normal characters like ö é å to be submitted via web form. However, the extended characters are being rejected.

This works just fine though (which is just a-zA-Z):

    /^[\x41-\x5A\x61-\x7A\.\'\-\ ]*?$/

It also seems to fail with full \x0000 numbers. Is there a limit at \xFF?

Some plain Ruby tests seem to suggest unicode characters don't work at all:

    p 'abvHgtwHFuG'.scan(/[a-z]/)
    # => ["a", "b", "v", "g", "t", "w", "u"]
    p 'abvHgtwHFuG'.scan(/[A-Z]/)
    # => ["H", "H", "F", "G"]
    p 'abvHgtwHFuG'.scan(/[\x41-\x5A]/)
    # => ["H", "H", "F", "G"]
    p 'abvHgtwHFuG'.scan(/[\x61-\x7A]/)
    # => ["a", "b", "v", "g", "t", "w", "u"]
    p 'aébvHögtåwHÅFuG'.scan(/[\xC0-\xD6\xD9-\xF6\xF9-\xFF]/)
    # => ["\303", "\303", "\303", "\303"]

So, what's the secret to using unicode character ranges in Ruby regex (or Rails validations)?

--
def gw
  acts_as_n00b
  writes_at(www.railsdev.ws)
end
on 30.11.2007 21:18
on 30.11.2007 22:10
On Nov 30, 2:18 pm, Greg Willits <li...@gregwillits.ws> wrote: > So, what's the secret to using unicode character ranges in Ruby regex > (or Rails validations)? Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006 Ruby Conference. His presentation can be found at: http://www.tbray.org/talks/rubyconf2006.pdf He described how many member functions have trouble dealing with these character sets. He made special reference to regular expressions. --Dale
on 30.11.2007 23:01
Dale Martenson wrote: > On Nov 30, 2:18 pm, Greg Willits <li...@gregwillits.ws> wrote: > >> So, what's the secret to using unicode character ranges in Ruby regex >> (or Rails validations)? > > Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006 > Ruby Conference. His presentation can be found at: > > http://www.tbray.org/talks/rubyconf2006.pdf > > He described how many member functions have trouble dealing with these > character sets. He made special reference to regular expressions. That's just beyond sad. I've been using Lasso for several years now, and *2003* it provided complete support for Unicode. I know there's some esoterics it may not deal with, but for all practical purposes we can round-trip data in western and eastern languages with Lasso quite easily. How can all these other languages be so far behind? Pretty bad if I can't even allow Mr. Muños or Göran to enter their names in a web form with proper server side validations. Aargh. -- gw
on 01.12.2007 06:25
On Nov 30, 4:00 pm, Greg Willits <li...@gregwillits.ws> wrote:
> How can all these other languages be so far behind?
>
> Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
> in a web form with proper server side validations. Aargh.

Ruby 1.8 doesn't have unicode support (1.9 is starting to get it). Everything in ruby is a bytestring.

    irb(main):001:0> 'aébvHögtåwHÅFuG'.scan(/./)
    => ["a", "\303", "\251", "b", "v", "H", "\303", "\266", "g", "t", "\303", "\245", "w", "H", "\303", "\205", "F", "u", "G"]

So your character class is matching the first byte of the composite characters (which is \303 in octal), and skipping the next (since it's below the range). You probably want something like...

    irb(main):006:0> reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
    => /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
    irb(main):007:0> 'aébvHögtåwHÅFuG'.scan(reg)
    => ["\303\251", "\303\266", "\303\245", "\303\205"]
    irb(main):008:0> "å" == "\303\245"
    => true

Ps. I'm not entirely sure the value of the second character class is right.

Regards,
Jordan
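The byte-level view described above can be reproduced with a minimal sketch. It assumes a modern Ruby (1.9 or later, where strings carry an encoding) rather than the 1.8 interpreter discussed in this thread, and `sample` is only an illustrative name.

```ruby
# Assumes Ruby 1.9+; \u escapes keep this source ASCII-only.
sample = "a\u00E9b"            # the three characters a, é, b

# Character view: one entry per character.
chars = sample.chars

# Byte view: é (U+00E9) occupies two bytes in UTF-8, 0xC3 0xA9.
bytes = sample.bytes

# The same bytes in octal, the notation Ruby 1.8 used when printing them.
octal = bytes.map { |b| b.to_s(8) }

puts chars.length
puts bytes.inspect
puts octal.inspect
```

The octal values 303 and 251 are the same two bytes that show up as "\303" and "\251" in the irb output above.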
on 01.12.2007 11:16
> Unicode in Regex > Posted by Greg Willits (-gw-) on 30.11.2007 21:18 > This is mostly a Ruby thing, and partly a Rails thing. > > I'm expecting a validate_format_of with a regex like this > > /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/ > > to allow many of the normal characters like ö é å to be submitted via > web form. How about the utf8 validation regex here: http://snippets.dzone.com/posts/show/4527 ?
on 02.12.2007 21:35
Greg Willits wrote:
> I'm expecting a validates_format_of with a regex like this
> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> to allow many of the normal characters like ö é å to be submitted via
> web form. However, the extended characters are being rejected.

So, I've been pounding the web for info on UTF8 in Ruby and Rails the past couple days to concoct some validations that allow UTF8 characters. I have discovered that I can get a little further by doing the following:

- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.

The only thing not working now is the ability to define a range of \x characters in a regex.

So, this

    /^[a-zA-Z\xE4]*?&/u

will validate that a string is allowed to have an ä in it. Perfect.

But... this fails

    /^[a-zA-Z\xE4-\xE6]*?&/u

But... this works

    /^[a-zA-Z\xE4\xE5\xE6]*?&/u

I've boiled the experiments down to realizing I can't define a range with \x.

Is this just one of those things that just doesn't work yet WRT Ruby/Rails/UTF8, or is there another syntax? I've scoured all the regex docs I can find, and they seem to indicate a range should work.

For now, I just have all the characters I want included < \xFF listed individually.

    utf_accents = '\xC0\xC1\xC2\.......'
    Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u

But I'd like to solve the range notation if I can.

--
def gw
  acts_as_n00b
  writes_at(www.railsdev.ws)
end
on 03.12.2007 02:19
MonkeeSage wrote:
> Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).

It enrages me to see this kind of FUD. Through regular expressions, ruby 1.8 has 80-90% complete utf8 support. And oniguruma makes utf8 support well-nigh 100% complete.

    >> 'aébvHögtåwHÅFuG'.scan(/./u)
    => ["a", "é", "b", "v", "H", "ö", "g", "t", "å", "w", "H", "Å", "F", "u", "G"]
    >> 'aébvHögtåwHÅFuG'.scan(/[éöåÅ]/u)
    => ["é", "ö", "å", "Å"]

Ok, sometimes you have to take a weird approach because of the missing 10-20%, but it's still workable:

    >> 'aébvHögtåwHÅFuG'.scan(/(?:\303\251|\303\266|\303\245|\303\205)/u)
    => ["é", "ö", "å", "Å"]

> Everything in ruby is a bytestring.

YES! And that's exactly how it should be. Who is it that spread the flawed idea that strings are fundamentally made of characters? I'd like to slap him around a little. Fundamentally, ever since the word "string" was applied to computing, strings were made of 8-BIT CHARS, not n-bit characters. If only the creators of C had called that datatype "byte" instead of "char" it would have saved us so many misunderstandings.

Usually the complaint about the lack of unicode support is that something like "日本語".length returns 9 instead of 3, or that "日本語".index("語") returns 6 instead of 2. It's nice that people want to completely redefine the API to return character positions and all that, but please don't complain that it's broken just because you happen to be using it incorrectly. Use the right tool for the job. SQL for database queries, non-home-brewed crypto libraries for security, regular expressions for string manipulation.

I'm terribly sorry for the rant but I had to get it off my chest.

Dan
on 03.12.2007 02:41
Greg Willits wrote:
> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.

Let me try to explain that in order to redeem myself from my previous angry post.

Basically, \xE4 is counted as the byte value 0xE4, not the unicode character U+00E4. And in a range expression, each escaped value is taken as one character within the range. Which results in not-immediately obvious situations:

    >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
    => []
    >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
    => ["é"]

What is happening in the first case is that the string does not contain characters \303 or \251 because those are invalid utf8 sequences. But when the value "\303\251" is *inlined* into the regex, that is recognized as the utf8 character "é" and a match is found.

So ranges *do* work in utf8 but you have to be careful:

    >> "àâäçèéêîïôü".scan(/[ä-î]/u)
    => ["ä", "ç", "è", "é", "ê", "î"]
    >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
    => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250", "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303", "\264", "\303", "\274"]
    >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
    => ["ä", "ç", "è", "é", "ê", "î"]

Hope this helps.

Dan
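For comparison, here is a small sketch of the same range written with code point escapes. It assumes a modern Ruby (1.9 or later), where `\uXXXX` inside a regexp names a code point rather than a byte; it is not something the 1.8 engine discussed here accepts, and the variable names are illustrative.

```ruby
# Assumes Ruby 1.9+; not valid for the Ruby 1.8 regexp engine.
# The string is the same sample as above, spelled with \u escapes.
text = "\u00E0\u00E2\u00E4\u00E7\u00E8\u00E9\u00EA\u00EE\u00EF\u00F4\u00FC"

# \uXXXX names a code point, so this range runs from ä (U+00E4) to î (U+00EE).
in_range = text.scan(/[\u00E4-\u00EE]/)

puts in_range.length
```

The match contains the same six characters as the `/[ä-î]/u` example, without having to inline raw bytes into the pattern.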
on 03.12.2007 02:50
On Dec 2, 2:35 pm, Greg Willits <li...@gregwillits.ws> wrote:

This seems to work...

    $KCODE = "UTF8"
    p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "Jsp...it works" # => 0

However, it looks to me like it would be more robust to use a slightly modified version of UTF8REGEX (found in the link Jimmy posted above)...

    UTF8REGEX = /\A(?:
      [a-zA-Z\.\-\'\ ]
      | [\xC2-\xDF][\x80-\xBF]
      | \xE0[\xA0-\xBF][\x80-\xBF]
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      | \xED[\x80-\x9F][\x80-\xBF]
      | \xF0[\x90-\xBF][\x80-\xBF]{2}
      | [\xF1-\xF3][\x80-\xBF]{3}
      | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*\z/mnx

    p UTF8REGEX =~ "Jsp...it works here too" # => 0

Look at the link to see the explanation of the alternations.

Regards,
Jordan
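As a rough modern counterpart to the two approaches above, the sketch below separates the two concerns: well-formedness of the bytes, and which characters are allowed. It assumes Ruby 1.9 or later and that form input should be interpreted as UTF-8. `PERSON_NAME` and `valid_person_name?` are hypothetical names, and the code point ranges are simply the ones from the original question.

```ruby
# Hypothetical names; assumes Ruby 1.9+ and input that is meant to be UTF-8.
# \A and \z anchor the whole string (^ and $ only anchor lines in Ruby).
PERSON_NAME = /\A[a-zA-Z\u00C0-\u00D6\u00D9-\u00F6\u00F9-\u00FF.'\- ]*\z/

def valid_person_name?(input)
  # Treat the incoming bytes as UTF-8 without transcoding them.
  utf8 = input.dup.force_encoding(Encoding::UTF_8)
  # Reject malformed byte sequences before running the pattern.
  return false unless utf8.valid_encoding?
  !(PERSON_NAME =~ utf8).nil?
end

puts valid_person_name?("G\u00F6ran")   # Göran
puts valid_person_name?("Mu\u00F1os")   # Muños
puts valid_person_name?("abc1")
```

Note that the ranges taken from the question skip U+00D7, U+00D8, U+00F7 and U+00F8, so × and ÷ are rejected, but so are Ø and ø; whether that is intended is a decision for the application.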
on 03.12.2007 02:56
Greg Willits wrote:
> So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
> to have an ä in it. Perfect.

If that actually works, it means you are really using ISO-8859-1 strings, not UTF-8.

> utf_accents = '\xC0\xC1\xC2\.......'

Nope, that's not UTF-8. UTF-8 characters ÀÁÂ would look like

    utf_accents = "\xC3\x80\xC3\x81\xC3\x82..."

Dan
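The difference between the two encodings can be checked directly. This is a sketch for a modern Ruby (1.9 or later, which has `String#encode`), not for 1.8; the variable names are illustrative.

```ruby
# Assumes Ruby 1.9+.
a_umlaut = "\u00E4"                                   # ä as a UTF-8 string

utf8_bytes   = a_umlaut.bytes                         # two bytes, 0xC3 0xA4
latin1_bytes = a_umlaut.encode("ISO-8859-1").bytes    # one byte, 0xE4

# A lone 0xE4 byte is not a valid UTF-8 sequence ...
lone_byte = "\xE4".dup.force_encoding("UTF-8")
# ... but it is a valid ISO-8859-1 string, which transcodes to UTF-8 ä.
recovered = "\xE4".dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts utf8_bytes.inspect
puts latin1_bytes.inspect
puts lone_byte.valid_encoding?
```

So a pattern containing a bare \xE4 can only match ä when the data itself is ISO-8859-1 (or another single-byte encoding that puts ä at 0xE4).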
on 03.12.2007 20:47
Daniel DeLorme wrote: > Greg Willits wrote: >> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u >> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u >> >> I've boiled the experiments down to realizing I can't define a range >> with \x > Let me try to explain that in order to redeem myself from my previous > angry post. :-) > Basically, \xE4 is counted as the byte value 0xE4, not the unicode > character U+00E4. And in a range expression, each escaped value is taken > as one character within the range. Which results in not-immediately > obvious situations: > > >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) > => [] > >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u) > => ["é"] OK, I see oniguruma docs refer to \x as encoded byte value and \x{} as a character code point -- which with your explanation I can finally tie together what that means. Took me a second to recognize the #{} as Ruby and not some new regex I'd never seen :-P And I realize now too I wasn't picking up on the use of octal vs decimal. Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant? > What is happening in the first case is that the string does not contain > characters \303 or \251 because those are invalid utf8 sequences. But > when the value "\303\251" is *inlined* into the regex, that is > recognized as the utf8 character "é" and a match is found. > > So ranges *do* work in utf8 but you have to be careful: > > >> "àâäçèéêîïôü".scan(/[ä-î]/u) > => ["ä", "ç", "è", "é", "ê", "î"] > >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u) > => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250", > "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303", > "\264", "\303", "\274"] > >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u) > => ["ä", "ç", "è", "é", "ê", "î"] > > Hope this helps. Yes! -- gw
on 03.12.2007 22:07
>> Basically, \xE4 is counted as the byte value 0xE4, not the unicode >> character U+00E4. And in a range expression, each escaped value is taken >> as one character within the range. Which results in not-immediately >> obvious situations: >> >> >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) >> => [] >> >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u) >> => ["é"] OK, one thing I'm still confused about -- when I look up é in any table, it's DEC is 233 which converted to OCT is 351, yet you're using 251 (and indeed it seems like reducing the OCTs I come up with by 100 is what actually works). Where is this 100 difference coming from? -- gw
on 03.12.2007 22:20
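On Dec 3, 2:07 pm, Greg Willits <li...@gregwillits.ws> wrote:
> Where is this 100 difference coming from?

http://www.fileformat.info/info/unicode/char/00e9/index.htm

The code point (which is also the UTF-16 value) is 233 decimal, i.e. 0xE9 or octal 351. The UTF-8 encoding of that code point is the two bytes 0xC3 0xA9, which is 195 169 in decimal, or 0303 0251 in octal. The tables give you the code point; the regex is matching the encoded bytes.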
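The arithmetic behind those two bytes can be worked through by hand. The sketch below applies the standard two-byte UTF-8 layout (used for U+0080 through U+07FF) to é and compares the result with Ruby's own encoder; variable names are illustrative, and `String#bytes` returning an array assumes Ruby 2.0 or later.

```ruby
code_point = 0xE9                        # é: decimal 233, octal 351

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
lead  = 0xC0 | (code_point >> 6)         # upper bits go into the lead byte
trail = 0x80 | (code_point & 0x3F)       # low 6 bits go into the trail byte

manual  = [lead, trail]
builtin = [code_point].pack("U").bytes   # Ruby's own UTF-8 encoder

octal = manual.map { |b| format("%o", b) }.join(" ")
puts octal
```

The lead byte comes out as 0xC3 (octal 303) and the trail byte as 0xA9 (octal 251), which is why the octal value from a code point table (351) never appears in the encoded string.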
on 03.12.2007 23:21
On Dec 3, 12:47 pm, Greg Willits <li...@gregwillits.ws> wrote:
on 04.12.2007 01:32
Daniel DeLorme wrote: > Usually the complaint about the support lack of unicode support is that > something like "日本語".length returns 9 instead of 3, or that "日本語 > ".index("語") returns 6 instead of 2. It's nice that people want to > completely redefine the API to return character positions and all that, > but please don't complain that it's broken just because you happen to be > using it incorrectly. Use the right tool for the job. SQL for database > queries, non-home-brewed crypto libraries for security, regular > expressions for string manipulation. > > I'm terribly sorry for the rant but I had to get it off my chest. Regular expressions for all character work would be a *terribly* slow way to get things done. If you want to get the nth character, should you do a match for n-1 characters and a group to grab the nth? Or would it be better if you could just index into the string and have it do the right thing? How about if you want to iterate over all characters in a string? Should the iterating code have to know about the encoding? Should you use a regex to peel off one character at a time? Absurd. Regex for string access goes a long way, but's just about the heaviest way to do it. Strings should be aware of their encoding and should be able to provide you access to characters as easily as bytes. That's what 1.9 (and upcoming changes in JRuby) fixes. - Charlie
on 04.12.2007 01:45
On Dec 2, 7:40 pm, Daniel DeLorme <dan...@dan42.com> wrote:
> >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> "\264", "\303", "\274"]
> >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
>
> Hope this helps.
>
> Dan

I missed your ranting.

Firstly, ruby doesn't have unicode support in 1.8, since unicode *IS* a standard mapping of bytes to *characters*. That's what unicode is. I'm sorry you don't like that, but don't lie and say ruby 1.8 supports unicode when it knows nothing about that standard mapping and treats everything as individual bytes (and any byte with a value greater than 126 just prints an octal escape); and please don't accuse others of spreading FUD when they state the facts.

Secondly, as I said in my first post to this thread, the characters trying to be matched are composite characters, which requires you to match both bytes. You can try using a unicode regexp, but then you run into the problem you mention--the regexp engine expects the pre-composed, one-byte form...

    "ò".scan(/[\303\262]/u) # => []
    "ò".scan(/[\xf2]/u) # => ["\303\262"]

...which is why I said it's more robust to use something like the regexp that Jimmy linked to and I reposted, instead of a unicode regexp.

Regards,
Jordan
on 04.12.2007 01:50
On Dec 3, 1:47 pm, Greg Willits <li...@gregwillits.ws> wrote: > :-) > > Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant? Oniguruma is not in ruby 1.8 (though you can install it as a gem). It is in 1.9.
on 04.12.2007 05:56
Jordan Callicoat wrote: > On Dec 3, 1:47 pm, Greg Willits <li...@gregwillits.ws> wrote: >> :-) >> >> Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant? > Oniguruma is not in ruby 1.8 (though you can install it as a gem). It > is in 1.9. Oh. I always thought Oniguruma was the engine in Ruby. Anyway -- everyone, thanks for all the input. I believe I'm headed in the right direction now, and have a better hands on understanding of UTF-8. -- gw
on 04.12.2007 09:48
Charles Oliver Nutter wrote: > Regular expressions for all character work would be a *terribly* slow > way to get things done. If you want to get the nth character, should you > do a match for n-1 characters and a group to grab the nth? Or would it > be better if you could just index into the string and have it do the Ok, I'm not very familiar with the internal working of strings in 1.9, but it seems to me that for character sets with variable byte size, it is logically *impossible* to directly index into the string. Unless there's some trick I'm unaware of, you *have* to count from the beginning of the string for utf8 strings. > right thing? How about if you want to iterate over all characters in a > string? Should the iterating code have to know about the encoding? > Should you use a regex to peel off one character at a time? That is certainly one possible way of doing things... string.scan(/./){ |char| do_someting_with(char) } > Regex for string access goes a long way, but's just about the heaviest > way to do it. Heavy compared to what? Once compiled, regex are orders of magnitude faster than jumping in and out of ruby interpreted code. > Strings should be aware of their encoding and should be > able to provide you access to characters as easily as bytes. That's what > 1.9 (and upcoming changes in JRuby) fixes. Overall I agree that the encoding stuff in 1.9 is very nice. Encapsulating the encoding with the string is very OO. Very intuitive. No need to think about encoding anymore, now it "just works" for encoding-ignorant programmers (at least until the abstraction leaks). It allows to shut up one frequent complaint about ruby; a clear political victory. Overall it is more robust and less error-prone than the 1.8 way. But my point was that there *is* a 1.8 way. The thing that riled me up and that I was responding to was the claim that 1.8 did not have unicode support AT ALL. Unequivocally, it does, and it works pretty well for me. 
IMHO there is a certain minimalist elegance in considering strings as encoding-agnostic and using regex to get encoding-specific views. I could do str[/./n] and str[/./u]; I can't do that anymore. 1.9 makes encodings easier for the english-speaking masses not used to extended characters, but let's remember that ruby *always* had support for multibyte character sets; after all it *did* originate from a country with two gazillion "characters". Daniel
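To see the two views side by side, here is a short sketch for a modern Ruby (1.9 or later), where a string knows its encoding. It compares character and byte counts for the 日本語 example from earlier in the thread and walks the characters both with a regexp and with `each_char`; variable names are illustrative.

```ruby
# Assumes Ruby 1.9+; the string is 日本語 written with \u escapes.
word = "\u65E5\u672C\u8A9E"

char_count = word.length            # counts characters
byte_count = word.bytesize          # counts bytes of the UTF-8 encoding
char_pos   = word.index("\u8A9E")   # character offset of 語

# Two ways to walk the characters: the regexp approach and each_char.
via_scan      = word.scan(/./m)
via_each_char = word.each_char.to_a

puts char_count
puts byte_count
puts char_pos
```

Both iteration styles yield the same three characters here; which one is faster depends on the Ruby version and is worth measuring rather than assuming.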
on 04.12.2007 10:08
MonkeeSage wrote:
> Firstly, ruby doesn't have unicode support in 1.8, since unicode *IS*
> a standard mapping of bytes to *characters*. That's what unicode is.
> I'm sorry you don't like that, but don't lie and say ruby 1.8 supports
> unicode when it knows nothing about that standard mapping and treats
> everything as individual bytes (and any byte with a value greater than
> 126 just prints an octal escape)

Ok, then how do you explain this:

    >> $KCODE='u'
    => "u"
    >> "abc\303\244".scan(/./)
    => ["a", "b", "c", "ä"]

This doesn't require any libraries, and it seems to my eyes that ruby is converting 5 bytes into 4 characters. It shows an awareness of utf8. If that's not *some* kind of unicode support then please tell me what it is. It seems we're disagreeing on some basic definition of what "unicode support" means.

> Secondly, as I said in my first post to this thread, the characters
> trying to be matched are composite characters, which requires you to
> match both bytes. You can try to using a unicode regexp, but then you
> run into the problem you mention--the regexp engine expects the pre-
> composed, one-byte form...
>
> "ò".scan(/[\303\262]/u) # => []
> "ò".scan(/[\xf2]/u) # => ["\303\262"]

Wow, I never knew that second one could work. Unicode support is actually better than I thought! You learn something new every day.

> ...which is why I said it's more robust to use something like the the
> regexp that Jimmy linked to and I reposted, instead of a unicode
> regexp.

I'm not sure what makes that huge regexp more robust than a simple unicode regexp.

Daniel
on 04.12.2007 13:32
On Dec 4, 3:07 am, Daniel DeLorme <dan...@dan42.com> wrote: > support" means. I guess we were talking about different things then. I never meant to imply that the regexp engine can't match unicode characters (it's "dumb" implementation though; it basically only knows that bytes above 127 can have more bytes following and should be grouped together as candidates for a match; that's slightly simplified, but basically accurate). I, like Charles (and I think most people), was referring to the ability to index into strings by characters, find their lengths in characters, to compose and decompose composite characters, to normalize characters, convert them to other encodings like shift-jis, and other such things. Ruby 1.9 has started adding such support, while ruby 1.8 lacks it. It can be hacked together with regular expressions (e.g., the link Jimmy posted), or even as a real, compiled extension [1], but merely saying that *you* the programmer can implement it using ruby 1.8, is not the same thing as saying ruby 1.8 supports it (just like I could build a python VM in ruby, but that doesn't mean that the ruby interpreter runs python bytecode). Anyhow, I guess it's just a difference of opinion. I don't mind being wrong (happens a lot! ;) I just don't like being accused of spreading FUD about ruby, which to my mind implies malice of forethought rather that simply mistake. [1] http://rubyforge.org/projects/char-encodings/ http://git.bitwi.se/ruby-character-encodings.git/ > actually better than I thought! You learn something new every day. > > > ...which is why I said it's more robust to use something like the the > > regexp that Jimmy linked to and I reposted, instead of a unicode > > regexp. > > I'm not sure what makes that huge regexp more robust than a simple > unicode regexp. > > Daniel Well, I won't claim that you can't get a unicode regexp to match the same. And I only saw that large regexp when it was posted here, so I've not tested it to any great length. 
Interestingly, 1.9 uses this regexp (originally from jcode.rb in stdlib) to classify a string as containing utf-8: '[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]'. My thought was that without knowing all of the minute intricacies of unicode and how ruby strings and regexps work with unicode values (which I don't, and assume the OP doesn't), I think the huge regexp is more likely to Just Work in more cases than a home-brewed unicode regexp. But like I said, that's just an initial conclusion, I don't claim it's absolutely correct.

Regards,
Jordan
on 05.12.2007 02:59
MonkeeSage wrote: > I guess we were talking about different things then. I never meant to > imply that the regexp engine can't match unicode characters Since regular expressions are embedded in the very syntax of ruby just as arrays and hashes, IMHO that qualifies as unicode support. So yeah, it seems like we have a semantic disagreement. :-( > I, like Charles (and I think most people), was referring to the > ability to index into strings by characters, find their lengths in > characters That is certainly *one* way of supporting unicodde but by no means the only way. My belief is that you can do most string manipulations in a way that obviates the need for char indexing & char length, if only you change your mindset from "operating on individual characters" to "operating on the string as a whole". And since regex are a specialized language for string manipulation, they're also a lot faster. It's a little like imperative vs functional programming; if I told you about a programming language that has no variable assignments you might think it's completely broken, and yet that's how functional languages work. > to compose and decompose composite characters, to > normalize characters, convert them to other encodings like shift-jis, > and other such things. Converting encodings is a worthy goal but unrelated to unicode support. As for character [de]composition that would be a very nice thing to have if it was handled automatically (e.g. "a\314\200"=="\303\240") but if the programmer has to worry about it then you might as well leave it to a specialized library. Well, it's not like ruby lets us abstract away composite characters either in 1.8 or 1.9... I never claimed unicode support was 100%, just good enough for most needs. > just a difference of opinion. I don't mind being wrong (happens a > lot! ;) I just don't like being accused of spreading FUD about ruby, > which to my mind implies malice of forethought rather that simply > mistake. 
Yes, that was too harsh on my part. My apologies. Daniel
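The composition point can be illustrated with a sketch that goes beyond anything available in 1.8 or 1.9: it assumes Ruby 2.2 or later, where `String#unicode_normalize` exists. The two strings are the ones from the example above ("a\314\200" and "\303\240"), written with \u escapes.

```ruby
# Assumes Ruby 2.2+ for String#unicode_normalize.
decomposed  = "a\u0300"   # "a" + COMBINING GRAVE ACCENT, bytes "a\314\200"
precomposed = "\u00E0"    # precomposed à, bytes "\303\240"

# Plain comparison looks at the encoded bytes, which differ.
same_bytes = (decomposed == precomposed)

# After normalization the two spellings compare equal.
same_after_nfc = (decomposed.unicode_normalize(:nfc) == precomposed)
same_after_nfd = (precomposed.unicode_normalize(:nfd) == decomposed)

puts same_bytes
puts same_after_nfc
puts same_after_nfd
```

In other words, equality is still not automatic; the programmer has to normalize both sides explicitly before comparing.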
on 05.12.2007 03:04
Daniel DeLorme wrote:
> Heavy compared to what? Once compiled, regex are orders of magnitude
> faster than jumping in and out of ruby interpreted code.

Sorry to beat a dead horse, but I just did an interesting little experiment with 1.9:

    >> str = "abcde"*1000
    >> str.encoding
    => <Encoding:ASCII-8BIT>
    >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
    => 0.010282039642334
    >> str.force_encoding 'utf-8'
    >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
    => 1.29934501647949
    >> arr = str.scan(/./u)
    >> Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real
    => 0.00343608856201172

Indexing into UTF-8 strings is *expensive*.

Daniel
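Anyone wanting to repeat this experiment on a current interpreter could start from the sketch below. It is not the original session: it assumes a modern Ruby, uses a string that actually contains multibyte characters (the ASCII-only "abcde" sample may be handled as a special case by newer interpreters), and the variable names are illustrative. The numbers from the 1.9 development build above should not be expected to carry over.

```ruby
require "benchmark"

# Sample with real multibyte characters (å and é), unlike "abcde".
str = "\u00E5bcd\u00E9" * 1000
arr = str.chars                      # pre-split into an array of characters

n = 1000
t_string = Benchmark.realtime { 10.times { n.times { |i| str[i] } } }
t_array  = Benchmark.realtime { 10.times { n.times { |i| arr[i] } } }

# Both access paths must agree on every character.
same = (0...n).all? { |i| str[i] == arr[i] }

printf("String#[]: %.6fs  Array#[]: %.6fs  same results: %s\n",
       t_string, t_array, same)
```

Splitting once and indexing the array trades memory for constant-time access, which is the effect the `arr[i]` measurement above is showing.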
on 05.12.2007 08:48
Daniel DeLorme wrote: > >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real > => 0.010282039642334 > >> str.force_encoding 'utf-8' > >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real > => 1.29934501647949 > >> arr = str.scan(/./u) > >> Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real > => 0.00343608856201172 > > indexing into UTF-8 strings is *expensive* ...but correct. I'd rather have correct than broken. - Charlie
on 05.12.2007 12:06
On Dec 4, 7:58 pm, Daniel DeLorme <dan...@dan42.com> wrote: > > characters > > That is certainly *one* way of supporting unicodde but by no means the > only way. My belief is that you can do most string manipulations in a > way that obviates the need for char indexing & char length, if only you > change your mindset from "operating on individual characters" to > "operating on the string as a whole". And since regex are a specialized > language for string manipulation, they're also a lot faster. It's a > little like imperative vs functional programming; if I told you about a > programming language that has no variable assignments you might think > it's completely broken, and yet that's how functional languages work. I think we'll just have to agree to disagree. But there is one point... main = do let i_like = "I like " putStrLn $ i_like ++ haskell where haskell = "a functional language" ;) > support was 100%, just good enough for most needs. > > > just a difference of opinion. I don't mind being wrong (happens a > > lot! ;) I just don't like being accused of spreading FUD about ruby, > > which to my mind implies malice of forethought rather that simply > > mistake. > > Yes, that was too harsh on my part. My apologies. No worries. :) I apologize as well for responding by saying you were lying about unicode support; I see that we just have a difference of opinion and were talking past each another. > Daniel Regards, Jordan
on 05.12.2007 22:35
Daniel DeLorme said... > YES! And that's exactyly how it should be. Who is it that spread the > flawed idea that strings are fundamentally made of characters? Are you being ironic? > I'd like > to slap him around a little. Fundamentally, ever since the word "string" > was applied to computing, strings were made of 8-BIT CHARS, not n-bit > characters. If only the creators of C has called that datatype "byte" > instead of "char" it would have saved us so many misunderstandings. And look at the trouble we're having ditching the waterfall method, all because someone misread a paper in the 1700s or thereabouts. You might want to spar with Tim Bray from Sun who presented at RubyConf 2006, where his slides state: "99.99999% of the time, programmers want to deal with characters not bytes. I know of one exception: running a state machine on UTF8-encoded text. This is done by the Expat XML parser." "In 2006, programmers around the world expect that, in modern languages, strings are Unicode and string APIs provide Unicode semantics correctly & efficiently, by default. Otherwise, they perceive this as an offense against their language and their culture. Humanities/computing academics often need to work outside Unicode. Few others do." He reviews his chat here: http://www.tbray.org/ongoing/When/200x/2006/10/22/Unicode-and-Ruby and the slides are here: http://www.tbray.org/talks/rubyconf2006.pdf
on 06.12.2007 01:19
marc wrote: > Daniel DeLorme said... >> MonkeeSage wrote: >>> Everything in ruby is a bytestring. >> YES! And that's exactyly how it should be. Who is it that spread the >> flawed idea that strings are fundamentally made of characters? > > Are you being ironic? Not at all. By "fundamentally" I mean the fundamental, lowest level of representation. If strings were fundamentally made of characters then we wouldn't be able to access individual bytes because that's a lower level than the fundamental level, which is by definition impossible. If you are using UCS2 it makes sense to consider strings as arrays of characters because that's what they are. But UTF8 strings do not follow the characteristics of arrays at all. Each access into the "array" is O(n) rather than O(1). So IMHO treating it as an array of characters is a *very* leaky abstraction. I agree that 99.9% of the time you want to deal with characters, and I believe that in 99% of those cases you would be better served with regex than this pretend "array" disguise. Daniel
on 06.12.2007 02:25
On Dec 5, 6:15 pm, Daniel DeLorme <dan...@dan42.com> wrote:
> representation. If strings were fundamentally made of characters then we
> believe that in 99% of those cases you would be better served with regex
> than this pretend "array" disguise.
>
> Daniel

Here is a micro-benchmark on three common string operations (split, index, length), using bytestrings and unicode regexps versus native utf-8 strings in 1.9.0 (release).

$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]

$ echo && cat bench.rb

#!/usr/bin/ruby19
# -*- coding: ascii -*-

require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions

$KCODE = "u"

$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u

def uni_split
  $target.split($unichr)
end

def reg_split
  $target.split($regchr)
end

def uni_index
  $target.index($unichr)
end

def reg_index
  $target =~ $regchr
end

def uni_chars
  $target.length
end

def reg_chars
  $target.unpack("U*").length
  # this is *a lot* slower
  # $target.scan(/./u).length
end

$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)

$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a-2, b)

$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)

n = 10_000
Benchmark.bm(12) { | x |
  $target.force_encoding("ascii")
  x.report("reg_split") { n.times { reg_split } }
  $target.force_encoding("utf-8")
  x.report("uni_split") { n.times { uni_split } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_index") { n.times { reg_index } }
  $target.force_encoding("utf-8")
  x.report("uni_index") { n.times { uni_index } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_chars") { n.times { reg_chars } }
  $target.force_encoding("utf-8")
  x.report("uni_chars") { n.times { uni_chars } }
}

====

With caches initialized, and 5 prior runs, I got these numbers:

$ ruby19 bench.rb
                  user     system      total        real
reg_split     2.550000   0.010000   2.560000 (  2.799292)
uni_split     1.820000   0.020000   1.840000 (  2.026265)

reg_index     0.040000   0.000000   0.040000 (  0.097672)
uni_index     0.150000   0.000000   0.150000 (  0.202700)

reg_chars     0.790000   0.010000   0.800000 (  0.919995)
uni_chars     0.130000   0.000000   0.130000 (  0.193307)

====

So String#=~ with a bytestring and unicode regexp is faster than String#index by a factor of ~0.5. In the other two cases, the opposite is true.

Ps. BTW, in case there is any confusion, bytestrings aren't going away; you can, as you see above, specify a magic encoding comment to ensure that you have bytestrings by default. You can also explicitly decode from utf-8 back to ascii, and you can get a byte enumerator (or an array, by calling to_a on the enumerator) from String#bytes, and an iterator from #each_byte, regardless of the encoding.

Regards,
Jordan
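As a side note to the P.S. above, the byte-level view of a string can be sketched roughly as follows. This assumes a current Ruby (2.0 or later), where String#bytes returns an Array rather than the enumerator described for 1.9.0, and a UTF-8 source file. The helper name byte_view is made up for illustration and is not from the thread.

```ruby
# Hypothetical helper (not from the thread): summarize the character-level
# and byte-level views of the same string.
def byte_view(str)
  {
    chars:     str.length,          # characters, per the string's encoding
    bytes:     str.bytesize,        # raw byte count, whatever the encoding
    first:     str.bytes.first,     # String#bytes gives integer byte values
    each_byte: str.each_byte.to_a,  # the same values via the #each_byte enumerator
    # Relabeling a copy as binary makes #length count bytes instead of characters.
    binary:    str.dup.force_encoding("ASCII-8BIT").length
  }
end

view = byte_view("!日本語!")
puts "characters: #{view[:chars]}, bytes: #{view[:bytes]}"
```

The bytes themselves never change here; only the label that tells Ruby how to group them into characters does.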
on 06.12.2007 03:32
MonkeeSage wrote:
> Here is a micro-benchmark on three common string operations (split,
> index, length), using bytestrings and unicode regexp, versus native
> utf-8 strings in 1.9.0 (release).

That's nice, but split and index do not operate using integer indexing into the string, so they are rather irrelevant to the topic at hand. They produce the same results in ruby1.8, i.e. uni_split==reg_split and uni_index==reg_index.

I also stated that the point of regex manipulation is to *obviate* the need for methods like index and length. So a more accurate benchmark might be something like:

reg_chars          N/A        N/A        N/A (       N/A)
uni_chars     0.130000   0.000000   0.130000 (  0.193307)

;-)

> Ps. BTW, in case there is any confusion, bytestrings aren't going
> away; you can, as you see above, specify a magic encoding comment to
> ensure that you have bytestrings by default.

Yes, it's still possible to access bytes, but it's not possible to run a utf8 regex on a bytestring if it contains extended characters:

$ ruby1.9 -ve '"abc" =~ /b/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
$ ruby1.9 -ve '"日本語" =~ /本/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
-e:1:in `<main>': character encodings differ (ArgumentError)

And that kinda kills my whole approach.

Daniel
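The mismatch described above can be reproduced with a small sketch. It assumes a current Ruby rather than the 1.9.0 snapshot in the post: there the failure surfaced as ArgumentError, whereas current releases raise Encoding::CompatibilityError. The helper name match_position is hypothetical.

```ruby
# Hypothetical helper: return the match position, or :encoding_mismatch when
# the regexp and the string cannot be matched because their encodings differ.
def match_position(string, pattern)
  string =~ pattern
rescue Encoding::CompatibilityError
  :encoding_mismatch
end

ascii_bytes = "abc".b     # binary string, ASCII-only content
wide_bytes  = "日本語".b   # binary string with bytes above 0x7F

p match_position(ascii_bytes, /b/u)   # ASCII-only bytestrings still match
p match_position(wide_bytes, /本/u)   # extended bytes plus a UTF-8 regexp clash
# Relabeling the same bytes as UTF-8 makes the two sides agree again.
p match_position(wide_bytes.dup.force_encoding("UTF-8"), /本/u)
```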
on 06.12.2007 04:11
On Dec 5, 8:31 pm, Daniel DeLorme <dan...@dan42.com> wrote:
> MonkeeSage wrote:
> > Here is a micro-benchmark on three common string operations (split,
> > index, length), using bytestrings and unicode regexp, versus native
> > utf-8 strings in 1.9.0 (release).
>
> That's nice, but split and index do not operate using integer indexing
> into the string, so they are rather irrelevant to the topic at hand.

Heh, if the topic at hand is only that indexing into a string is slower with native utf-8 strings (don't disagree), then I guess it's irrelevant. ;) Regarding the idea that you can do everything just as efficiently with regexps that you can do with native utf-8 encoding...it seems relevant. In other words, it goes to show a general behavior that is benefited by a native implementation (the same reason we're using native hashes rather than building our own implementations out of arrays of pairs).

> They produce the same results in ruby1.8, i.e. uni_split==reg_split and
> uni_index==reg_index.

Yes. My point was to show how a native implementation of unicode strings affects performance compared to using regular expressions on bytestrings. The behavior should be the same (hence the asserts).

> I also stated that the point of regex manipulation is to *obviate* the
> need for methods like index and length. So a more accurate benchmark
> might be something like:
> reg_chars N/A N/A N/A ( N/A )
> uni_chars 0.130000 0.000000 0.130000 ( 0.193307)
> ;-)

Someone just posted a question today about how to printf("%20s ...", a, ...) when "a" contains unicode (it screws up the alignment since printf only counts byte width, not character width). There is no *elegant* solution in 1.8, regexps or otherwise. There are hackish solutions (I provided one in that thread)...but the need was still there.

Another example is GtkTextView widgets from ruby-gtk2. They deal with utf-8 in their C backend. So all the cursor functions that deal with characters mean utf-8 characters, not bytestrings. So without kludges, stuff doesn't always work right.

> ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
> -e:1:in `<main>': character encodings differ (ArgumentError)
>
> And that kinda kills my whole approach.

You can't use mixed encodings (not just in regexps, not anywhere). You'd have to use a proposed-but-not-implemented-in-1.9.0-release command line switch to set your encoding to ascii (or whatever), or else use a magic comment [1] like I did above. That, or explicitly encode both objects in the same encoding.

> Daniel

Regards,
Jordan

[1] http://www.ruby-forum.com/topic/127831
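The last option mentioned above, putting both objects in the same encoding, might look roughly like this on a current Ruby. The helper names are hypothetical, and ISO-8859-1 is only an example of a second encoding; the thread itself does not use it.

```ruby
# Hypothetical helper: true when Ruby considers the two strings safe to combine.
def combinable?(a, b)
  !Encoding.compatible?(a, b).nil?
end

# Hypothetical helper: transcode both sides to UTF-8 before concatenating.
def join_as_utf8(a, b)
  a.encode("UTF-8") + b.encode("UTF-8")
end

latin1 = "ni\xF1o".dup.force_encoding("ISO-8859-1")  # "niño" as Latin-1 bytes
utf8   = "ni\u00F1o"                                  # "niño" as UTF-8 bytes

p combinable?(latin1, utf8)   # different encodings, both with non-ASCII content
p join_as_utf8(latin1, utf8)  # transcoding first lets them be combined
```

Note the difference between the two calls: force_encoding only relabels existing bytes, while encode rewrites the bytes into the target encoding.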
on 06.12.2007 06:31
MonkeeSage wrote:
> Heh, if the topic at hand is only that indexing into a string is
> slower with native utf-8 strings (don't disagree), then I guess it's
> irrelevant. ;) Regarding the idea that you can do everything just as
> efficiently with regexps that you can do with native utf-8
> encoding...it seems relevant.

How so? These methods work just as well in ruby1.8 which does *not* have native utf8 encoding embedded in the strings. Of course, comparing a string with a string is more efficient than comparing a string with a regexp, but that is irrelevant of whether the string has "native" utf8 encoding or not:

$ ruby1.8 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.225839138031006
0.304145097732544
0.313494920730591

$ ruby1.9 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.183344841003418
0.255104064941406
0.263553857803345

1.9 is more performant (one would hope so!) but the performance ratio between string comparison and regex comparison does not seem affected by the encoding at all.

> Someone just posted a question today about how to printf("%20s ...",
> a, ...) when "a" contains unicode (it screws up the alignment since
> printf only counts byte width, not character width). There is no
> *elegant* solution in 1.8., regexps or otherwise.

It's not perfect in 1.9 either. "%20s" % "日本語" results in a string of 20 characters... that uses 23 columns of terminal space because the font for Japanese uses double width. In other words neither bytes nor characters have an intrinsic "width" :-/

Daniel
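The column problem described above can be illustrated with a deliberately crude sketch for a current Ruby (2.4 or later). It assumes that Han, Hiragana and Katakana characters occupy two terminal columns and everything else one; real display width depends on the terminal and font, so this is a heuristic rather than a full solution. The names WIDE_CHAR, display_width and rjust_columns are invented for this example.

```ruby
# Assumption: these scripts render double-width; everything else single-width.
WIDE_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}]/

# Estimate how many terminal columns a string occupies under that assumption.
def display_width(str)
  str.each_char.sum { |ch| ch.match?(WIDE_CHAR) ? 2 : 1 }
end

# Right-justify by estimated columns instead of by character count.
def rjust_columns(str, columns)
  padding = [columns - display_width(str), 0].max
  (" " * padding) + str
end

by_chars   = "%20s" % "日本語"            # padded to 20 characters
by_columns = rjust_columns("日本語", 20)  # padded to an estimated 20 columns
puts display_width(by_chars)
puts display_width(by_columns)
```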
on 06.12.2007 08:06
On Dec 5, 11:29 pm, Daniel DeLorme <dan...@dan42.com> wrote:
> regexp, but that is irrelevant of whether the string has "native" utf8
>
> between string comparison and regex comparison does not seem affected by
> the encoding at all.

Ok, I wasn't being clear. What I was trying to say is, yes the methods perform the same on bytestrings -- whether using regex or standard string operations. The problem is in their behavior, not performance considered in the abstract. In 1.9, using ascii default encoding, this bytestring acts just like 1.8:

"日本語".index("本") #=> 3

That's fine! Faster than a regexp, no problems. That is, unless I want to know where the character match is (for whatever reason -- take work-necessitated interoperability with some software that required it). For that I'd have to do something hackish and likely fragile. It's possible, but not desirable; however, being able to do this gains performance and ruby already does all the work for you:

"日本語".force_encoding("utf-8").index("本".force_encoding("utf-8")) #=> 1

But it's obviously not better to type! But that's because I'm using ascii default encoding. There is, as I understand it, going to be a way to specify default encoding from the command-line, and probably from within ruby, rather than just the magic comments and String#force_encoding; so this extra typing is incidental and will go away. Actually, it goes away right now if you use utf-8 default and use the byte api to get at the underlying bytestrings.

> Daniel

It works as expected in 1.9, you just have to set the right encoding:

printf("%20s\n".force_encoding("utf-8"), "ni\xc3\xb1o".force_encoding("utf-8"))
#=>                 niño
printf("%20s\n", "nino")
#=>                 nino

In any case, I just don't think there is any reason to dislike the new string api. It adds another tool to the toolbox.
It doesn't make sense to use it always, everywhere (like trying to make data that naturally has the shape of an array, fit into a hash); but I see no reason to try and cobble it together ourselves either (like building a hash api from arrays ourselves). And with that, I'm going to sleep. Have to think more on it tomorrow. Peace, Jordan
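The character-versus-byte behavior discussed in this post can be sketched on a current Ruby (2.0 or later, UTF-8 source file), where String#b gives a binary copy without the force_encoding typing. The helper names below are hypothetical.

```ruby
# Hypothetical helper: position of needle counted in characters and in bytes.
def char_and_byte_index(haystack, needle)
  [haystack.index(needle), haystack.b.index(needle.b)]
end

# Hypothetical helper: right-justify to 20 using whatever unit the
# string's encoding implies (characters for UTF-8, bytes for binary).
def pad20(str)
  "%20s" % str
end

p char_and_byte_index("日本語", "本")  # character offset, then byte offset

padded_chars = pad20("niño")    # UTF-8: width is counted in characters
padded_bytes = pad20("niño".b)  # binary: width is counted in bytes
puts padded_chars.length
puts padded_bytes.bytesize
```

Both padded strings measure 20 in their own unit, but the binary one is one space short when its bytes are later displayed as UTF-8, which is the misalignment described for printf.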
on 07.12.2007 11:06
> Re: Unicode in Regex > Posted by Jordan Callicoat (monkeesage) on 03.12.2007 02:50 > > This seems to work... > > $KCODE = "UTF8" > p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "J�sp...it works" > # => 0 > ... > However, it looks to me like it would be more robust to use a slightly > modified version of UTF8REGEX (found in the link Jimmy posted > above)... > > UTF8REGEX = /\A(?: > [a-zA-Z\.\-\'\ ] > | [\xC2-\xDF][\x80-\xBF] > | \xE0[\xA0-\xBF][\x80-\xBF] > | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} > | \xED[\x80-\x9F][\x80-\xBF] > | \xF0[\x90-\xBF][\x80-\xBF]{2} > | [\xF1-\xF3][\x80-\xBF]{3} > | \xF4[\x80-\x8F][\x80-\xBF]{2} > )*\z/mnx Just to avoid confusion over the meaning of 'UTF8' in UTF8REGEX: the n option sets the encoding of UTF8REGEX to none! Cheers, j. k.
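For comparison with the byte-range patterns above, here is a sketch of how the name validation from the start of the thread could be written on a Ruby with encoding-aware regexps (1.9 and later), using a Unicode property class instead of \x ranges. This is not what anyone in the thread proposed; NAME_FORMAT and valid_name? are names invented here, and the set of allowed punctuation simply mirrors the original pattern.

```ruby
# Assumed policy (mirrors the original pattern): any Unicode letter, plus
# period, apostrophe, hyphen and space. \A and \z anchor the whole string,
# unlike ^ and $, which anchor individual lines.
NAME_FORMAT = /\A[\p{L}.'\- ]*\z/

# Hypothetical helper: true when every character of name is allowed.
def valid_name?(name)
  NAME_FORMAT.match?(name)
end

p valid_name?("Göran")
p valid_name?("Muños")
p valid_name?("abc123")
```

In a Rails model the same constant could be passed as the format for a validation, provided the submitted strings arrive tagged as UTF-8.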