Ruby 1.9.1 - invalid multibyte escape: // (RegexpError)

Iñaki Baz Castillo (Guest)
on 2009-04-04 21:42
(Received via mailing list)
Hi, I'm using the Treetop parser, and a grammar that worked in Ruby 1.8
fails in 1.9.1:

  ~# ruby1.8 -e "Regexp.new('[\xC0-\xDF]')"
  OK

  ~# ruby1.9 -e "Regexp.new('[\xC0-\xDF]')"
  -e:1:in `initialize': invalid multibyte escape: /[\xC0-\xDF]/ (RegexpError)

I've found the following text about differences between 1.8 and 1.9:

"It is more rigorous than 1.8 when it comes to detecting invalid code.
For example, 1.8 accepts /[^\x00-\xa0]/u, while 1.9 complains of invalid
multibyte escape"

Ok, so how should I write the above Regexp to work on 1.9.1?

Thanks a lot.


--


Iñaki Baz Castillo <ibc@aliax.net>
Eric Hodel (Guest)
on 2009-04-06 07:43
(Received via mailing list)
On Apr 4, 2009, at 12:40, Iñaki Baz Castillo wrote:

>
> I've found the following text about differences between 1.8 and 1.9:
>
> "It is more rigorous than 1.8 when it comes to detecting invalid code.
> For example, 1.8 accepts /[^\x00-\xa0]/u, while 1.9 complains of invalid
> multibyte escape"
>
> Ok, so how should I write the above Regexp to work on 1.9.1?

Regexp.new '[\xC0-\xDF]', nil, 'n'
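
A note for readers on later Rubies: as far as I know, the three-argument form shown above was later deprecated and then dropped, so it may raise ArgumentError on a current interpreter. Below is a minimal sketch of the same idea. It assumes Regexp::NOENCODING (the integer counterpart of the 'n' flag) and the /n literal modifier are available; the variable names are made up for illustration.

```ruby
# Sketch: build a byte-oriented (ASCII-8BIT) regexp without a third argument.
# Regexp::NOENCODING plays the role of the 'n' flag.
from_string = Regexp.new('[\xC0-\xDF]', Regexp::NOENCODING)

# The same pattern written as a literal with the /n modifier.
from_literal = /[\xC0-\xDF]/n

# "\xC3\xB1" is the UTF-8 byte pair for "ñ"; String#b returns a binary
# (ASCII-8BIT) copy, so the regexp and the subject agree on encoding.
subject = "\xC3\xB1".b

string_is_binary  = from_string.encoding == Encoding::ASCII_8BIT
literal_is_binary = from_literal.encoding == Encoding::ASCII_8BIT
lead_byte_index   = from_string =~ subject      # position of the first match
ascii_only_index  = from_literal =~ "abc".b     # no byte in C0..DF here

p string_is_binary, literal_is_binary, lead_byte_index, ascii_only_index
```

Matching a binary regexp like this against a UTF-8 string that contains non-ASCII characters can raise Encoding::CompatibilityError, which is why the subject is converted with String#b first.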
Iñaki Baz Castillo (Guest)
on 2009-04-06 10:04
(Received via mailing list)
2009/4/6 Eric Hodel <drbrain@segment7.net>:
>> Ok, so how should I write the above Regexp to work on 1.9.1?
>
> Regexp.new '[\xC0-\xDF]', nil, 'n'

Great! Thanks a lot.
Iñaki Baz Castillo (Guest)
on 2009-04-06 10:06
(Received via mailing list)
2009/4/6 Iñaki Baz Castillo <ibc@aliax.net>:
> 2009/4/6 Eric Hodel <drbrain@segment7.net>:
>>> Ok, so how should I write the above Regexp to work on 1.9.1?
>>
>> Regexp.new '[\xC0-\xDF]', nil, 'n'
>
> Great! Thanks a lot.

However, I don't understand these parameters for Regexp.new.
I read:
  http://www.ruby-doc.org/core-1.9/classes/Regexp.html
The third parameter you use ('n') doesn't appear in the doc. Why?

Thanks a lot.
Brian Candler (candlerb)
on 2009-04-06 16:36
Iñaki Baz Castillo wrote:
> However, I don't understand these parameters for Regexp.new.
> I read:
>   http://www.ruby-doc.org/core-1.9/classes/Regexp.html
> The third parameter you use ('n') doesn't appear in the doc. Why?

All the new stuff to do with String and encodings in ruby 1.9 is
undocumented.

(At least, it's not documented within Ruby itself. You may be able to
purchase a book which has some reverse-engineered documentation)

If you care about stability or documentation, my own advice is to stick
with 1.8 - preferably 1.8.6.
James Gray (bbazzarrakk)
on 2009-04-06 16:48
(Received via mailing list)
On Apr 6, 2009, at 9:36 AM, Brian Candler wrote:

> Iñaki Baz Castillo wrote:
>> However, I don't understand these parameters for Regexp.new.
>> I read:
>>  http://www.ruby-doc.org/core-1.9/classes/Regexp.html
>> The third parameter you use ('n') doesn't appear in the doc. Why?
>
> All the new stuff to do with String and encodings in ruby 1.9 is
> undocumented.

I've got the majority of the new functionality covered in my m17n
series now:

http://blog.grayproductions.net/articles/understanding_m17n

I expect to have the minor side topics I'm still missing covered in
the next few weeks.

James Edward Gray II
Jeremy McAnally (Guest)
on 2009-04-06 17:06
(Received via mailing list)
I'm working on documenting some of this stuff when I have time (always
the magic words, eh? :-/).  I ran dcov on the whole of Ruby core last
week (results: http://jeremymcanally.com/coverage.html ; they're a little
deceiving, since methods like to_yaml are, I think, actually included
from elsewhere.  I'll have to look...), and I'm currently setting up
some tasks for myself to knock things out.

I might set up a Lighthouse for it or something if other people want to
get involved.

--Jeremy

On Mon, Apr 6, 2009 at 9:36 AM, Brian Candler <b.candler@pobox.com>
wrote:
> purchase a book which has some reverse-engineered documentation)
>
> If you care about stability or documentation, my own advice is to stick
> with 1.8 - preferably 1.8.6.



--
http://jeremymcanally.com/
http://entp.com/
http://omgbloglol.com

My books:
http://manning.com/mcanally/
http://humblelittlerubybook.com/ (FREE!)
Iñaki Baz Castillo (Guest)
on 2009-04-06 17:28
(Received via mailing list)
2009/4/6 Brian Candler <b.candler@pobox.com>:
> purchase a book which has some reverse-engineered documentation)
Regexp.new in Ruby 1.9 is documented at:
  http://www.ruby-doc.org/core-1.9/classes/Regexp.html
but the number of parameters listed there doesn't match reality.

Does that make sense? Wasn't that documentation generated with RDoc?
Brian Candler (candlerb)
on 2009-04-06 21:23
Iñaki Baz Castillo wrote:
> 2009/4/6 Brian Candler <b.candler@pobox.com>:
>> purchase a book which has some reverse-engineered documentation)
> Regexp.new in Ruby 1.9 is documented at:
>   http://www.ruby-doc.org/core-1.9/classes/Regexp.html
> but the number of parameters listed there doesn't match reality.
>
> Does that make sense? Wasn't that documentation generated with RDoc?

The rdoc is only as good as the comments in the source code.
Brian Candler (candlerb)
on 2009-04-08 10:02
James Gray wrote:
> I've got the majority of the new functionality covered in my m17n
> series now:
>
> http://blog.grayproductions.net/articles/understanding_m17n
>
> I expect to have the minor side topics I'm still missing covered in
> the next few weeks.

This is a good start, but I think it just scratches the surface.

Questions which immediately spring to mind:

* What is the nature of the "compatible" relationship? Does A being
compatible with B imply B being compatible with A? It's not commutative,
at least in what it returns:

irb(main):002:0> a = "abc".force_encoding("UTF-8")
=> "abc"
irb(main):003:0> b = "def".force_encoding("ISO-8859-1")
=> "def"
irb(main):004:0> Encoding.compatible?(a,b)
=> #<Encoding:UTF-8>
irb(main):005:0> Encoding.compatible?(b,a)
=> #<Encoding:ISO-8859-1>

Also, it's not encodings which are compatible, but actual strings. Two
strings may or may not be compatible, dependent not just on their
encoding, but on their actual content at that instant.

irb(main):006:0> a = "abc\xff".force_encoding("UTF-8")
=> "abc\xFF"
irb(main):007:0> b = "def\xff".force_encoding("ISO-8859-1")
=> "def\xFF"
irb(main):008:0> Encoding.compatible?(a,b)
=> nil

* What about string literals which include escape sequences like \u?
This seems to override the source encoding rule.

$ ruby19
# encoding: ISO-8859-1
puts "abc".encoding
puts "abc\u1234".encoding
^D
ISO-8859-1
UTF-8

* What encoding is chosen for regexp literals? (The rules seem to differ
from those for string literals.) What about string literals which include
#{interpolation}? What about regexp literals which include
#{interpolation}?

* What source encoding and external encoding is used in irb?

* I think it will be worth explaining what you need to do to handle
binary data (using "rb" and "wb", the ASCII-8BIT encoding, how to set
external encoding for STDIN, the fact that read() and gets() return
different encodings for the same data...)

* What actually happens if you use string operations on two strings with
different encodings? e.g. str1 == str2, str1 + str2, str1 << str2? What
about indexing a hash with two strings which are identical byte
sequences but different encodings?

* What do C extension writers need to know about strings? It seems at
the moment there is some magic hidden state (ENC_CODERANGE_7BIT) which
you must remember to update whenever you create or modify a string, and
if you don't, things break badly.
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...
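
The mixed-encoding question in the list above can at least be probed empirically. Here is a rough sketch, assuming a reasonably recent Ruby (String#b needs 2.0 or later). The variable names are purely illustrative, and the results show observed behavior, not a documented contract.

```ruby
# Identical byte sequences, tagged with two different encodings.
ascii_a = "abc".b.force_encoding("ISO-8859-1")
ascii_b = "abc".b.force_encoding("ISO-8859-15")
high_a  = "caf\xE9".b.force_encoding("ISO-8859-1")
high_b  = "caf\xE9".b.force_encoding("ISO-8859-15")

# ASCII-only content: the encoding tag does not get in the way of
# comparison, Hash lookup, or concatenation.
ascii_equal  = (ascii_a == ascii_b)
ascii_lookup = { ascii_a => 1 }[ascii_b]
ascii_concat = ascii_a + ascii_b

# Non-ASCII content: same bytes, yet the strings are treated as different.
high_equal  = (high_a == high_b)
high_lookup = { high_a => 1 }[high_b]

# Concatenation of incompatible strings raises instead of guessing.
concat_error = begin
  high_a + high_b
  nil
rescue Encoding::CompatibilityError => e
  e
end

p ascii_equal, ascii_lookup, ascii_concat.encoding
p high_equal, high_lookup, concat_error.class
```

So whether two strings interoperate depends on their content at that moment and not only on their encoding tags, which matches the Encoding.compatible? observations earlier in this post.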

Regards,

Brian.
James Gray (bbazzarrakk)
on 2009-04-08 16:50
(Received via mailing list)
On Apr 8, 2009, at 3:03 AM, Brian Candler wrote:

> * What about string literals which include escape sequences like \u?
> This seems to override the source encoding rule.

I plan to cover this in the next article.

> * What encoding is chosen for regexp literals? (Seems to be different
> rules to string literals). What about string literals which include
> #{interpolation}? What about regexp literals which include
> #{interpolation}?

I'm going to cover this too.

> * I think it will be worth explaining what you need to do to handle
> binary data (using "rb" and "wb", the ASCII-8BIT encoding, how to set
> external encoding for STDIN, the fact that read() and gets() return
> different encodings for the same data...)

Planned for the next article.
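
In the meantime, the read()/gets() difference mentioned in the quoted question can be seen with a small experiment. This is only a sketch: it assumes Tempfile.create (Ruby 2.1 or later) to stay self-contained, and the `encodings` hash and its keys are invented for the demonstration.

```ruby
require "tempfile"

encodings = {}

Tempfile.create("m17n-demo") do |tmp|
  tmp.binmode
  tmp.write("caf\xC3\xA9\n".b)   # "café\n" written as raw UTF-8 bytes
  tmp.flush

  # Text mode with an explicit external encoding.
  File.open(tmp.path, "r:UTF-8") do |io|
    encodings[:read_with_length] = io.read(3).encoding  # fixed-length read
    io.rewind
    encodings[:gets] = io.gets.encoding                 # line-oriented read
  end

  # Binary mode: everything comes back tagged as ASCII-8BIT.
  File.open(tmp.path, "rb") do |io|
    encodings[:binary_read] = io.read.encoding
  end
end

encodings.each { |how, enc| puts "#{how}: #{enc.name}" }
```

A read with an explicit length returns raw bytes even on a handle opened with an external encoding, while gets tags its result with that encoding. For binary data, opening with "rb" keeps both consistent.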

> * What actually happens if you use string operations on two strings
> with different encodings? e.g. str1 == str2, str1 + str2, str1 << str2?
> What about indexing a hash with two strings which are identical byte
> sequences but different encodings?

I feel I gave a much better strategy, one that saves you from worrying
about such things.  However, that article did link to a detailed
explanation.

James Edward Gray II
James Gray (bbazzarrakk)
on 2009-04-15 20:32
(Received via mailing list)
On Apr 8, 2009, at 9:47 AM, James Gray wrote:

>> #{interpolation}?
>
> I'm going to cover this too.
>
>> * I think it will be worth explaining what you need to do to handle
>> binary data (using "rb" and "wb", the ASCII-8BIT encoding, how to set
>> external encoding for STDIN, the fact that read() and gets() return
>> different encodings for the same data...)
>
> Planned for the next article.

I've added a new post to my m17n series covering all of the above and
more:

http://blog.grayproductions.net/articles/miscellan...

James Edward Gray II