How do I set the encoding on a regexp?

pedz · February 23, 2010, 6:05pm

Title pretty much says it all. Here is a small sample program:

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

s = "string"
puts s.encoding
r = Regexp.new(s)
puts r.encoding

Here is the output:

UTF-8
US-ASCII

I was expecting both to be set to UTF-8. There is no force_encoding
method for Regexp.

If I later try to use it on strings of type UTF-8, it can throw an
exception.

How is this supposed to be handled?

Thanks,
Perry

pedz · February 24, 2010, 6:54pm

I was expecting both to be set to UTF-8. There is no force_encoding
method for Regexp.

If I later try to use it on strings of type UTF-8, it can throw an
exception.

Do you have an example of this? It might be a bug.

I did notice that

Regexp.new("Café").encoding

keeps it in UTF-8

so maybe it’s optimizing it and when it doesn’t “have to be” UTF-8 it is
leaving it as ASCII?

-r

pedz · February 24, 2010, 7:22pm

Roger P. wrote:

I was expecting both to be set to UTF-8. There is no force_encoding
method for Regexp.

If I later try to use it on strings of type UTF-8, it can throw an
exception.

Do you have an example of this? It might be a bug.

I did notice that

Regexp.new("Café").encoding

keeps it in UTF-8

so maybe it’s optimizing it and when it doesn’t “have to be” UTF-8 it is
leaving it as ASCII?

I’m not clear what you mean by an example other than what I put in the
original note.

I think I’m going to open a bug report – it might not be a bug but I
sure am confused. The “Pick Axe” book describes a third argument but I
can’t get that to work either. “ri” for Ruby 1.9.1 does not describe
the third argument at all – but it does seem to exist at least.

It appears as if, as you pointed out, if the input string happens to be
ASCII, then the regexp encoding is ascii and there doesn’t seem to be
anything you can do about it.

I’m testing on 1.9.1 p243.

But, due to another discussion thread, I think I want to be in 8 bit
binary anyway in my case. I’m not 100% positive my input is UTF-8. It’s
supposed to be but I can’t really trust it.

Thanks
Perry
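
For input that is supposed to be UTF-8 but cannot be trusted, one possible approach is sketched below. The helper name `safe_utf8` and the sample bytes are made up for illustration, and `String#b` needs Ruby 2.0 or later, which postdates this thread. The idea is to tag the bytes as UTF-8 only when they really are valid UTF-8, and otherwise keep them as 8-bit binary.

```ruby
# Hypothetical helper (not from the thread): tag bytes as UTF-8 only if they
# are valid UTF-8; otherwise fall back to 8-bit binary (ASCII-8BIT).
def safe_utf8(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : bytes.dup.force_encoding(Encoding::BINARY)
end

good = safe_utf8("caf\xC3\xA9 string".b)  # valid UTF-8 bytes
bad  = safe_utf8("caf\xE9 string".b)      # Latin-1 byte, not valid UTF-8

puts good.encoding  # UTF-8
puts bad.encoding   # the 8-bit binary encoding

# An ASCII-only, non-fixed regexp can be matched against either string.
r = Regexp.new("string")
puts r =~ good  # 5 (character index)
puts r =~ bad   # 5 (byte index)
```

Because the regexp contains only ASCII and has no fixed encoding, it adapts to whichever ASCII-compatible encoding the target string carries.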

pedz · February 24, 2010, 7:36pm

Typo fix:

Regexp.new(/foo/u).encoding # => UTF-8
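
A short sketch of what the `u` flag changes, for anyone following along. The variable names are illustrative only, and the results in the comments are what current MRI reports; 1.9.1 may differ.

```ruby
plain = /foo/    # ASCII-only literal, no encoding flag
fixed = /foo/u   # same pattern, but pinned to UTF-8

puts plain.encoding          # US-ASCII
puts plain.fixed_encoding?   # false
puts fixed.encoding          # UTF-8
puts fixed.fixed_encoding?   # true

# Regexp.new given a Regexp copies its encoding along with the pattern.
copy = Regexp.new(fixed)
puts copy.encoding           # UTF-8
```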

pedz · February 24, 2010, 7:58pm

If I later try to use it on strings of type UTF-8, it can throw an
exception.
I’m not clear what you mean by an example other than what I put in the
original note.

Do you have a small example (like your original) that throws an
exception where you “use it on strings later of type UTF-8” and it
throws an exception?

-r

pedz · February 24, 2010, 8:35pm

Perry S. wrote:

r = Regexp.new(s)

Try this:

r = Regexp.new(s,16)

-David

pedz · February 24, 2010, 7:36pm

If you want to preserve the UTF-8 encoding you can do:

Regexp.new(/foo/u).encoing # => UTF-8

pedz · February 24, 2010, 10:30pm

Perry S. wrote:

I think I’m going to open a bug report – it might not be a bug but I
sure am confused.

It’s not a bug(*), and it sure is confusing. My own attempt to document
Ruby 1.9’s encoding rules, which is woefully incomplete but covers about
200 different cases, is at

github.com/candlerb/string19/blob/master/string19.rb

#!/usr/bin/env ruby19
# encoding: UTF-8
# This document is Copyright (C) Brian Candler 2009 and released under a
# Creative Commons Attribution-NonCommercial 3.0 Unported License.

############# CONTENTS ###################

# -1. PREAMBLE
#  0. INTRODUCTION
#  1. ENCODINGS
#  2. PROPERTIES OF ENCODINGS
#  3. STRING, FILE AND REGEXP ENCODINGS
#  4. VALID AND FIXED ENCODINGS
#  5. COMPATIBLE OBJECTS
#  6. STRING CONCATENATION
#  7. THE BINARY / ASCII-8BIT ENCODING
#  8. SINGLE CHARACTERS
#  9. EQUALITY AND COLLATION
# 10. HASH AND EQL?
# 11. UPPER AND LOWER CASE


What you’ve observed is described in section 3.3.

Basically, a Regexp which contains only ASCII characters is given an
encoding of US-ASCII regardless of the original string’s encoding (this
is different to Strings, which might have an encoding of say UTF-8 but
have the ascii_only? property true if they contain only ASCII
characters).
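
The difference between the two cases can be seen side by side in a small sketch like this one. The strings are explicitly encoded to UTF-8 so the result does not depend on how the script is run.

```ruby
s = "string".encode(Encoding::UTF_8)

puts s.encoding      # UTF-8: the String keeps its encoding...
puts s.ascii_only?   # true:  ...and merely reports that it is ASCII-only

puts Regexp.new(s).encoding            # US-ASCII: the Regexp drops to US-ASCII
puts Regexp.new("caf\u00E9").encoding  # UTF-8: a non-ASCII pattern keeps UTF-8
```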

However there is a hidden “fixed_encoding” property you can set on a
Regexp:

r1 = Regexp.new("string")
=> /string/
r2 = Regexp.new("string", Regexp::FIXEDENCODING)
=> /string/
r1.encoding
=> #<Encoding:US-ASCII>
r2.encoding
=> #<Encoding:UTF-8>
r1.fixed_encoding?
=> false
r2.fixed_encoding?
=> true

I say it’s a “hidden” property because the flag isn’t revealed if you
use inspect or to_s (unlike the //m, //i and //x properties)

r1.to_s
=> "(?-mix:string)"
r2.to_s
=> "(?-mix:string)"
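
Although `to_s` and `inspect` hide the flag, it can still be read back. The sketch below assumes a Ruby that defines `Regexp::FIXEDENCODING` and exposes that bit through `Regexp#options`, which is how current MRI behaves; 1.9.1 may not.

```ruby
src = "string".encode(Encoding::UTF_8)
r1 = Regexp.new(src)
r2 = Regexp.new(src, Regexp::FIXEDENCODING)

# The string forms are identical, so they cannot tell the two apart.
puts(r1.to_s == r2.to_s)                              # true

# The options bitmask and fixed_encoding? do reveal the difference.
puts((r1.options & Regexp::FIXEDENCODING).zero?)      # true
puts((r2.options & Regexp::FIXEDENCODING).zero?)      # false
puts r2.fixed_encoding?                               # true
```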

HTH,

Brian.

(*) Except in as much as the entire Encoding nonsense in ruby 1.9 is one
enormous bug

pedz · February 24, 2010, 10:42pm

Perry,

In 1.9 there is only one optional parameter.

You can force the encoding of the string parameter (if needed)
AND also pass the options parameter.

Try this:

#!/usr/bin/env ruby

s = "string"
puts s.encoding
r = Regexp.new(s.encode("utf-8"), Regexp::ENC_UTF8)
puts r.encoding

Here is the output:

US-ASCII
UTF-8

-David

pedz · February 24, 2010, 11:32pm

Hi Brian and David,

Thanks. I’m doing more experimenting and I’m also looking at the source
code. I need to drag down the latest. I’m looking at 1.9.1 p243 right
now.

Regexp.new has a third optional argument – it is sorta described in the
Pick Axe book but the code looks wrong. It can be either ‘n’ or ‘xN’
where x can be anything. Perhaps that is gone in the latest code.

But the “fixed encoding” is a key part of the puzzle I was missing.
Also, David, I had not bumped into the ENC_UTF8 constant yet. There are
quite a few constants, and one of them (like the 16 pointed out by David)
is a flag to make the encoding “fixed”.

The latest code that David posted answers exactly what my original
question was. Thanks!

pedz · February 24, 2010, 11:55pm

Perry S. wrote:

But the “fixed encoding” is a key part of the puzzle I was missing.
Also, David, I had not bumped into the ENC_UTF8 constant yet. There are
quite a few constants, and one of them (like the 16 pointed out by David)
is a flag to make the encoding “fixed”.

16 is just Regexp::FIXEDENCODING

irb(main):001:0> Regexp::FIXEDENCODING
=> 16

In the 1.9.2 I have here (r24186, 2009-07-18) there is no
Regexp::ENC_UTF8, so it must be relatively new.

irb(main):002:0> Regexp::ENC_UTF8
NameError: uninitialized constant Regexp::ENC_UTF8
from (irb):2
from /usr/local/bin/irb192:12:in `<main>'
irb(main):003:0> Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE, :FIXEDENCODING]

As for the third arg to Regexp.new, I have no idea. Documentation is not
Ruby’s strong point at the best of times, but it’s nonexistent for the
encoding stuff.

pedz · February 25, 2010, 12:29am

My bad.

I was running 1.9.1, which had no FIXEDENCODING.

Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE, :ONCE, :ENC_NONE, :ENC_EUC,
:ENC_SJIS, :ENC_UTF8]

So things have changed since 1.9.1

If you are running 1.9.2 then use FIXEDENCODING and you should be fine.

I THINK that what you are saying with FIXEDENCODING is NOT to revert
back to something like ASCII.

BTW in 1.9.1

Regexp::ENC_EUC
=> 16

Regexp::ENC_SJIS
=> 16

Regexp::ENC_UTF8
=> 16
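
Since the set of constants differs between builds, a version-tolerant sketch might look like the following. `FIXED_FLAG` and `fixed_regexp` are hypothetical names, and the fallback to the raw value 16 rests only on what is reported in this thread, not on any documented guarantee.

```ruby
# Use the named constant when this Ruby defines it; otherwise fall back to the
# raw value 16 that the thread reports for the 1.9.x builds discussed here.
FIXED_FLAG = Regexp.const_defined?(:FIXEDENCODING) ? Regexp::FIXEDENCODING : 16

# Hypothetical helper: build a regexp that keeps the source string's encoding.
def fixed_regexp(source)
  Regexp.new(source, FIXED_FLAG)
end

r = fixed_regexp("string".encode(Encoding::UTF_8))
puts r.encoding          # UTF-8
puts r.fixed_encoding?   # true
```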

pedz · February 26, 2010, 4:03pm

On 24-Feb-10, at 6:22 PM, David S. wrote:

If you are running 1.9.2 then use FIXEDENCODING and you should be
fine.

I THINK that what you are saying with FIXEDENCODING is NOT to revert
back to something like ASCII.

This has been really helpful, but I’m still having difficulties. I’m
running 1.9.1p376 and:

Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE]

But if I use 16 rather than FIXEDENCODING it works as in the examples
in this thread.

Does anyone know what’s going on here? I used to have a pretty good
handle on encodings. This Ruby encoding stuff is something I’ve been
struggling with for 6 months and I think all that I’ve managed to do
is completely corrupt my understanding of encoding. It’s starting to
look like magic. I know that a bunch of things changed between
1.9.1p243 and 1.9.1p376, but, since I think that what I ‘know’ about
encoding might be completely delusional at this point, I suppose I
don’t really know.

Brian, your string19.rb (candlerb/string19 on GitHub) is something
else! I’m laughing with a slightly hysterical edge.

Cheers,
Bob

On Wed, Feb 24, 2010 at 4:55 PM, Brian C. wrote:

16 is just Regexp::FIXEDENCODING
from /usr/local/bin/irb192:12:in `<main>'

–
David N. Springer
Eau Claire, WI

Bob H.
Recursive Design Inc.
http://www.recursive.ca/

pedz · February 26, 2010, 5:34pm

Bob H. wrote:

On 24-Feb-10, at 6:22 PM, David S. wrote:

Regexp.constants
=> [:IGNORECASE, :EXTENDED, :MULTILINE]

It might help you to know that your constants are the same as mine. I
don’t know how David got his.

Unfortunately, I still have not gotten back to my investigation of this.
Looking at the code in re.c helped me a bit.

Aside from that, I think we are all struggling with this. I’m hoping
that there are a few “bugs” in the code… i.e. Matz has a clear idea of
how things should work but there are just a few mistakes that really
hamper our understanding.

HTH,
Perry

pedz · February 26, 2010, 10:04pm

Bob H. wrote:

Brian, your string19.rb (candlerb/string19 on GitHub) is something
else! I’m laughing with a slightly hysterical edge.

One has to laugh or cry. As best I could, I factored out my opinion of
all this into a separate file:
http://github.com/candlerb/string19/raw/47b0cba0a2047eca0612b4e24a540f011cf2cac3/soapbox.rb

pedz · February 26, 2010, 10:55pm

One has to laugh or cry. As best I could, I factored out my opinion of
all this into a separate file:
http://github.com/candlerb/string19/raw/47b0cba0a2047eca0612b4e24a540f011cf2cac3/soapbox.rb

You should post it to core…
-r

pedz · February 24, 2010, 10:02pm

Roger P. wrote:

If I later try to use it on strings of type UTF-8, it can throw an
exception.
I’m not clear what you mean by an example other than what I put in the
original note.

Do you have a small example (like your original) that throws an
exception where you “use it on strings later of type UTF-8” and it
throws an exception?

No I don’t. I think that I might have had a string that was not
utf-8. I was fetching strings from a file and just doing a
force_encoding because they were supposed to be utf-8 but maybe they were
not.

I’m not sure. Let me see if I can make an example. My trivial examples
so far don’t throw an exception.
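
For reference, one way to provoke an exception is to match a regexp that is fixed to UTF-8 against a non-ASCII string in a different encoding. This is only a sketch: the Latin-1 string and the `try_match` helper are assumptions made for the example, not the data from the original problem.

```ruby
utf8_re = Regexp.new("caf\u00E9")   # non-ASCII pattern, so fixed to UTF-8
latin1  = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)  # Latin-1 bytes

# Hypothetical helper: return the match index, or a description of the error.
def try_match(re, str)
  re =~ str
rescue Encoding::CompatibilityError => e
  "#{e.class}: #{e.message}"
end

puts try_match(utf8_re, latin1)                          # reports a CompatibilityError
puts try_match(/caf/, latin1)                            # 0: ASCII-only regexp adapts
puts try_match(utf8_re, latin1.encode(Encoding::UTF_8))  # 0: transcoded first
```

An ASCII-only regexp never hits this, which may be why trivial examples do not fail.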

pedz · March 2, 2010, 7:26pm

On Fri, Feb 26, 2010 at 3:04 PM, Brian C. wrote:

Bob H. wrote:

Brian, your string19.rb (candlerb/string19 on GitHub) is something
else! I’m laughing with a slightly hysterical edge.

One has to laugh or cry. As best I could, I factored out my opinion of
all this into a separate file:
http://github.com/candlerb/string19/raw/47b0cba0a2047eca0612b4e24a540f011cf2cac3/soapbox.rb

This is exactly the situation I worried about when Matz proposed the
“all encodings” view of Ruby 1.9. Even though many applications won’t
run into this, any that try to deal with >1 encoding at a time will
have a clusterfuck of a time making sure everything fits together. And
this is to say nothing of the implementation effort required, which
still isn’t all there in JRuby (and won’t be until 1.6 or later).

I didn’t read this whole thread, since there’s a lot of “it’s a
bug/it’s not a bug” exploration, but if there’s something we need to
fix in JRuby, please do report it (and try to help fix it, too :)).

Charlie

pedz · March 2, 2010, 11:27pm

One has to laugh or cry. As best I could, I factored out my opinion of
all this into a separate file:
http://github.com/candlerb/string19/raw/47b0cba0a2047eca0612b4e24a540f011cf2cac3/soapbox.rb

re: string1 + string2 + string3 actually working without fear…

One thing that might help would be to set the default encoding, then all
three strings would (might ?) have the same encoding (?)

-rp

pedz · March 3, 2010, 12:43pm

Roger P. wrote:

re: string1 + string2 + string3 actually working without fear…

One thing that might help would be to set the default encoding, then all
three strings would (might ?) have the same encoding (?)

That depends where the strings came from. If they were returned by a
library function (either Ruby core or 3rd party) you won’t know what
encoding they have unless it is documented what the encoding is or how
it is chosen, and it almost never is.
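
When the origin of a string is unknown, compatibility can at least be checked before concatenating. A minimal sketch, with made-up sample strings:

```ruby
utf8   = "caf\u00E9"                                          # UTF-8, non-ASCII
latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)   # Latin-1, non-ASCII
ascii  = "abc".encode(Encoding::UTF_8)                        # UTF-8, ASCII-only

# Encoding.compatible? returns the encoding a concatenation would have,
# or nil if the two strings cannot be combined.
p Encoding.compatible?(ascii, latin1)   # the ISO-8859-1 encoding
p Encoding.compatible?(utf8, latin1)    # nil

begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  puts e.message
end

# Transcoding one side first makes the concatenation safe.
puts((utf8 + latin1.encode(Encoding::UTF_8)).encoding)   # UTF-8
```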

Equally, if you are writing a library for use by other people, then you
really should not touch global state such as Encoding.default_external.
So you are left with Ruby guessing encodings and forcing them if it
guesses wrongly, e.g.

$ ruby19 -e 'puts %x{cat /bin/sh}.encoding'
UTF-8

Of course, if you’re saying that your application handles all strings in
the same encoding, then this whole business of tagging every
individual string object with its own encoding is a waste of time and
effort, and is just something which you have to fight against.

But we’re flogging a dead horse here. I hate this stuff; other people
seem to love it.