UTF in Regexp


#1

I am sorrrrrry, but I am banging my head against this, and can’t seem to
find the answer!

Text gets displayed in an input field in a web page with “
prepended and ” appended to the string (needs to be inside the
string otherwise it looks funny). The user edits it, and when it comes
back to the (Rails) backend, the new string with (possibly) these quotes
attached comes back, but in unicode.

So the string possibly starts with a UTF-8 “ and possibly ends with a
UTF-8 ”

I want to do a regexp removal. Here is what works (but I am embarrassed):

ldquo = '123'; ldquo[0] = 226; ldquo[1] = 128; ldquo[2] = 156
rdquo = '123'; rdquo[0] = 226; rdquo[1] = 128; rdquo[2] = 157
string.gsub!(/(\A#{ldquo}|#{rdquo}\Z)/,'')

There must be a better way.

Abu Mats al-Nemsi


#2

On 2/3/07, Wido M. removed_email_address@domain.invalid wrote:

So the string possibly starts with a UTF-8 “ and possibly ends with a
UTF-8 ”

I want to do a regexp removal. Here is what works (but I am embarrassed):

ldquo = '123'; ldquo[0] = 226; ldquo[1] = 128; ldquo[2] = 156
rdquo = '123'; rdquo[0] = 226; rdquo[1] = 128; rdquo[2] = 157
string.gsub!(/(\A#{ldquo}|#{rdquo}\Z)/,'')

There must be a better way.

  1. It’s possible to insert the chars directly, either in octal (226 =
    "\342") or hex (226 = "\xe2"):

string.gsub!(/\A\xe2\x80\x9c|\xe2\x80\x9d\Z/, '')
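
A minimal sketch of the escape forms mentioned above, assuming Ruby 2.0 or later with the default UTF-8 source encoding (this thread predates that, so the `\u` escape is an addition here, not something the posts use). The helper name `strip_curly_quotes` is hypothetical and only for illustration.

```ruby
# Three spellings of the same three UTF-8 bytes (226, 128, 156).
octal   = "\342\200\234"   # octal byte escapes
hex     = "\xe2\x80\x9c"   # hexadecimal byte escapes
unicode = "\u201C"         # code point escape for the left curly quote

puts octal == hex && hex == unicode
p unicode.bytes

# Hypothetical helper: drop one leading left quote and one trailing
# right quote, leaving any quotes inside the text alone.
def strip_curly_quotes(text)
  text.gsub(/\A\u201C|\u201D\z/, '')
end

puts strip_curly_quotes("\u201Chello\u201D")
```

Writing the code points rather than the raw bytes keeps the pattern readable and avoids building the strings byte by byte.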

  2. | has low priority, so your regex is equal to /(\Aldquo)|(rdquo\Z)/.
    To apply both anchors to the whole alternation, use a group (notice
    the non-capturing group (?:…)):

string.gsub!(/\A(?:\xe2\x80\x9c|\xe2\x80\x9d)\Z/, '')
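
To see how grouping changes what gets matched, here is a small sketch, assuming Ruby 2.0 or later (the variable names are made up for the example). In it, the ungrouped pattern strips a leading or trailing quote from a longer string, while the grouped pattern only matches when the whole string is a single quote character.

```ruby
ldquo = "\u201C"
rdquo = "\u201D"

# Without a group: "ldquo at the start" OR "rdquo at the end".
ungrouped = /\A#{ldquo}|#{rdquo}\z/
# With a group: both anchors apply to the whole alternation.
grouped   = /\A(?:#{ldquo}|#{rdquo})\z/

quoted = "#{ldquo}hello#{rdquo}"

p quoted.gsub(ungrouped, '') == "hello"   # both quotes removed
p quoted.gsub(grouped, '') == quoted      # nothing matched, string unchanged
p ldquo.gsub(grouped, '').empty?          # a lone quote does match
```

So for stripping quotes off an edited string, the low priority of | is what lets each anchor stay with its own quote.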

  3. There’s also the iconv library, which will convert things for you.
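
Iconv was removed from the standard library in Ruby 2.0, so on a current Ruby the closest built-in is String#encode. The following is only a sketch of that swapped-in approach, not the iconv call the post has in mind; the `replacements` table is an assumption chosen for this example.

```ruby
curly = "\u201Chello\u201D"

# Characters that have no US-ASCII equivalent are looked up in this table.
replacements = { "\u201C" => '"', "\u201D" => '"' }

ascii = curly.encode("US-ASCII", fallback: replacements)
p ascii
```

After the conversion the quotes are plain ASCII, so an ordinary pattern such as /\A"|"\z/ is enough to strip them.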