When I read text with accents from a file under Cygwin, the accented
characters get converted to something like ‘\352’.
You can then search for these using regexps:
Thanks for the reply. If I try your code, my characters with accents
don’t get translated to numbers, unfortunately. Do you know where these
numbers come from? I looked on the net, but \352 does not seem to be the
octal, hexadecimal or UTF-8 representation of ê. Could you split the
following sentence for me and let me know what the result is:
Indeed, it isn’t in UTF-8. It’s in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven’t found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.
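
For anyone wondering where ‘\352’ comes from: it appears to be the octal escape for the single byte 0xEA (decimal 234), which is ê in ISO-8859-1. A small sketch to check this, assuming a Ruby recent enough to have String#b (2.0 or later); the variable names are only illustrative:

```ruby
# "\352" is an octal escape: one byte with the value 0352 (octal) = 0xEA = 234.
latin1_byte = "\352".b

# Read the raw byte value.
byte_value = latin1_byte.unpack("C*").first

# In ISO-8859-1 the byte value equals the Unicode code point,
# so 234 maps to U+00EA, which is ê.
as_utf8 = [byte_value].pack("U*")

# The same character needs two bytes in UTF-8 (0xC3 0xAA),
# which is why \352 does not show up in a UTF-8 table.
utf8_bytes = as_utf8.unpack("C*")

puts byte_value          # decimal value of the byte
puts as_utf8             # the character as a UTF-8 string
puts utf8_bytes.inspect  # its UTF-8 byte sequence
```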
I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from latin1 or
latin15 to utf8 like this:
utf8_string = latin1_string.unpack("C*").pack("U*")
and the other way round with
latin1_string = utf8_string.unpack("U*").pack("C*")
This has worked so far and does not require any changes to the environment.
HTH,
Lars
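
The pack/unpack round trip from the post above can be sketched as follows. The helper names latin1_to_utf8 and utf8_to_latin1 are hypothetical, not from the thread. Two caveats, as far as I can tell: the trick is only exact for ISO-8859-1, where byte values equal Unicode code points (ISO-8859-15 differs in a few positions, e.g. byte 0xA4 is the euro sign), and the reverse direction only makes sense for characters up to U+00FF. The last lines assume Ruby 1.9 or later, where String#encode is available as an alternative:

```ruby
# Hypothetical helper: ISO-8859-1 bytes -> UTF-8 string.
def latin1_to_utf8(latin1_string)
  latin1_string.unpack("C*").pack("U*")
end

# Hypothetical helper: UTF-8 string -> ISO-8859-1 bytes.
# Only meaningful when every character is at or below U+00FF.
def utf8_to_latin1(utf8_string)
  utf8_string.unpack("U*").pack("C*")
end

latin1 = "\xE4\xF6\xFC".b            # "äöü" as raw ISO-8859-1 bytes
utf8 = latin1_to_utf8(latin1)        # "äöü" as a UTF-8 string
round_trip = utf8_to_latin1(utf8)    # back to the original three bytes

# Alternative on Ruby 1.9+: tag the bytes with their real encoding, then transcode.
utf8_via_encode = latin1.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# ISO-8859-15 needs a real transcoding step: byte 0xA4 is the euro sign there.
euro = "\xA4".b.force_encoding("ISO-8859-15").encode("UTF-8")

puts utf8
puts round_trip.bytes.inspect
puts euro
```

If the input may contain ISO-8859-15 characters such as the euro sign, the encode variant is probably the safer choice, since the pack/unpack version would map byte 0xA4 to U+00A4 instead.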