Re: Premature end of regular expression with non-ascii chara

unknown · January 31, 2006, 12:27am

When I read in a text with accents from a file under cygwin, these get
converted to something like ‘\352’.
You can then search for these using regexps:

a="un texte extrêmement énervant"
p splitted_text=a.split(/(?=)/)
b=/extr\352mement/
d=a.match(b)
p d[0] # => extr\352mement

When I write the result to a file, it appears correctly as
“extrêmement”.

f=File.new("t.txt",'w')
f.puts d[0]
f.close

Hope that helps,

Best regards,

Axel

unknown · January 31, 2006, 12:07pm

Hi Axel,

thanks for the reply. If I try your code, my characters with accents
don’t get translated to numbers, unfortunately. Do you know where these
numbers come from? I looked on the net, but \352 does not seem to be the
octal, hexadecimal or UTF-8 representation of ê. Could you split the
following sentence for me and let me know what the result is:

a="Ils sont très énervé les regexps."
splitted_text=a.split(/\s/)

Not my best French. But if I try this, ‘très énervé les’ is still one
part, even though I split on the spaces. Maybe it is different for
you, and then I will have to look deeper. Thanks for your help. If anybody
is able to split it into ‘très’, ‘énervé’, ‘les’, please let me know!

Kind regards,

Nick

unknown · January 31, 2006, 12:34pm

Odds are your text is not in UTF-8 but in CP1252 or a similar encoding.
In that case, with $KCODE = ‘u’, split indeed won’t work right.
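
The `\352` from earlier in the thread is an octal escape: 352 in octal is 0xEA, the ISO-8859-1 byte for ê. Below is a minimal sketch of why such bytes break under a UTF-8 interpretation. It assumes Ruby 1.9 or later, where strings carry their own encoding and `$KCODE` is no longer consulted; the variable names are illustrative only.

```ruby
# Sketch assuming Ruby 1.9 or later; the thread itself ran on Ruby 1.8.
# "\352" is an octal escape: 0352 == 0xEA == 234, the ISO-8859-1 byte for "ê".
octal_byte = "\352".bytes.first
puts octal_byte == 0xEA          # => true

# "très énervé les" written as raw ISO-8859-1 bytes (\350 = è, \351 = é).
raw = "tr\350s \351nerv\351 les"

# Read as UTF-8, these bytes form invalid sequences.
as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?     # => false

# Tagged as ISO-8859-1, the same bytes are valid and split on whitespace.
as_latin1 = raw.dup.force_encoding("ISO-8859-1")
puts as_latin1.valid_encoding?   # => true
words = as_latin1.split(/\s/)
puts words.length                # => 3
```

The bytes never change here; only the encoding label does, which is roughly the role `$KCODE` played for regexps on Ruby 1.8.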


unknown · January 31, 2006, 1:11pm

Indeed, it isn’t in UTF-8. It’s in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven’t found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Anyway, I removed $KCODE altogether from config/environment.rb and now it
works. And Axel, I now get the numbers too. I had added the following to
config/environment.rb to get Gettext to work:

$KCODE = 'u'
require 'jcode'

So it turns out that if you aren’t working fully in UTF-8, you have to be
careful about adding this.

Thanks for pointing me to $KCODE, twice!

Kind regards,

Nick

unknown · January 31, 2006, 1:23pm

2006/1/31, Nick S.:

Indeed, it isn’t in UTF-8. It’s in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven’t found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Use the Iconv library.
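
For readers on a current Ruby: Iconv was part of the Ruby 1.8 standard library but was removed in Ruby 2.0, and `String#encode` now covers the same conversion. The following is a rough sketch under that assumption; `latin1_to_utf8` is a hypothetical helper name, not something from the thread.

```ruby
# Sketch assuming Ruby 2.1 or later (Tempfile.create). On Ruby 1.8 the
# equivalent call was Iconv.conv("UTF-8", "ISO-8859-1", str); Iconv left
# the standard library in 2.0.
require "tempfile"

# Hypothetical helper: reinterpret raw bytes as ISO-8859-1, transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Nick's sentence as raw ISO-8859-1 bytes (\350 = è, \351 = é).
raw = "Ils sont tr\350s \351nerv\351 les regexps."

utf8 = latin1_to_utf8(raw)
puts utf8.valid_encoding?      # => true
puts utf8.split(/\s/).length   # => 6

# Transcoding while reading: "ISO-8859-1:UTF-8" means external:internal encoding.
from_file = nil
Tempfile.create("latin1") do |f|
  f.binmode
  f.write(raw)
  f.flush
  from_file = File.read(f.path, encoding: "ISO-8859-1:UTF-8")
end
puts from_file == utf8         # => true
```

Converting once at the file boundary keeps the rest of the program in UTF-8, which is the setup Nick describes wanting.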

unknown · January 31, 2006, 2:18pm

Hi Nikolai,

thanks for the suggestion, I will definitely give Iconv a try. I hope it
doesn’t slow things down a lot.

Kind regards,

Nick

unknown · February 1, 2006, 11:40pm

Nick S. wrote:

Indeed, it isn’t in UTF-8. It’s in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven’t found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from latin1 or
latin15 to utf8 via this:
utf8_string=latin1_string.unpack("C*").pack("U*")

and the other way round with:
latin1_string=utf8_string.unpack("U*").pack("C*")

This has worked so far and does not require any changes to the environment.
HTH,
Lars
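
The pack/unpack round trip above can be checked with a short sketch. It assumes Ruby 1.9 or later for the encoding-aware calls; the variable names are illustrative. One caveat worth showing: the trick treats each byte value as a Unicode code point, which holds for ISO-8859-1 but not for the positions where ISO-8859-15 differs, such as the euro sign at 0xA4.

```ruby
# Sketch assuming Ruby 1.9 or later; the pack/unpack calls are the ones from the post.
latin1 = "\344\366\374"   # bytes 0xE4 0xF6 0xFC: "äöü" in ISO-8859-1

# Each byte value becomes a Unicode code point, emitted as UTF-8.
utf8 = latin1.unpack("C*").pack("U*")
puts utf8.encoding                   # => UTF-8
puts utf8.bytes.length               # => 6 (two bytes per umlaut)

# And back: each code point becomes a single byte again.
back = utf8.unpack("U*").pack("C*")
puts back.bytes == latin1.bytes      # => true

# Caveat: in ISO-8859-15 the byte 0xA4 is the euro sign (U+20AC),
# but the byte-to-code-point trick yields U+00A4 instead.
euro_byte = "\244"
via_pack   = euro_byte.unpack("C*").pack("U*").unpack("U*")
via_encode = euro_byte.dup.force_encoding("ISO-8859-15").encode("UTF-8").unpack("U*")
puts via_pack.inspect                # => [164]
puts via_encode.inspect              # => [8364]
```

For plain Latin-1 input the two approaches agree; for ISO-8859-15 or CP1252 input, an explicit transcoding step is the safer choice.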