Re: Premature end of regular expression with non-ascii chara

When I read in a text with accents from a file under cygwin, these get
converted to something like ‘\352’.
You can then search for these using regexps:

a="un texte extrêmement énervant"
p splitted_text=a.split(/(?=)/)
b=/extr\352mement/
d=a.match(b)
p d[0]   # => "extr\352mement"

When I write the result to a file, it appears correctly as
“extrêmement”.

f=File.new("t.txt",'w')
f.puts d[0]
f.close

Hope that helps,

Best regards,

Axel

Hi Axel,

Thanks for the reply. If I try your code, my characters with accents
don’t get translated to numbers, unfortunately. Do you know where these
numbers come from? I looked on the net, but as far as I can tell \352 is
not the octal, hexadecimal or UTF-8 representation of ê. Could you split
the following sentence for me and let me know what the result is:

a="Ils sont très énervé les regexps."
splitted_text=a.split(/\s/)

Not my best French. But if I try this, ‘très énervé les’ is still one
part, even though I split it on the spaces. Maybe it is different for
you, and then I have to look deeper. Thanks for your help. If anybody is
able to split it into ‘très’, ‘énervé’, ‘les’, please let me know!

Kind regards,

Nick
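On where the numbers come from: \352 looks like an octal escape for a single byte. Octal 352 is hexadecimal EA (decimal 234), which is the ISO-8859-1 code for ê; in UTF-8 the same character would be the two bytes \303\252. A minimal sketch to check the arithmetic (the variable names are only for illustration):

```ruby
# "\352" is an octal escape for one byte.
byte        = "\352".unpack("C").first   # => 234
octal_value = 0352                       # Ruby octal literal, also 234
hex_value   = 0xEA                       # hexadecimal, also 234

# In ISO-8859-1, byte 0xEA is "ê", i.e. Unicode code point U+00EA.
# Packing that code point with "U" yields its UTF-8 encoding.
utf8_bytes = [byte].pack("U").unpack("C*")   # => [195, 170], i.e. \303\252
```

So a single \352 in the output suggests the text was read as ISO-8859-1 (or a close relative such as CP1252), not as UTF-8.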

The odds are your text is not in UTF-8, but in CP1252 or a similar
encoding.
Then indeed, if $KCODE = 'u', split won’t work right.
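A rough way to see the mismatch, assuming Ruby 1.9 or later ($KCODE only exists in Ruby 1.8; newer versions attach an encoding to each string instead): the same bytes are not valid UTF-8, but they split as expected once they are labelled as ISO-8859-1. The sample bytes and variable names below are illustrative.

```ruby
# "très énervé" as raw ISO-8859-1 bytes (è = 0xE8, é = 0xE9).
latin1_bytes = "tr\xE8s \xE9nerv\xE9".b

# Interpreted as UTF-8, the byte sequence is invalid.
as_utf8 = latin1_bytes.dup.force_encoding("UTF-8")
utf8_ok = as_utf8.valid_encoding?        # => false

# Labelled with the encoding it was actually written in, it is valid
# and splits on the space.
as_latin1 = latin1_bytes.dup.force_encoding("ISO-8859-1")
latin1_ok = as_latin1.valid_encoding?    # => true
words     = as_latin1.split(/\s/).map { |w| w.encode("UTF-8") }
```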

2006/1/31, Nick S.:

Indeed, it isn’t in UTF-8. It’s in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven’t found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Anyway, I removed $KCODE altogether from config/environment.rb and now it
works. And Axel, I also get the numbers now. In config/environment.rb I
had added:

$KCODE = 'u'
require 'jcode'

to get Gettext to work. So it turns out that if you aren’t working fully
in UTF-8, you have to be careful about adding this.

Thanks for pointing me to $KCODE, twice!

Kind regards,

Nick
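For reading ISO-8859-1 files while working in UTF-8, one option on Ruby 1.9 or later (an assumption; this is not available in Ruby 1.8) is to let the file layer transcode at read time via an "external:internal" encoding pair in the open mode. A minimal sketch; the helper name and the temporary sample file are hypothetical:

```ruby
require 'tempfile'

# Read a file stored as ISO-8859-1 and return its contents as UTF-8.
def read_latin1_as_utf8(path)
  File.open(path, "r:ISO-8859-1:UTF-8") { |f| f.read }
end

# Write a small ISO-8859-1 sample file to try it on.
sample = Tempfile.new("latin1_sample")
sample.close
File.binwrite(sample.path, "Ils sont tr\xE8s \xE9nerv\xE9 les regexps.".b)

text  = read_latin1_as_utf8(sample.path)
words = text.split(/\s/)   # six words, accents intact
```

The strings that reach the rest of the program are then already UTF-8, so no per-string conversion is needed afterwards.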

2006/1/31, Nick S.:

Indeed, it isn’t in UTF-8. It’s in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven’t found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Use the Iconv library.
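A sketch of what that could look like. Iconv.conv takes the target encoding first, then the source encoding, then the string. Since Iconv was removed from the standard library in Ruby 2.0, the hypothetical helper below only falls back to it when String#encode is not available:

```ruby
# Convert a string of ISO-8859-1 bytes to UTF-8.
def latin1_to_utf8(latin1_bytes)
  if String.method_defined?(:encode)
    # Ruby 1.9+: label the bytes as ISO-8859-1, then transcode.
    latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
  else
    # Ruby 1.8: Iconv.conv(to, from, string)
    require 'iconv'
    Iconv.conv("UTF-8", "ISO-8859-1", latin1_bytes)
  end
end

utf8 = latin1_to_utf8("tr\xE8s \xE9nerv\xE9")
```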

Hi Nikolai,

Thanks for the suggestion, I will definitely give Iconv a try. I hope it
doesn’t slow things down a lot.

Kind regards,

Nick

Nick S. wrote:

Indeed, it isn’t in UTF-8. It’s in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven’t found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from latin1 or
latin15 to UTF-8 via this:
utf8_string=latin1_string.unpack("C*").pack("U*")

and the other way round with:
latin1_string=utf8_string.unpack("U*").pack("C*")

This has worked so far and does not require any changes to the environment.
HTH,
Lars
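Why this works, as far as I can tell: in ISO-8859-1 every byte value equals the Unicode code point of the character it stands for, so unpacking bytes ("C*") and repacking them as code points ("U*") gives valid UTF-8. That equality does not hold for every position in ISO-8859-15 or CP1252 (the euro sign, for example), so treat the trick as Latin-1 only. A small round-trip check with illustrative names:

```ruby
# "äöü" as raw ISO-8859-1 bytes.
latin1_string = "\xE4\xF6\xFC".b

# Bytes -> code points -> UTF-8.
utf8_string = latin1_string.unpack("C*").pack("U*")
utf8_bytes  = utf8_string.unpack("C*")
# => [0xC3, 0xA4, 0xC3, 0xB6, 0xC3, 0xBC]

# UTF-8 -> code points -> single bytes again.
round_trip = utf8_string.unpack("U*").pack("C*")
same_bytes = round_trip.unpack("C*") == latin1_string.unpack("C*")   # => true
```

The reverse direction only makes sense for text known to stay within Latin-1, since other characters do not fit into a single byte.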