Forum: Ruby Re: Premature end of regular expression with non-ascii chara

unknown (Guest)
on 2006-01-31 01:27
(Received via mailing list)
When I read text with accented characters from a file under Cygwin, the
accents show up as escapes like '\352'.
You can then match them in regexps:

a="un texte extrêmement énervant"
p  splitted_text=a.split(/(?=)/)
b=/extr\352mement/
d=a.match(b)
p  d[0]  # => "extr\352mement"

When I write the result to a file, it appears correctly as
"extrêmement".

f=File.new("t.txt",'w')
f.puts d[0]
f.close
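
On the numbers themselves: '\352' is an octal escape for a single byte. Octal 352 is 234 in decimal, or 0xEA, which is the ISO-8859-1 (and CP1252) code for ê. A minimal sketch, assuming Ruby 2.0 or later (String#b and the Encoding API did not exist in the Ruby 1.8 used in this thread); the helper name is made up for illustration:

```ruby
# Hypothetical helper: interpret raw bytes as ISO-8859-1 and return UTF-8 text.
def latin1_bytes_to_utf8(raw)
  raw.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

octal_escape = "\352".b              # one raw byte, written as an octal escape

p octal_escape.bytes                 # => [234]
p octal_escape.bytes.first == 0xEA   # => true
puts latin1_bytes_to_utf8(octal_escape)  # prints the byte as the character it encodes in Latin-1
```

So the escape is just how Ruby displays a byte it cannot print in the current encoding; the data in the file is unchanged, which is why it reads correctly after being written back out.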

Hope that helps,

Best regards,

Axel
Nick S. (Guest)
on 2006-01-31 13:07
Hi Axel,

thanks for the reply. If I try your code, my accented characters don't get
translated to numbers, unfortunately. Do you know where these numbers come
from? I looked on the net but could not match \352 to the octal, hexadecimal
or UTF-8 representation of ê. Could you split the following sentence for me
and let me know what the result is:

a="Ils sont très énervé les regexps."
splitted_text=a.split(/\s/)

Not my best French. But if I try this, 'très énervé les' is still one
part, even though I split on whitespace. Maybe it behaves differently for
you, in which case I will have to dig deeper. Thanks for your help. If anybody
is able to split it into 'très', 'énervé', 'les', please let me know!

Kind regards,

Nick
Lugovoi N. (Guest)
on 2006-01-31 13:34
(Received via mailing list)
Odds are your text is not in UTF-8 but in CP1252 or a similar encoding.
In that case, with $KCODE = 'u', split indeed won't work correctly.
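
A likely explanation for the unsplit 'très énervé les': in ISO-8859-1, è and é are the single bytes 0xE8 and 0xE9, and in UTF-8 every byte from 0xE0 to 0xEF announces a three-byte character. With $KCODE = 'u' the regexp engine would then treat the accented byte plus the two bytes after it as one character, so a space that falls among those two bytes is never seen by \s. The sketch below assumes Ruby 1.9 or later, where every string carries its own encoding and $KCODE no longer has any effect; the constant and method names are illustrative only.

```ruby
# The tail of Nick's sentence as raw ISO-8859-1 bytes (0xE8 = è, 0xE9 = é).
LATIN1_SAMPLE = "tr\xE8s \xE9nerv\xE9 les".b

# Do these bytes form valid UTF-8? A lone 0xE8 followed by ASCII does not.
def valid_utf8?(raw)
  raw.b.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Tag the bytes with their real encoding, split on whitespace,
# then transcode each word to UTF-8.
def split_latin1(raw)
  raw.b.force_encoding(Encoding::ISO_8859_1)
     .split(/\s+/)
     .map { |word| word.encode(Encoding::UTF_8) }
end

p valid_utf8?(LATIN1_SAMPLE)        # => false
p split_latin1(LATIN1_SAMPLE).size  # => 3
puts split_latin1(LATIN1_SAMPLE)    # one word per line
```

The point of the design is to declare what the bytes actually are before splitting, rather than letting the interpreter guess UTF-8.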

Nick S. (Guest)
on 2006-01-31 14:11
Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven't found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Anyway, I removed $KCODE altogether from config/environment.rb and now it
works. And Axel, I now get the numbers too. I had added the following to
config/environment.rb:

$KCODE = 'u'
require 'jcode'

to get Gettext to work. So it turns out that if you aren't working fully
in UTF-8, you have to be careful about adding this.

Thanks for pointing me to $KCODE, twice!

Kind regards,

Nick
Lugovoi N. (Guest)
on 2006-01-31 14:23
(Received via mailing list)
2006/1/31, Nick S. <removed_email_address@domain.invalid>:
> Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
> is that I would like to work in UTF-8, but I have to read in files. And
> these files are often (almost always) in ISO-8859-1. And I haven't found
> a way of converting these strings to Unicode in Ruby. é and è etc. form
> part of ISO-8859-1.
>

Use the Iconv library.
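
Iconv shipped with Ruby's standard library at the time of this thread; it was removed from the standard library in Ruby 2.0. As a rough sketch of the same conversion on Ruby 1.9 or later, assuming the input really is ISO-8859-1, String#encode and an external:internal encoding pair on File.read can do the job. The method names below are invented for illustration.

```ruby
# Hypothetical helper: transcode an ISO-8859-1 string to UTF-8.
def latin1_to_utf8(text)
  text.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Hypothetical helper: read an ISO-8859-1 file and get a UTF-8 string back.
# "r:ISO-8859-1:UTF-8" means external encoding ISO-8859-1, internal UTF-8,
# so the transcoding happens while reading.
def read_latin1_file(path)
  File.read(path, mode: "r:ISO-8859-1:UTF-8")
end

utf8 = latin1_to_utf8("\xE9nerv\xE9")
puts utf8.encoding       # name of the resulting encoding
p utf8.valid_encoding?   # => true
```

Converting once at the file boundary keeps the rest of the program in a single encoding.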
Nick S. (Guest)
on 2006-01-31 15:18
Hi Nikolai,

thanks for the suggestion, I will definitely give Iconv a try. I hope it
doesn't slow things down a lot.

Kind regards,

Nick
Lars B. (Guest)
on 2006-02-02 00:40
(Received via mailing list)
Nick S. wrote:
> Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
> is that I would like to work in UTF-8, but I have to read in files. And
> these files are often (almost always) in ISO-8859-1. And I haven't found
> a way of converting these strings to Unicode in Ruby. é and è etc. form
> part of ISO-8859-1.

I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from latin1 or
latin15 to utf8 like this:
	utf8_string=latin1_string.unpack("C*").pack("U*")

and the other way round with
	latin1_string=utf8_string.unpack("U*").pack("C*")

This has worked so far and does not require any changes to the environment.
HTH,
Lars
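
The pack/unpack round trip above works because ISO-8859-1 byte values coincide with the first 256 Unicode code points. That does not hold at every position of ISO-8859-15: byte 0xA4 is the euro sign there, but the byte-to-code-point trick turns it into U+00A4 (the generic currency sign). A small check, assuming Ruby 1.9 or later for the String#encode comparison; the method names are illustrative:

```ruby
# The conversion from the post: each byte value becomes the code point
# with the same number.
def bytes_to_utf8(raw)
  raw.unpack("C*").pack("U*")
end

# The reverse direction; only meaningful for code points up to 255.
def utf8_to_bytes(text)
  text.unpack("U*").pack("C*")
end

umlauts    = "\xE4\xF6\xFC".b   # ä ö ü as ISO-8859-1 bytes
via_pack   = bytes_to_utf8(umlauts)
via_encode = umlauts.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

p via_pack == via_encode               # => true
p utf8_to_bytes(via_pack) == umlauts   # => true

euro     = "\xA4".b             # the euro sign in ISO-8859-15
real_eur = euro.dup.force_encoding(Encoding::ISO_8859_15).encode(Encoding::UTF_8)
p bytes_to_utf8(euro) == real_eur      # => false
```

For plain Latin-1 data the two approaches agree; for anything else a real transcoder is the safer choice.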