Forum: Ruby Re: Premature end of regular expression with non-ascii chara

unknown (Guest)
on 2006-01-31 00:27
(Received via mailing list)
When I read text with accented characters from a file under Cygwin, the
accented characters show up as something like '\352'.
You can then search for them using regexps:

a = "un texte extrêmement énervant"
p splitted_text = a.split(/(?=)/)
b = /extr\352mement/
d = a.match(b)
p d[0]  # => "extr\352mement"

When I write the result to a file, it appears correctly as
"extrêmement".

f=File.new("t.txt",'w')
f.puts d[0]
f.close

Hope that helps,

Best regards,

Axel
Nick Snels (nicksnels)
on 2006-01-31 12:07
Hi Axel,

thanks for the reply. If I try your code, my accented characters
unfortunately don't get translated to numbers. Do you know where these
numbers come from? I looked on the net, but \352 does not look like the
octal, hexadecimal or UTF-8 representation of ê. Could you split the
following sentence for me and let me know what the result is:

a="Ils sont très énervé les regexps."
splitted_text=a.split(/\s/)

Not my best French. But if I try this, 'très énervé les' is still one
part, even though I split on the spaces. Maybe it is different for
you, and then I will have to look deeper. Thanks for your help. If anybody
is able to split it into 'très', 'énervé', 'les', please let me know!

Kind regards,

Nick
Lugovoi Nikolai (Guest)
on 2006-01-31 12:34
(Received via mailing list)
The odds are that your text is not in UTF-8 but in CP1252 or a similar
encoding.
In that case, if $KCODE = 'u' is set, split indeed won't work correctly.
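
The '\352' seen earlier in the thread can be checked directly. The following is only a sketch, not something from the original posters: it assumes a current Ruby (1.9 or later, where strings carry an encoding), and the helper names latin1_bytes_to_utf8 and valid_utf8? are made up for illustration. It suggests that \352 is an octal escape for the single byte 0xEA, which is "ê" in ISO-8859-1 and CP1252 but is not a complete UTF-8 sequence on its own.

```ruby
# Hypothetical helper: treat raw bytes as ISO-8859-1 and transcode them to UTF-8.
def latin1_bytes_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Hypothetical helper: would these raw bytes be well-formed UTF-8?
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

byte = "\352".b                   # octal escape: a single raw byte
code = byte.unpack("C").first     # numeric value of that byte

puts code.to_s(8)                 # 352 (octal)
puts code.to_s(16)                # ea  (hexadecimal)
puts latin1_bytes_to_utf8(byte)   # ê
puts valid_utf8?(byte)            # false: a lone 0xEA is not a complete UTF-8 sequence
```

That mismatch is the likely reason UTF-8-aware matching misreads Latin-1 input: 0xEA announces a three-byte UTF-8 character, so the bytes that follow it can be swallowed as continuation bytes.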

Nick Snels (nicksnels)
on 2006-01-31 13:11
Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven't found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Anyway, I removed $KCODE altogether from config/environment.rb and now it
works. And Axel, I now get the numbers too. In config/environment.rb I
had added:

$KCODE = 'u'
require 'jcode'

to get Gettext to work. So it turns out that if you aren't working fully
in UTF-8, you have to be careful about adding this.

Thanks for pointing me to $KCODE, twice!

Kind regards,

Nick
Lugovoi Nikolai (Guest)
on 2006-01-31 13:23
(Received via mailing list)
2006/1/31, Nick Snels <nick.snels@gmail.com>:
> Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
> is that I would like to work in UTF-8, but I have to read in files. And
> these files are often (almost always) in ISO-8859-1. And I haven't found
> a way of converting these strings to Unicode in Ruby. é and è etc. form
> part of ISO-8859-1.
>

Use the Iconv library.
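
A minimal sketch of that conversion, under one loudly stated assumption: Iconv was removed from the standard library in Ruby 2.0, so on a current Ruby the built-in String#encode is swapped in for it here. The helper name latin1_to_utf8 is hypothetical, and the sample bytes are Nick's sentence written out as ISO-8859-1.

```ruby
# Hypothetical helper: transcode ISO-8859-1 bytes to UTF-8.
# String#encode stands in for Iconv, which current Rubies no longer ship.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Nick's sentence as raw Latin-1 bytes: "è" is 0xE8 and "é" is 0xE9.
raw = "Ils sont tr\xE8s \xE9nerv\xE9 les regexps.".b

text = latin1_to_utf8(raw)
p text.encoding       # the result is tagged as UTF-8
p text.split(/\s/)    # six words, with "très" and "énervé" as separate entries
```

When the data comes from a file, File.read("t.txt", encoding: "ISO-8859-1:UTF-8") should do the same transcoding at read time, so the rest of the program only ever sees UTF-8.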
Nick Snels (nicksnels)
on 2006-01-31 14:18
Hi Nikolai,

thanks for the suggestion, I will definitely give Iconv a try. I hope it
doesn't slow things down a lot.

Kind regards,

Nick
Lars Broecker (Guest)
on 2006-02-01 23:40
(Received via mailing list)
Nick Snels wrote:
> Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
> is that I would like to work in UTF-8, but I have to read in files. And
> these files are often (almost always) in ISO-8859-1. And I haven't found
> a way of converting these strings to Unicode in Ruby. é and è etc. form
> part of ISO-8859-1.

I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from latin1 or
latin15 to UTF-8 with
	utf8_string=latin1_string.unpack("C*").pack("U*")

and the other way round with
	latin1_string=utf8_string.unpack("U*").pack("C*")

This has worked so far and does not require any changes to the environment.
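
The two one-liners can be tried as a round trip. This is only a sketch, assuming a current Ruby; the method names are hypothetical wrappers around the expressions above. The trick is exact for ISO-8859-1, because its byte values coincide with the first 256 Unicode code points. If "latin15" means ISO-8859-15, a few bytes differ (0xA4 is the euro sign there, for example) and would come out as the wrong character.

```ruby
# Hypothetical wrapper: each Latin-1 byte value becomes a Unicode code point.
def latin1_to_utf8_via_pack(latin1_string)
  latin1_string.unpack("C*").pack("U*")
end

# Hypothetical wrapper: each code point becomes one byte again.
# Only meaningful for text whose code points all fit in a single byte.
def utf8_to_latin1_via_pack(utf8_string)
  utf8_string.unpack("U*").pack("C*")
end

latin1 = "\xE4\xF6\xFC".b                     # "äöü" as ISO-8859-1 bytes
utf8   = latin1_to_utf8_via_pack(latin1)

p utf8                                        # the three umlauts as UTF-8 text
p utf8.encoding                               # pack("U*") tags its result as UTF-8
p utf8_to_latin1_via_pack(utf8) == latin1     # the round trip restores the bytes
```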
HTH,
Lars