UTF-8 support - still stuck

luislavena · March 5, 2011, 6:53pm

OK, I appreciate the feedback on my last post regarding pattern matching
accented French characters. But I am still not getting anywhere.

I’m running Ruby 1.9.2p0.

Here’s the type of pseudo-code I want to use.

====================================

variable = "exagérer"

if variable =~ /érer$/ then
  print "the verb was #{variable}"
end

====================================

I’ve tried using jcode (which is apparently gone), -u extensions, having
the string # coding: UTF-8 at the beginning of the script, etc.

What I really want to do is read in a comprehensive list of verbs (with
various French accented characters), then have a simple I/O where I test
myself on a conjugation. So I need to be able to read, write, and
pattern match the accented characters.

What do I have to do to make this work?

TPL

tlockney · March 5, 2011, 8:32pm

On 05.03.2011 18:53, Thomas L. wrote:

various French accented characters), then have a simple I/O where I test
myself on a conjugation. So I need to be able to read, write, and
pattern match the accented characters.

What do I have to do to make this work?

TPL

Encode your string as UTF-8 and match it against a UTF-8 regexp.
The simplest way to do this is something like this:

==============================
#Encoding: UTF-8

variable = "exagérer"

puts "The verb was #{variable}" if variable =~ /érer/

Ensure that your editor saves the file in UTF-8 (some don’t do this by
default, notably Windows Notepad and SciTE).

If you have the verbs in an external file (which I assume), and that
file is encoded in UTF-8, you can do the following (assuming that there
is one verb per line):

=================================
#Encoding: UTF-8

verbs = File.readlines("verbs.txt")

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/
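One thing the snippet above leaves implicit: File.readlines tags the strings with Ruby's default external encoding, which comes from the locale. Here is a minimal sketch that pins the encoding explicitly instead. The helper name verbs_matching and the sample verbs are made up for illustration, and a temp file stands in for verbs.txt.

```ruby
# encoding: UTF-8
require "tempfile"

# Returns the verbs (one per line in a UTF-8 file) that match the pattern.
# verbs_matching is a hypothetical helper, not part of Ruby.
def verbs_matching(path, pattern)
  # :encoding pins the external encoding so the result does not depend on
  # the locale; chomp drops the trailing newline that readlines keeps.
  File.readlines(path, :encoding => "UTF-8").map(&:chomp).grep(pattern)
end

# A temp file stands in for verbs.txt (sample content is made up).
Tempfile.open("verbs") do |f|
  f.write("exagérer\nparler\nrépéter\n")
  f.flush
  puts verbs_matching(f.path, /érer$/)
end
```

With the sample content only "exagérer" ends in "érer", so that is the only verb printed.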

If the file is in another encoding, e.g. Windows-1252, do

==================================
#Encoding: UTF-8

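# "r:Windows-1252:UTF-8" reads Windows-1252 bytes and transcodes them to UTF-8,
# so the strings are compatible with the UTF-8 regexp below.
verbs = File.open("verbs.txt", "r:Windows-1252:UTF-8"){|f| f.readlines}

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/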
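A rough sketch of why the mode string matters when the file is not UTF-8: with only an external encoding the strings stay tagged Windows-1252, and matching them against a UTF-8 regexp containing accented characters raises Encoding::CompatibilityError. The helper names and the sample file are assumptions made for the demo.

```ruby
# encoding: UTF-8
require "tempfile"

# Creates a small Windows-1252 file to stand in for verbs.txt (made-up content).
def write_cp1252_sample
  file = Tempfile.new("verbs-cp1252")
  file.binmode
  file.write("exagérer\n".encode("Windows-1252"))
  file.flush
  file
end

# External encoding only: the strings come back tagged Windows-1252.
def read_tagged(path)
  File.open(path, "r:Windows-1252") { |f| f.readlines }
end

# External:internal pair: Ruby transcodes to UTF-8 while reading.
def read_transcoded(path)
  File.open(path, "r:Windows-1252:UTF-8") { |f| f.readlines }
end

sample = write_cp1252_sample

begin
  read_tagged(sample.path).first =~ /érer/
rescue Encoding::CompatibilityError => e
  puts "tagged read: #{e.message}"
end

verb = read_transcoded(sample.path).first
puts "The verb was #{verb}" if verb =~ /érer/
```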

The line saying “#Encoding: UTF-8” is a so-called magic comment that
tells Ruby that it should treat the content of this file as
UTF-8-encoded text. If you leave it out, Ruby 1.9 assumes your file is
encoded in US-ASCII, which will cause errors as soon as you start to
use characters not defined in ASCII. As an alternative, you may start
Ruby with the -U (capital U) switch, but I didn’t try this; as far as I
know it only sets the default internal encoding to UTF-8, while -Ku is
the switch that changes the source encoding.

Read up on String#encode and String#force_encoding if you want to
convert between encodings or change the encoding tag of a string without
actually touching the data in it.
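To make the difference concrete, here is a small sketch: String#encode rewrites the bytes, String#force_encoding only changes the label. The variable names are just for illustration.

```ruby
# encoding: UTF-8

utf8 = "exagérer"

# String#encode transcodes: the bytes change, the characters stay the same.
latin1 = utf8.encode("ISO-8859-1")
puts utf8.bytesize                # 9 (é takes two bytes in UTF-8)
puts latin1.bytesize              # 8 (é takes one byte in ISO-8859-1)

# String#force_encoding only relabels: same bytes, different interpretation.
mislabeled = latin1.dup.force_encoding("UTF-8")
puts mislabeled.valid_encoding?   # false (a lone 0xE9 byte is not valid UTF-8)

# Putting the right label back and then transcoding recovers the original.
relabeled = mislabeled.dup.force_encoding("ISO-8859-1")
puts relabeled.encode("UTF-8") == utf8   # true
```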

Since Ruby 1.9, Ruby has quite good support for encodings other than
ASCII.

Just a thought: Is there such a thing as Regexp#encode?

Vale,
Marvin

tlockney · March 5, 2011, 8:53pm

On 05.03.2011 20:31, Quintus wrote:

=================================
#Encoding: UTF-8

What I forgot to mention: Some editors put an invisible BOM (Byte Order
Mark) at the beginning of UTF-8 files. This can cause problems
because the first line is then not read properly. So ensure your
editor doesn’t write the BOM.
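If you are unsure whether your editor wrote a BOM, you can check the first three bytes of the file. A minimal sketch, with a hypothetical helper utf8_bom? and temp files standing in for real scripts:

```ruby
# encoding: UTF-8
require "tempfile"

UTF8_BOM = [0xEF, 0xBB, 0xBF]

# True if the file starts with the three-byte UTF-8 BOM.
# utf8_bom? is a hypothetical helper, not part of Ruby.
def utf8_bom?(path)
  # to_s guards against a missing result for very short or empty files.
  File.binread(path, 3).to_s.bytes.to_a == UTF8_BOM
end

# Writes the given bytes to a temp file and returns it (demo scaffolding).
def temp_with_bytes(bytes)
  file = Tempfile.new("bom-check")
  file.binmode
  file.write(bytes.pack("C*"))
  file.flush
  file
end

magic       = "# encoding: UTF-8\n".bytes.to_a
with_bom    = temp_with_bytes(UTF8_BOM + magic)
without_bom = temp_with_bytes(magic)

puts utf8_bom?(with_bom.path)      # true
puts utf8_bom?(without_bom.path)   # false
```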

Vale,
Marvin

tlockney · March 5, 2011, 11:33pm

You can try and troubleshoot the problems you are having by determining
the encoding of every string in your program.

To determine your source code’s encoding, i.e. what the literal strings
you type in your program get encoded as, do this:

puts __ENCODING__

As mentioned earlier, you set the source code’s encoding with the
comment:

# encoding: UTF-8

which must be at the top of your program file.

To determine a particular string’s encoding, e.g. a string you read from
a file, do this:

puts the_str.encoding.name
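Putting those checks together, a small sketch of a diagnostic you could drop into a script. The describe helper is made up for illustration; everything it calls is standard String and Encoding API.

```ruby
# encoding: UTF-8

# Summarises the facts that matter when a match misbehaves.
# describe is a hypothetical helper, not part of Ruby.
def describe(str)
  "#{str.encoding.name}, #{str.length} chars, #{str.bytesize} bytes, valid=#{str.valid_encoding?}"
end

puts "source:   #{__ENCODING__}"
puts "external: #{Encoding.default_external}"
puts "literal:  #{describe("exagérer")}"
puts "pattern:  #{/érer$/.encoding}"
```

If the string and the pattern report different encodings, or the string reports valid=false, that is usually where the match goes wrong.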

tlockney · March 6, 2011, 12:11am

By the way, if you read the strings from a file, it might be easier to
change the encoding of the regex so that it matches the encoding of the
strings.
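One way to do that, as a sketch: build the regexp from a string that has already been transcoded, since Regexp.new takes over the encoding of a source string that contains non-ASCII characters. The helper name pattern_for and the Windows-1252 default are assumptions for the example.

```ruby
# encoding: UTF-8

# Builds a regexp in the same encoding as the text it will be matched against.
# pattern_for is a hypothetical helper; Windows-1252 is an assumed default.
def pattern_for(source, encoding = "Windows-1252")
  Regexp.new(source.encode(encoding))
end

# Stands in for a line read from a Windows-1252 file.
verb = "exagérer".encode("Windows-1252")
ends_in_erer = pattern_for("érer$")

puts ends_in_erer.encoding
puts "matched" if verb =~ ends_in_erer
```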

I’ve tried using jcode (which is apparently gone),
-u extensions, having the string # coding: UTF-8
at the beginning of the script, etc.

You need to post the exact code you ran along with any error messages,
or the desired output and the actual output. We can’t troubleshoot the
code you are thinking about.

tlockney · March 6, 2011, 9:06am

On 06.03.2011 08:47, Thomas L. wrote:

verb = “appèler”
if ${verb} =~ /èler/ then print “The verb was #{verb}” end

========================================

Don’t leave a blank line between the shebang line and the magic comment.
The magic comment must either be the very first line, or the second one
if you have a shebang.
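For reference, a minimal sketch of that layout. The /usr/bin/env path is an assumption about where Ruby lives on your system, and the test uses plain verb, since ${verb} is not Ruby syntax.

```ruby
#!/usr/bin/env ruby
# encoding: UTF-8

verb = "appèler"
puts "The verb was #{verb}" if verb =~ /èler/
```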

Vale,
Marvin

tlockney · March 7, 2011, 4:22am

Add this line to your ~/.profile

export RUBYOPT="-Ku -rrubygems"

Sadly, there’s no other way to set a global default source encoding in
Ruby 1.9.

tlockney · March 6, 2011, 8:47am

This seems to have worked in Notepad++, set to UTF-8 and with the BOM
off:

======================

#! /bin/ruby -Kn

#Encoding: UTF-8

verb = "appèler"
if( verb =~ /èler/) then print "The verb was #{verb}" end

======================

I think it was the -Kn flag, although I don’t understand what that
changes. I’ll look into it. Thanks for all your help!

tlockney · March 8, 2011, 11:59pm

Thomas L. wrote in post #985708:

This seems to have worked in Notepad++, set to UTF-8 and with the BOM
off:

======================

#! /bin/ruby -Kn

#Encoding: UTF-8

verb = "appèler"
if( verb =~ /èler/) then print "The verb was #{verb}" end

======================

I think it was the -Kn flag, although I don’t understand what that
changes. I’ll look into it. Thanks for all your help!

In Ruby 1.8, there is a global variable called $KCODE whose default value is
‘n’. If you set it to “UTF-8” (or just “U”), then it makes regular
expressions match characters rather than single bytes. If you leave
$KCODE at “N” (the default), then regular expressions will match single
bytes (which you can still change by using the /u flag on your regular
expression).

You can set $KCODE from the command line, e.g. -Ku or -Kn. But
according to this document:

http://slideshow.rubyforge.org/ruby19.html#26

$KCODE no longer does anything in Ruby 1.9.
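In 1.9 the byte-versus-character behaviour follows the encoding of the string being matched rather than a global switch. A small sketch of that, with a made-up helper name:

```ruby
# encoding: UTF-8

# True if the regexp engine sees the string as exactly one character.
# single_char? is a hypothetical helper for this demo.
def single_char?(str)
  !(str =~ /\A.\z/).nil?
end

accented = "\u00E9"          # é
puts accented.length         # 1
puts accented.bytesize       # 2

# Tagged as UTF-8, "." consumes the whole two-byte character.
puts single_char?(accented)  # true

# Relabelled as binary, the same two bytes count as two separate units.
raw = accented.dup.force_encoding("ASCII-8BIT")
puts single_char?(raw)       # false
```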

tlockney · March 9, 2011, 12:52am

On Sun, Mar 6, 2011 at 2:47 AM, Thomas L. wrote:

note.rb:9: syntax error, unexpected tIDENTIFIER, expecting $end

verb = "appΦler"

Are you absolutely certain that your file is UTF-8 encoded?

$ cat i.rb
#Encoding: UTF-8
verb = "appèler"
puts "The verb was #{verb}" if verb =~ /èler/

$ ruby -v i.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
The verb was appèler

$ enca -L none i.rb
Universal transformation format 8 bits; UTF-8

$ iconv -t LATIN1 -f UTF8 < i.rb > l.rb

$ enca -L none l.rb
Unrecognized encoding

$ ruby -v l.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
l.rb:2: invalid multibyte char (UTF-8)
l.rb:2: syntax error, unexpected tIDENTIFIER, expecting $end
verb = "app?ler"
^
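If enca is not available, the same sanity check can be sketched in Ruby itself: read the raw bytes and ask whether they form valid UTF-8. This only validates; it does not guess what the encoding is otherwise. The helper names are made up, and temp files stand in for i.rb and l.rb.

```ruby
# encoding: UTF-8
require "tempfile"

# True if the file's bytes form valid UTF-8 (plain ASCII passes too).
# utf8_file? is a hypothetical helper; it validates, it does not guess.
def utf8_file?(path)
  File.binread(path).force_encoding("UTF-8").valid_encoding?
end

# Writes a string's bytes unchanged to a temp file and returns it (demo scaffolding).
def temp_script(source)
  file = Tempfile.new(["script", ".rb"])
  file.binmode
  file.write(source)
  file.flush
  file
end

source = "verb = \"appèler\"\n"
utf8   = temp_script(source)
latin1 = temp_script(source.encode("ISO-8859-1"))

puts utf8_file?(utf8.path)     # true
puts utf8_file?(latin1.path)   # false
```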