Pattern matching French accented characters

luislavena · March 1, 2011, 11:36am

I am writing a French conjugation testing script, and a significant
problem I have run into is how to pattern match the accented characters
used in the French language. For example, é, à, è, î, ï, etc.

I’ve tried a number of approaches, but can’t seem to make it work.
After some research on the Internet, it may require a UTF-8 approach,
but I am not familiar with it.

As an example, assume I want to directly pattern match the French verb
haïr, and distinguish it from other verbs ending in -ir. How would I do
this?

Thanks in advance.

TPL

tlockney · March 1, 2011, 9:17pm

If you are not familiar with unicode, and you want to match utf-8
characters, then you better start reading some unicode tutorials. If
you are already familiar with unicode in general, then in ruby you can
set the $KCODE variable to ‘U’ for UTF-8, and then you can require the
jcode standard library, which will change the way regexes work–they
will match characters rather than single bytes.

It also depends on what version of Ruby you are using: $KCODE and jcode
are Ruby 1.8 features, and Ruby 1.9 handles string encodings differently.

See here:

http://blog.grayproductions.net/articles/the_kcode_variable_and_jcode_library

And here is an example:

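str = 'BàD'
puts str.size              # byte count in Ruby 1.8

puts '----'

str.scan(/(.)/) {puts $1}  # without $KCODE, . matches one byte at a time
puts

$KCODE = 'U'
require 'jcode'

puts str.jsize             # character count, provided by jcode
puts "----"

str.scan(/(.)/) {puts $1}  # now . matches one character at a time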

--output:--
4
----
B
�
�
D

3
----
B
à
D
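For comparison, here is a rough sketch of the same experiment on Ruby 1.9 or later. That version is an assumption about your setup; there, strings carry their own encoding, so neither $KCODE nor jcode is involved. The helper name `describe` is made up for illustration.

```ruby
# Ruby 1.9+ sketch: no $KCODE, no jcode.
# describe is a hypothetical helper, not a library method.
def describe(str)
  {
    chars:     str.length,     # number of characters
    bytes:     str.bytesize,   # number of bytes in the string's encoding
    each_char: str.scan(/./)   # the regex matches whole characters
  }
end

info = describe("B\u00E0D")    # "BàD", written with an escape for U+00E0
puts info[:chars]              # 3
puts info[:bytes]              # 4
```

Here String#length plays the role of jsize, and String#bytesize gives the byte count that String#size returned in 1.8.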

tlockney · March 1, 2011, 9:42pm

7stud – wrote in post #984785:

If you are not familiar with unicode, and you want to match utf-8
characters, then you better start reading some unicode tutorials. If
you are already familiar with unicode in general, then in ruby you can
set the $KCODE variable to ‘U’ for UTF-8, and then you can require the
jcode standard library, which will change the way regexes work–they
will match characters rather than single bytes.

Uhhmm…you don’t need to require ‘jcode’ to make regexes match
characters rather than bytes–just set $KCODE = ‘U’ (or ‘UTF-8’). The
jcode library just gives you some methods like jsize to get the
character length rather than the byte length, which is what String#size
returns.

As an alternative, you can set the /u flag for a regex to make it match
utf-8 characters rather than bytes.
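To tie this back to the original question, here is a minimal sketch of matching haïr with the /u flag. The constant and method names are invented for the example, and it assumes the ï is stored as the single code point U+00EF (an i followed by a separate combining diaeresis would not match).

```ruby
# Hypothetical names; only standard Ruby regex syntax is used.
# \u00EF is the code point for "ï"; /u marks the regex as UTF-8.
HAIR    = /\Aha\u00EFr\z/u     # exactly the verb "haïr"
IR_VERB = /[i\u00EF]r\z/u      # any verb ending in -ir or -ïr

def hair?(verb)
  !!(verb =~ HAIR)
end

def other_ir_verb?(verb)
  !!(verb =~ IR_VERB) && !hair?(verb)
end

verbs = ["ha\u00EFr", "finir", "choisir", "parler"]
verbs.each do |v|
  puts "#{v}: hair=#{hair?(v)} other_ir=#{other_ir_verb?(v)}"
end
```

Note that a plain /ir\z/ does not match haïr at all, because ï and i are different characters; that is why the second regex lists both.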

tlockney · March 2, 2011, 7:30pm

Thanks, guys. I’ll take a shot in that direction.

tlockney · March 2, 2011, 4:51am

7stud – wrote in post #984789:

7stud – wrote in post #984785:

If you are not familiar with unicode, and you want to match utf-8
characters, then you better start reading some unicode tutorials.

Here is a short one, ‘unicode in three rules’:

Unicode assigns an integer to every letter in every alphabet in the
world. Currently, there are something like 100,000 letters.

Now the question becomes: what is the best way to store those unicode
integers (which represent characters) on a computer? The way in which
you decide to store a unicode integer on a computer is called an
“encoding”.
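As a small illustration of the difference between the unicode integers and their encoding, here is a sketch assuming Ruby 1.9 or later. The method names are hypothetical; codepoints, pack, encode and bytes are standard String/Array methods.

```ruby
# Hypothetical helper names, standard library calls.
def code_points(str)
  str.codepoints.to_a          # the unicode integers, one per character
end

def from_code_point(n)
  [n].pack("U")                # build a UTF-8 string from one integer
end

word = "ha\u00EFr"             # "haïr"
p code_points(word)            # [104, 97, 239, 114]

# Same four integers, two different ways of storing them:
p word.bytes.to_a                        # UTF-8:      [104, 97, 195, 175, 114]
p word.encode("ISO-8859-1").bytes.to_a   # ISO-8859-1: [104, 97, 239, 114]
```

The integers stay the same; only the bytes on disk change with the encoding.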

For instance, you could use 4 bytes to store each unicode integer. In
that system, a series of unicode integers is very easy for ruby to
parse: every 4 bytes represents one unicode integer(which in turn
represents one character). If ruby blindly reads 4 byte chunks, then
each 4 byte chunk will be one unicode integer.

But you don’t need 4 bytes to store, say, the unicode integer 60 because
three of those bytes would be empty. In fact, for all unicode integers
under 256 (which cover the basic Latin alphabet plus most Western
European accented letters), three out of the four bytes would always be
empty. Enter the UTF-8 encoding.
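The 4-bytes-per-integer scheme described above exists as UTF-32, so the wasted bytes can be seen directly. A sketch, assuming Ruby 1.9 or later; `fixed_width_chunks` is a made-up name.

```ruby
# fixed_width_chunks is a hypothetical helper.
# UTF-32BE stores every code point in exactly 4 bytes, big-endian.
def fixed_width_chunks(str)
  str.encode("UTF-32BE").bytes.to_a.each_slice(4).to_a
end

chunks = fixed_width_chunks("B\u00E0D")   # "BàD"
p chunks
# [[0, 0, 0, 66], [0, 0, 0, 224], [0, 0, 0, 68]]

# Reading blindly in 4-byte chunks recovers the unicode integers:
p chunks.map { |b| b.pack("C4").unpack("N").first }   # [66, 224, 68]
```

Three characters cost 12 bytes here, and nine of those bytes are zero.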

The UTF-8 encoding uses a variable number of bytes to store unicode
integers on your computer. For unicode integers below 128 (plain ASCII),
UTF-8 stores them in 1 byte, and for larger unicode integers, UTF-8
stores them in 2, 3, or 4 bytes. The French accented letters take 2
bytes each. But then how does ruby know how many bytes it should read
for each unicode integer?
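To see the variable widths for yourself, something like the following should do, assuming Ruby 1.9 or later. `utf8_width` is a hypothetical helper; the sample characters are my own picks.

```ruby
# utf8_width is a hypothetical helper name.
def utf8_width(char)
  char.encode("UTF-8").bytesize
end

# "A", "é", the euro sign, and an emoji
["A", "\u00E9", "\u20AC", "\u{1F600}"].each do |c|
  puts "U+%04X takes %d byte(s)" % [c.ord, utf8_width(c)]
end
# U+0041 takes 1 byte(s)
# U+00E9 takes 2 byte(s)
# U+20AC takes 3 byte(s)
# U+1F600 takes 4 byte(s)
```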

Well, UTF-8 has a tricky way of signaling to ruby where each unicode
integer ends: the leading bits of the first byte of each character say
how many bytes that character occupies. As long as you tell ruby that it
is reading unicode integers stored in the UTF-8 format, then ruby will be
able to sort out where one unicode integer ends and the next one
begins, even though some of the unicode integers will be stored in 1
byte and others will be stored in 2, 3, or 4 bytes.
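For the curious, here is a simplified sketch of how a reader can find the boundaries: in UTF-8 the high bits of the first byte of each character give the length of the sequence. Both method names are invented, and this does not validate the continuation bytes, so treat it as an illustration rather than a real decoder.

```ruby
# Hypothetical helpers illustrating UTF-8 lead bytes (no full validation).
def utf8_sequence_length(lead_byte)
  if lead_byte < 0x80
    1                                   # 0xxxxxxx: plain ASCII
  elsif lead_byte >> 5 == 0b110
    2                                   # 110xxxxx
  elsif lead_byte >> 4 == 0b1110
    3                                   # 1110xxxx
  elsif lead_byte >> 3 == 0b11110
    4                                   # 11110xxx
  else
    raise ArgumentError, "not a UTF-8 lead byte: #{lead_byte}"
  end
end

# Group a flat array of bytes into one sub-array per character.
def split_utf8(bytes)
  groups = []
  i = 0
  while i < bytes.length
    n = utf8_sequence_length(bytes[i])
    groups << bytes[i, n]
    i += n
  end
  groups
end

p split_utf8("B\u00E0D".bytes.to_a)     # [[66], [195, 160], [68]]
```

The two bytes 195 and 160 are the ones that showed up as two garbage characters when the regex was matching byte by byte.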

That is my current mental model of how unicode works. I hope it helps.
