Regexp with accent insensitive?

dmjb · October 12, 2008, 10:03pm

Hello,
Is there any way to make a regexp accent-insensitive (so that /a/ also
matches ã and Ã)?

If not, can anyone suggest a solution to my problem?
I’m making a search web page with mod_ruby, so I made an
accent/case-insensitive sql query and this works fine (with
latin1_swedish_ci). Now I want to highlight what the user searched for.
To achieve this I’m doing something like:
string.gsub(/search/i, '<span class="highlight">\0</span>')
This works fine if search and the relevant part of string don’t have
accents, but if there are any accents it doesn’t match, so the entry is
not highlighted.

I know that with
Iconv.conv("ascii//translit", "UTF-8", str)
I can remove all the accents from str, so I can remove the accents from
‘search’ without any problem, but if I remove some accents from string
to do the highlighting, I need to put it back later to display it to the
user.
Does anyone have any idea?

Thank you

dmjb · October 13, 2008, 4:10pm

On Sun, 12 Oct 2008 15:00:29 -0500, Davi Barbosa wrote:

if there are any accents it doesn’t match, so the entry is not
highlighted.

I know that with
Iconv.conv("ascii//translit", "UTF-8", str) I can remove all the accents
from str, so I can remove the accents from ‘search’ without any problem,
but if I remove some accents from string to do the highlighting, I need
to put it back later to display it to the user.
Does anyone have any idea?

Thank you

In that case, I would try replacing each letter that might carry an
accent with a period (which matches any single character) when building
the search pattern. This will give some false positives, so I would use
gsub with a block to run a more specific test on each candidate match.

Suppose the search is for ole (typed without the accent, while the real
hits have an accent on the e), and the text is in a language that allows
accents only on the letter e.

require 'iconv'

query = 'ole'
# replace every letter that may carry an accent with a wildcard
pattern = Regexp.compile(query.gsub(/[e]/, '.')) #=> /ol./

translit = Iconv.conv("ascii//translit", "UTF-8", query) #=> "ole"

# string is the text being searched
string.gsub(pattern) do |match|
  # use the regular expression to get close enough, and to get
  # the actual text we're concerned about
  if Iconv.conv("ascii//translit", "UTF-8", match) == translit
    # the if test does the actual exact comparison
    "<span class=\"highlight\">#{match}</span>"
  else
    match
  end
end

Of course, there may be some locale tricks that I’m missing that would
make this much easier.

dmjb · October 13, 2008, 10:26pm

Thank you for your answer, but I’m working with a lot of languages, so I
don’t know where someone can put an accent.

For the moment, I just discovered that I can’t remove the accents with
Iconv like I said before: here, it works only under irb. I described
this problem in another thread, "Iconv and incompatible encodings".
Another problem with UTF-8 under Ruby is that Ruby can’t index the
string correctly. For example, 'áb'[1..1] gives the second half of 'á'. I
discovered how to work around this using the Unicode version of regexp:
$KCODE = 'u'
'áb'.split(//m) == ["á", "b"]

Without these problems, I think I know how to do it without false
matches, using an ugly loop.
If str and regexp are the versions without accents, str =~ regexp gives
the position of the match and str[regexp].length gives its length. With
these two numbers, it’s possible to apply the highlight to the original
string. It’s something like:
require 'iconv'

ascii_string = Iconv.conv('US-ASCII//TRANSLIT', 'UTF-8', string)
ascii_search = Iconv.conv('US-ASCII//TRANSLIT', 'UTF-8', search)
regexp = Regexp.new(Regexp.escape(ascii_search), true) # true = ignore case
position = (ascii_string =~ regexp)
size = ascii_string[regexp].length
# exclusive range: 0..(position-1) would return the whole string when position is 0
highlighted = ascii_string[0...position] + '<span class="highlight">' +
              ascii_string[position, size] + '</span>' +
              ascii_string[(position + size)..-1]

Of course, it needs some modifications to put this in a loop (and I need
to use the vector version of the string, i.e. the array of characters,
to index the string correctly).
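
A minimal sketch of what that loop could look like follows. It swaps Iconv for String#unicode_normalize, so it assumes Ruby 2.2 or later rather than the 1.8 used in this thread, and the names strip_accents and highlight_matches are hypothetical. It matches on an accent-free copy and keeps a table that maps each position in that copy back to the character array of the original string.

```ruby
# Hypothetical helper: remove accents by decomposing and dropping the marks.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Hypothetical helper: highlight accent-insensitive matches of `search` in
# `string`, keeping the original (accented) characters in the output.
def highlight_matches(string, search)
  needle = strip_accents(search)
  return string if needle.empty?

  chars = string.unicode_normalize(:nfc).chars    # the "vector version"
  stripped = chars.map { |ch| strip_accents(ch) }
  plain = stripped.join                           # accent-free copy
  # origin[i] is the index in chars that produced plain[i].
  origin = stripped.each_with_index.flat_map { |s, i| [i] * s.length }
  origin << chars.length   # sentinel for a match that ends at the very end

  regexp = Regexp.new(Regexp.escape(needle), Regexp::IGNORECASE)
  result = []
  last = 0   # index in chars of the first character not yet copied
  pos = 0    # where to resume searching in plain
  while (m = regexp.match(plain, pos))
    start = origin[m.begin(0)]
    stop = origin[m.end(0)]
    result << chars[last...start].join
    result << '<span class="highlight">' << chars[start...stop].join << '</span>'
    last = stop
    pos = m.end(0)
  end
  result << chars[last..-1].join
  result.join
end
```

The position table is what lets the accents survive: the match is found in the stripped copy, but only slices of the original characters are ever written to the output.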

dmjb · October 16, 2008, 3:10am

On Mon, 13 Oct 2008 15:23:23 -0500, Davi Barbosa wrote:

Of course, it needs some modifications to put this in a loop (and I need
to use the vector version of the string to index the string correctly).

You can use a StringScanner (require 'strscan') to do this properly in a
loop: after each match, StringScanner#pos minus StringScanner#matched_size
gives you the starting position of the match, which String#scan does not
report.
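
As a rough illustration of the StringScanner loop, here is a sketch with a hypothetical helper name, match_positions. It relies on scan_until, pos and matched_size from the standard strscan library. Both pos and matched_size count bytes, so the offsets line up with character positions only when the text is plain ASCII, as the accent-stripped copy would be.

```ruby
require 'strscan'

# Hypothetical helper: returns [start, length] for every match of `regexp`
# in `text`. Offsets are in bytes.
def match_positions(text, regexp)
  scanner = StringScanner.new(text)
  positions = []
  while scanner.scan_until(regexp)
    # A zero-length match would never advance the scanner, so stop there.
    break if scanner.matched_size.zero?
    # pos points just past the match, so subtract the match size.
    positions << [scanner.pos - scanner.matched_size, scanner.matched_size]
  end
  positions
end
```

Each pair can then be used to slice the original string in the same way as the position and size variables in the earlier post.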

Consider whether Ruby 1.9.0 is stable enough for your purposes, because
it handles Unicode natively and should save you from needing to have a
vector version of the string.
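
To illustrate the difference, this short sketch shows how a UTF-8 string is indexed in Ruby 1.9 and later. The variable names are only for illustration, and the string is written with an escape so the example does not depend on the source file's encoding.

```ruby
word = "\u00e1b"                        # "áb" in UTF-8

first   = word[0]                       # a whole character, not half of one
second  = word[1]                       # the letter after the accented one
sizes   = [word.length, word.bytesize]  # characters vs. bytes
letters = word.chars                    # array of characters, no $KCODE needed
```

Here word[1] is "b", and the string has two characters stored in three bytes, so the split(//) workaround is no longer needed for indexing.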