Forum: Ruby regexp with accent insensitive ??

Davi Barbosa (dmjb)
on 2008-10-12 22:03
Hello,
Is there any way to make a regexp accent-insensitive, so that /a/ also
matches ã and Ã?

If not, can anyone suggest a solution to my problem:
I'm building a search web page with mod_ruby. I wrote an
accent/case-insensitive SQL query, and this works fine (with the
latin1_swedish_ci collation). Now I want to highlight what the user
searched for. To achieve this I'm doing something *like*:
string.gsub(/search/i,'<span class="highlight">\0</span>')
This works fine if search and the relevant part of string don't have
accents, but if there are any accents it doesn't match, so the entry is
not highlighted.

I know that with
Iconv.conv("ascii//translit","UTF-8",str)
I can remove all the accents from str, so I can remove the accents from
'search' without any problem. But if I also remove the accents from
string to do the highlighting, I need to put them back later before
displaying it to the user.
Does anyone have any idea?

Thank you
Ken Bloom (Guest)
on 2008-10-13 16:10
(Received via mailing list)
On Sun, 12 Oct 2008 15:00:29 -0500, Davi Barbosa wrote:

> if there are any accents it doesn't match, so the entry is not
> highlighted.
>
> I know that with
> Iconv.conv("ascii//translit","UTF-8",str) I can remove all the accents
> from str, so I can remove the accents from 'search' without any problem,
> but if I remove some accents from string to do the highlighting, I need
> to put it back later to display it to the user.
> Does anyone have any idea?
>
> Thank you

In that case, I would try replacing the letters that may carry accents
with periods (which match any single character) when searching. This
will give some false positives, so I would use gsub with a block to do a
more specific conditional test on each candidate.

Suppose the search is for ole (typed without the accent, while the real
hits have an accent on the e), and suppose the language allows accents
only on the letter e.

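require 'iconv'

query = 'ole'
# Replace every letter that may carry an accent with '.'.
# (On Ruby 1.8, set $KCODE = 'u' so that '.' matches a whole UTF-8
# character rather than a single byte.)
pattern = Regexp.compile(query.gsub(/e/, '.')) #=> /ol./

translit = Iconv.conv("ascii//translit", "UTF-8", query) #=> "ole"

# string is the text being searched
highlighted = string.gsub(pattern) do |match|
  # use the regular expression to get close enough, and to get
  # the actual text we're concerned about
  if Iconv.conv("ascii//translit", "UTF-8", match) == translit
    # the if test does the actual exact comparison
    "<span class=\"highlight\">#{match}</span>"
  else
    match
  end
end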

Of course, there may be some locale tricks that I'm missing that would
make this much easier.
Davi Barbosa (dmjb)
on 2008-10-13 22:26
Thank you for your answer, but I'm working with a lot of languages, so I
don't know where someone can put an accent.

For the moment, I have just discovered that I can't remove the accents
with Iconv as I said before: here it works only under irb. I described
this problem here: http://www.ruby-forum.com/topic/70827#738081
Another problem with UTF-8 under Ruby is that Ruby indexes the string by
byte, not by character. For example, 'áb'[1..1] returns the second byte
of 'á' instead of 'b'. I found a workaround using the Unicode mode of
regexp:
$KCODE = 'u'
'áb'.split(//m) == ["á", "b"]

Without these problems, I think I know how to do it without false
matches, using an ugly loop.
If str and regexp are the versions without accents, str =~ regexp gives
the position of the match and str[regexp].length gives its length. With
these two numbers, it's possible to do the highlighting in the original
string. It's something like:
ascii_string = Iconv.conv('US-ASCII//TRANSLIT','UTF-8',string)
ascii_search = Iconv.conv('US-ASCII//TRANSLIT','UTF-8',search)
regexp = Regexp.new(Regexp.escape(ascii_search),true)
position = (ascii_string =~ regexp)   # nil when there is no match
size = ascii_string[regexp].length
# Exclusive ranges (...) keep position == 0 from selecting the whole string.
highlighted = ascii_string[0...position] +
  '<span class="highlight">' + ascii_string[position...(position+size)] + '</span>' +
  ascii_string[(position+size)..-1]

Of course, it needs some modifications to put this in a loop (and I need
to use the array version of the string to index it correctly).
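
The position-mapping idea can be sketched in current Ruby without Iconv (which was removed from the standard library in Ruby 2.0). This is only an illustration under some assumptions: Ruby 2.2 or later, UTF-8 strings, and accent removal done by Unicode decomposition instead of transliteration. The helper names strip_accents and highlight_first are made up. The sketch searches an accent-free copy, remembers which original character produced each character of the copy, and cuts the original string at the mapped positions.

```ruby
# Sketch only: assumes Ruby 2.2+ and UTF-8 strings.
# strip_accents and highlight_first are hypothetical names.

# Remove combining marks after canonical decomposition.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Highlight the first accent/case-insensitive hit in the original text.
def highlight_first(text, search)
  chars  = text.unicode_normalize(:nfc).chars   # the "array version"
  plain  = chars.map { |ch| strip_accents(ch) }
  folded = plain.join
  # owner[i] is the index in chars that produced folded[i]
  owner  = plain.each_with_index.flat_map { |p, i| [i] * p.length }

  needle = strip_accents(search)
  return text if needle.empty?
  m = Regexp.new(Regexp.escape(needle), Regexp::IGNORECASE).match(folded)
  return text unless m

  first = owner[m.begin(0)]
  last  = owner[m.end(0) - 1]
  # Keep any stand-alone combining marks attached to the last letter.
  last += 1 while chars[last + 1] && strip_accents(chars[last + 1]).empty?

  chars[0...first].join +
    '<span class="highlight">' + chars[first..last].join + '</span>' +
    chars[(last + 1)..-1].join
end

puts highlight_first("S\u00E3o Paulo", "sao")
```

Only the first hit is handled, as in the snippet above; a loop over all matches would need the same index mapping for each one.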
Ken Bloom (Guest)
on 2008-10-16 03:10
(Received via mailing list)
On Mon, 13 Oct 2008 15:23:23 -0500, Davi Barbosa wrote:

> 'áb'.split(//m) == ["á", "b"]
> [...]
> position = (ascii_string =~ regexp)
> size = ascii_string[regexp].length
> highlighted = ascii_string[0..(position-1)]+'<span class="highlight">'+ascii_string[position..(position+size-1)]+'</span>'+ascii_string[(position+size)..-1]
>
> Of course, it need some modifications to put this in a loop (and I need
> to use the vector version of the string to index correctly the string).

You can use a StringScanner (require 'strscan') to do this properly in a
loop: after each match, StringScanner#pos tells you where the match
ended (subtract StringScanner#matched_size to get where it started),
whereas String#scan does not report positions at all.
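
A minimal sketch of such a loop, assuming Ruby 2.0 or later so that StringScanner#charpos is available (it counts characters, whereas pos counts bytes). The method name match_ranges is made up for the example; it only collects the character ranges of all matches, which could then be used to cut up the original string.

```ruby
require 'strscan'

# Sketch only: match_ranges is a hypothetical helper name.
# Returns the character range of every match of pattern in text.
def match_ranges(text, pattern)
  scanner = StringScanner.new(text)
  ranges = []
  while scanner.scan_until(pattern)
    len = scanner.matched.length   # match length in characters
    break if len.zero?             # an empty match would never advance
    stop = scanner.charpos         # scanner position just after the match
    ranges << ((stop - len)...stop)
  end
  ranges
end

p match_ranges("ola, Ola, hello", /ola/i)
```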

Consider whether Ruby 1.9.0 is stable enough for your purposes, because
it handles Unicode natively and should save you from needing a vector
version of the string.
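
For illustration, this is roughly what character-aware strings look like on Ruby 1.9 or later, assuming the string is UTF-8 encoded (the variable names are only for the example):

```ruby
# Assumes Ruby 1.9+ and a UTF-8 string; "\u00E1" is the letter á.
s = "\u00E1b"

byte_count = s.bytesize   # 3: á takes two bytes in UTF-8, b takes one
char_count = s.length     # 2: length is counted in characters
second     = s[1..1]      # "b": indexing works per character
letters    = s.chars.to_a # ["á", "b"], no split(//) workaround needed

p [byte_count, char_count, second, letters]
```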