Indexing multiple languages with acts_as_ferret

I have an application where I’m indexing content in a number of
languages, all encoded in UTF-8. I would think that having my locale
set to en_US.utf8 would be sufficient to make this work in
ferret/acts_as_ferret, but I keep running into problems, even with
English text.

What have others done to cope with these encoding difficulties?

-ryan

Hi,

I’m using Ferret (not acts_as_ferret, but this shouldn’t matter) to index
content in German, English, Polish, Japanese, Chinese, French … all in
UTF-8, and I haven’t had any problems with it yet :) (using Ferret 0.9.4
and 0.9.5).

Ben

On 8/22/06, Ryan K. wrote:

I have an application where I’m indexing content in a number of
languages, all encoded in UTF-8. I would think that having my locale
set to en_US.utf8 would be sufficient to make this work in
ferret/acts_as_ferret, but I keep running into problems, even with
English text.

What have others done to cope with these encoding difficulties?

-ryan

Hi Ryan,

Usually these problems stem from adding data that you think is UTF-8
but is actually ISO-8859-1. The best solution is to make sure all data
added to Ferret really is UTF-8. This may require some data
conversion. See the Iconv class in the standard library.
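
As a rough sketch of that conversion step: on current Ruby versions Iconv is no longer part of the standard library (it was removed in Ruby 2.0), so the example below uses String#encode instead. On a 2006-era Ruby the equivalent call would be Iconv.conv('UTF-8', 'ISO-8859-1', text). The helper name to_utf8 is made up for illustration, and it assumes that anything which is not already valid UTF-8 is ISO-8859-1, which may not hold for your data.

```ruby
# Hypothetical helper (not part of Ferret): returns a valid UTF-8 copy of text.
# Assumption: input that is not valid UTF-8 is ISO-8859-1.
def to_utf8(text)
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the raw bytes as ISO-8859-1 and transcode them to UTF-8.
  text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1 = "caf\xE9".b     # "café" as raw ISO-8859-1 bytes
clean  = to_utf8(latin1) # safe to hand to the indexer
```

Running every field through a check like this before it is added to the index means the index only ever sees one encoding.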

Ferret 0.10.0 is a little more lenient on encoding errors, i.e. it
handles them silently. It is up to you to make sure it gets the
correct encoding. If you pass in ISO-8859-1 when the locale is set to
handle UTF-8, all non-ASCII characters will be treated as letters,
which is often (but not always) what you want.

Cheers,
Dave

On 9/18/06, Frank wrote:

Hi Ben,
Have you modified any of the Ferret code? I have also used Ferret to index
CJK (Chinese, Japanese, Korean) languages, all of which are encoded in
UTF-8, but I cannot get them to be searched correctly.

Frank

Hi Frank,

Someone else had this problem earlier. I think the Chinese characters
were being escaped by the browser. Are you running your searches
through a browser? If so, you may need to call CGI.unescape on the
query string. At any rate, the first thing I would check is the actual
query string that you are passing to Ferret. Make sure it looks like
you would expect it to and that it really is UTF-8, not some other
Chinese character encoding.
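
A minimal sketch of that check, assuming the query arrives percent-escaped from the browser. The helper name decode_query is made up for illustration. It unescapes the parameter with CGI.unescape and refuses anything that is not valid UTF-8, so a GBK or Big5 query fails loudly instead of silently matching nothing.

```ruby
require "cgi"

# Hypothetical helper (not part of Ferret): unescapes a percent-encoded
# query parameter and raises unless the result is valid UTF-8.
def decode_query(raw)
  query = CGI.unescape(raw).dup.force_encoding(Encoding::UTF_8)
  unless query.valid_encoding?
    raise ArgumentError, "query is not valid UTF-8: #{query.inspect}"
  end
  query
end

query = decode_query("%E4%B8%AD%E6%96%87") # UTF-8 bytes for two Chinese characters
```

Logging the decoded string before passing it to the search call is a cheap way to confirm that it looks the way you expect.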

cheers,
Dave
