Is there any full-text search scheme that supports UTF-8?

charlie · July 5, 2006, 4:22pm

Is there any full-text search scheme that supports UTF-8, especially
for Asian languages such as Chinese and Japanese?
Ferret/acts_as_ferret does not work when keywords in these languages are
searched. It is also difficult to implement pagination, which needs
both the count of search results and an offset.
I would be very grateful for any help!

charlie · July 6, 2006, 5:34am

Hi Charlie,

Ferret will work fine on Asian Languages. You just need to write your
own Analyzer which matches tokens correctly for the language you are
interested in. Have a look at the RegExpAnalyzer in Ferret. You can
look at test/unit/analysis/ctc_analyzer.rb to see how it works.
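
To make the idea of an analyzer that "matches tokens" with a regular expression concrete, here is a minimal plain-Ruby sketch. It does not use Ferret at all: the `Token` struct and the `regexp_tokens` helper are hypothetical names invented for this illustration, and they only mimic the general behaviour (every regex match becomes one token, optionally lowercased, with its character offsets recorded).

```ruby
# Hypothetical helper, NOT part of Ferret: a token is whatever the pattern matches.
Token = Struct.new(:text, :start_offset, :end_offset)

def regexp_tokens(text, pattern, lower: true)
  tokens = []
  text.scan(pattern) do
    m = Regexp.last_match                 # match data for the current hit
    term = lower ? m[0].downcase : m[0]   # optionally normalise case
    tokens << Token.new(term, m.begin(0), m.end(0))
  end
  tokens
end

# Usage: runs of ASCII letters become lowercased tokens.
regexp_tokens("Full-Text Search", /[A-Za-z]+/).map(&:text)
# => ["full", "text", "search"]
```

Changing only the pattern changes what counts as a token, which is why a different language mostly needs a different regular expression rather than a different search engine.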

Cheers,
Dave

charlie · July 7, 2006, 4:58am

Hi David,
Can you give me an example of how to add an analyzer to Ferret for Asian
languages?
My web application has to support multi-language search, which
means that, for example, both Chinese and English will be searched through
the same form.
For now I have decided to use a simple tokenization rule: every Chinese
character is its own token. This is not ideal in some cases, but the
database columns to be full-text searched hold at most a few dozen
UTF-8 characters, so I think it will work well enough.
Thanks a lot!

charlie · July 7, 2006, 2:14pm

require 'ferret'            # assumes the ferret gem is installed
include Ferret::Analysis

# Create a PerFieldAnalyzer (AKA PerFieldAnalyzerWrapper) which
# defaults to Standard
analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)

# Add a special character analyzer for the chinese field or
# whatever field it is that has chinese characters.
# This splits the data into single characters.
analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

There you have it. Pretty simple.
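
For readers without Ferret installed, the per-field idea can be illustrated with a small plain-Ruby sketch. It is not how Ferret implements it: `PerFieldTokenizer` and the two lambdas are hypothetical stand-ins showing only the dispatch logic, where a default tokenizer is used unless a field has its own.

```ruby
# Plain-Ruby illustration of per-field dispatch; NOT Ferret's implementation.
class PerFieldTokenizer
  def initialize(default)
    @default = default      # used for any field without a specific tokenizer
    @by_field = {}
  end

  # Register a tokenizer for one field name (string or symbol).
  def []=(field, tokenizer)
    @by_field[field.to_s] = tokenizer
  end

  def tokens(field, text)
    @by_field.fetch(field.to_s, @default).call(text)
  end
end

words = ->(text) { text.downcase.scan(/[a-z0-9]+/) }  # stand-in for a word analyzer
chars = ->(text) { text.scan(/./) }                   # one token per character

tokenizer = PerFieldTokenizer.new(words)
tokenizer["chinese"] = chars

tokenizer.tokens("title", "Full-Text Search")  # => ["full", "text", "search"]
tokenizer.tokens("chinese", "全文检索")          # => ["全", "文", "检", "索"]
```

Note that `/./` also turns spaces and punctuation into tokens, which is usually harmless for short fields but worth keeping in mind for longer text.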

Cheers,
Dave