How to simply add an Asian-language token analyzer to Ferret?

Hi David,
Can you give me an example of how to add an analyzer to Ferret for Asian
languages?
My web application will have to support multi-language search, which
means, for example, that both Chinese and English will be searched through
the same form.
For now I have decided on a simple tokenization principle: every Chinese
character becomes a token. Although this does not work well in some cases,
the database columns I need to full-text search contain at most a few tens
of UTF-8 characters, so I think it will work well enough.
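
In other words, the tokenization I have in mind is just one token per
character, something like this plain-Ruby sketch (assuming Ruby 1.8 with
UTF-8 data, nothing Ferret-specific yet):

$KCODE = 'u'  # Ruby 1.8: default to UTF-8, same effect as irb -KU

# Each Chinese character becomes its own token, so a query only has to
# match individual characters.
"中文字符".scan(/./u)   # => ["中", "文", "字", "符"]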
Thanks a lot!

David B. wrote:

On 7/5/06, Charlie [email protected] wrote:

Is there any full-text search scheme that supports UTF-8, especially for
Asian languages such as Chinese, Japanese, etc.?
Ferret/acts_as_ferret does not work when keywords in these languages are
searched, and it is also difficult to implement pagination, which needs
both the count of search results and an offset.
Very grateful!

Hi Charlie,

Ferret will work fine with Asian languages. You just need to write your
own analyzer which matches tokens correctly for the language you are
interested in. Have a look at the RegExpAnalyzer in Ferret. You can
look at test/unit/analysis/ctc_analyzer.rb to see how it works.
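
To get a feel for it, you can also feed text straight to an analyzer and
look at the tokens it produces; something like this under irb -KU (I'm
sketching from memory, so the exact token output may differ slightly):

require 'rubygems'
require 'ferret'
include Ferret::Analysis

# One token per character; the second argument turns lowercasing off.
analyzer = RegExpAnalyzer.new(/./, false)

# token_stream shows how a given piece of text gets tokenized.
stream = analyzer.token_stream(:content, "中文字符")
while token = stream.next
  puts token.text
end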

Cheers,
Dave

Also, the new Chinese analyzer needs to work together with the original
StandardAnalyzer.

David B. wrote:


I answered this on the rails list but just in case:

Create a PerFieldAnalyzer (AKA PerFieldAnalyzerWrapper) which defaults to
the StandardAnalyzer:

analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)

Then add a single-character analyzer for the chinese field, or whatever
field it is that has Chinese characters. This splits the data into single
characters:

analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

Thank you, Dave. I looked up the API and found that
PerFieldAnalyzerWrapper is useful for per-field analysis, especially for
queries corresponding to SQL like: select * from students where title like
'%Charlie%' and location_id = 1, where the location_id = 1 part can be
handled through the PerFieldAnalyzerWrapper.
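
Just to check my understanding, I imagine that SQL would map onto a Ferret
query roughly like this (field names and data here are only made up for
illustration):

require 'rubygems'
require 'ferret'
include Ferret::Index
include Ferret::Analysis

analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
index = Index.new(:analyzer => analyzer)

index << {:title => "Charlie goes searching", :location_id => "1"}
index << {:title => "Charlie again", :location_id => "2"}

# Roughly: select * from students where title like '%Charlie%' and location_id = 1
index.search_each('title:Charlie AND location_id:1') do |doc, score|
  puts "found document #{doc}"
end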

I have just downloaded and read the book Lucene in Action, and Chapter 4
says that the StandardAnalyzer will also split CJK text into tokens even
though there are no spaces between the characters. For example, '中文字符'
will be split into the tokens '中', '文', '字', '符', which is just what I
want. But I still cannot get any search results from Ferret. I use MySQL
as the database with everything encoded as UTF-8, and all of my Rails
sources are saved as UTF-8 as well. When I type the characters '中文字符'
into the search box, I get zero results. Can you please help with that
situation? Very grateful!

Best Regards
Charlie


Hi Charlie,

The StandardAnalyzer in Ferret works a little differently to the
StandardAnalyzer in Lucene. That’s why you need to use the
RegExpAnalyzer I gave you.

analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

You also need to make sure that this is the analyzer that is being
used by the query parser. If you are using the Index::Index class it
will handle it for you. Try this in irb:

$ irb -KU
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'ferret'
=> true
irb(main):003:0> include Ferret::Index
=> Object
irb(main):004:0> include Ferret::Analysis
=> Object
irb(main):005:0> analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
=> #<Ferret::Analysis::PerFieldAnalyzer:0xb7b2332c>
irb(main):006:0> analyzer["chinese"] = RegExpAnalyzer.new(/./, false)
=> #<Ferret::Analysis::RegExpAnalyzer:0xb7c8bdd4>
irb(main):007:0> index = Index.new(:analyzer => analyzer)
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):008:0> index << {:english => "the quick brown fox jumped over the lazy dog", :chinese => '中文字符'}
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):009:0> index << {:chinese => "the quick brown fox jumped over the lazy dog", :english => '中文字符'}
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):010:0> index.search_each("chinese:中") {|doc, score| puts "found in #{doc}"}
found in 0
=> 1
irb(main):011:0> index.search_each("english:中") {|doc, score| puts "found in #{doc}"}
=> 0
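
If you are going through acts_as_ferret rather than using Index directly,
you should be able to hand the same analyzer to the Ferret options it
forwards to the index; roughly like this, using the students table from
your SQL example (I'm going from memory on the acts_as_ferret signature,
so check it against your version):

class Student < ActiveRecord::Base
  # Sketch only: build the per-field analyzer once ...
  analyzer = Ferret::Analysis::PerFieldAnalyzer.new(
      Ferret::Analysis::StandardAnalyzer.new)
  analyzer["chinese"] = Ferret::Analysis::RegExpAnalyzer.new(/./, false)

  # ... and pass it to acts_as_ferret; as far as I remember the second
  # hash goes straight through to Ferret's index, so the analyzer is used
  # for both indexing and query parsing.
  acts_as_ferret({:fields => ['chinese', 'english']},
                 {:analyzer => analyzer})
end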


There you have it. Pretty simple.

Cheers,
Dave