Searching with Chinese chars

Hi all,

maybe not a Ferret question, but I assume someone here might have come
across this already.

I wrote a simple CGI app that adds docs to a Ferret index. The idea
is to test input and searching for Asian languages.

The script that does the input seems to be OK. As David mentioned in a
question I asked a little while ago, Ferret’s index is agnostic, in the
sense that you can store anything in it. I then wrote another script to
search the index it creates. This is what it looks like:

####################################

#!/usr/bin/ruby

$KCODE = 'u'
require 'cgi'
require 'ferret'
include Ferret

index = Index::Index.new(:path => '/var/index', :default_field => "*")

cgi = CGI.new("html4")

result = ""
if cgi['query'] and not cgi['query'].empty?
  index.search_each(cgi['query']) do |doc, score|
    result << "#{index[doc]['tileid']} #{index[doc]['title']} #{index[doc]['description']}\n"
  end
end

####################################

It’s A-OK for searching English. But when trying to input Chinese
characters in the “query” field, I’m getting the following error in my
lighttpd log file:

####################################
/var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15:in
`search_each’: : Error occured at <analysis.c>:701 (Exception)
Error: exception 2 not handled: Error decoding input string. Check
that you have the locale set correctly
from /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15
####################################

Is the error message above suggesting I should specify a Chinese
locale and not UTF-8? I thought UTF-8 would actually handle Chinese
and anything else one could throw at it, as long as it’s a human
language.

Any help is appreciated.

On 7/18/06, Julio Cesar O. [email protected] wrote:

The error is being raised when the analyzer tries to tokenize the
query string. My guess would be that the query string either starts out in
the wrong encoding (when you type it in) or it gets converted
somewhere between being typed in the browser and going into your
script. UTF-8 can certainly handle Chinese characters if they are
UTF-8 encoded, but there are other encodings for Chinese as well. If I
were trying to debug this, the first thing I’d do is log the query
string to a file and check its encoding. Something like:

File.open("query.log", "w") {|f| f.write(cgi['query'])}

If you want, send me the file and I’ll try and see what encoding it is.
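If you’d rather check the bytes yourself, a rough (untested) sketch along the same lines is to dump them in hex; UTF-8 Chinese characters are typically three bytes each, GB2312/GBK two (query_bytes.log below is just an arbitrary file name for this sketch):

# untested sketch: log the raw bytes of the query in hex so the encoding is easier to spot
bytes = cgi['query'].unpack('C*')
File.open("query_bytes.log", "w") do |f|
  f.puts(bytes.map { |b| format('%02x', b) }.join(' '))
end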

Cheers,
Dave

The error is being raised when the analyzer tries to tokenize the
query string. My guess would be that the query string either starts out in
the wrong encoding (when you type it in)

Didn’t get that bit.

or it gets converted
somewhere between being typed in the browser and going into your
script.

Umm… maybe yes.

UTF-8 can certainly handle Chinese characters if they are
UTF-8 encoded, but there are other encodings for Chinese as well. If I
were trying to debug this, the first thing I’d do is log the query
string to a file and check its encoding. Something like:

File.open("query.log", "w") {|f| f.write(cgi['query'])}

If you want, send me the file and I’ll try and see what encoding it is.

I wrote another script that does just that (writes cgi['query'] to
/tmp/query.log). After typing this Chinese string into a text field named
“query” and submitting it:

新闻

This is what appears in the /tmp/query.log

&#26032;&#38395;

Note that the only thing I did, hoping to have everything magically
working in UTF-8, was to put this in my script:

$KCODE = 'u'

Anything I’m missing?

On 7/18/06, Julio Cesar O. [email protected] wrote:

Note that the only thing I did, hoping to have everything magically
working in UTF-8, was to put this in my script:

$KCODE = 'u'

Anything I’m missing?

dbalmain@ubuntu:~/ $ irb -Ku
irb(main):001:0> require 'cgi'
=> true
irb(main):002:0> CGI.unescapeHTML("&#26032;&#38395;")
=> "新闻"

That should fix your problem.
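Applied to the CGI script from earlier in the thread, the decode step might look roughly like this (just a sketch, reusing that script’s variable and field names):

# browsers submit characters they cannot represent in the form charset as
# HTML numeric entities (e.g. &#26032;), so decode them back before searching
query = CGI.unescapeHTML(cgi['query'])
index.search_each(query) do |doc, score|
  result << "#{index[doc]['title']} #{index[doc]['description']}\n"
end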

Dave

Yep, it did. Thanks tons!

But I’m not getting any results now. I take it this is because of the
default analyzer being used, right?

How can I use a whitespace analyzer in my query? (Or something that
could work effectively with Asian languages.)

For my needs, I suppose the whitespace one could do…

On 7/19/06, Julio Cesar O. [email protected] wrote:

Yep, it did. Thanks tons!

But I’m not getting any results now. I take it this is because of the
default analyzer being used, right?

How can I use a whitespace analyzer in my query? (Or something that
could work effectively with Asian languages.)

For my needs, I suppose the whitespace one could do…

index = Index::Index.new(:path => '/var/index', :default_field => "*",
                         :analyzer => Ferret::Analysis::WhiteSpaceAnalyzer.new)

Although you should probably use the same analyzer I gave you for
indexing:

http://www.ruby-forum.com/topic/72086#101764

Cheers,
Dave

Thanks, and sorry. I checked the documentation for Index::Index and
found it right after I asked the question. My bad.

I’m getting segfaults when trying to initialize an index using an
analyzer other than the default one (it works otherwise). But as I can
see in this thread

http://www.ruby-forum.com/topic/71620

it ain’t stable yet on 64-bit. So I’ll wait.

Thanks again.

Just sharing my experience and asking another question.

I tried the analyzer suggested here:
How to add Asia token analyzer to ferret simply? - Ferret - Ruby-Forum. It works fine if you
specify the search field you want to use (anyway, it seems that’s how
it’s supposed to work).

CODE

analyzer =
  Ferret::Analysis::PerFieldAnalyzer.new(Ferret::Analysis::StandardAnalyzer.new)
analyzer["chinese"] = Ferret::Analysis::RegExpAnalyzer.new(/./, false)

index = Index::Index.new(:path => '/var/index', :analyzer => analyzer,
                         :default_field => "*")

index.search_each("chinese: #{val}") do |doc, score| # val is a Chinese character
  puts "#{doc} - #{score}"
end

END CODE

This works OK. However, if you try searching like this:

CODE

index.search_each(val) do |doc, score| # val is a Chinese character
  puts "#{doc} - #{score}"
end

END CODE

I get in my lighttpd error log:

/var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:19:in
`search_each’: : Error occured at <analysis.c>:701 (StandardError)
Error: exception 2 not handled: Error decoding input string. Check
that you have the locale set correctly
from /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:19

Which MAKES SENSE, since the docs I created earlier were added like
this:

doc = { "author" => "englishchars", "title" => "more regular chars",
        "chinese" => "新闻" }
index << doc

and I think search_each is going through all the fields (since I
explicitly said it should when I set :default_field => "*" up
there), finding English chars, and trying to match them against the
Chinese ones I supplied as a search query.

So alright, I can use the suggested analyzer. But my question is: is
there a way to use an analyzer that would work with both character
types (English and Asian), simply returning no matches in that case
instead of giving me an error?

Thanks a ton for any help.

On 7/19/06, Julio Cesar O. [email protected] wrote:

This works OK. However, if you try searching like this:
`search_each’: : Error occured at <analysis.c>:701 (StandardError)
and I think search_each is going through all the fields (since I
explicitly said it should when I set :default_field => "*" up
there), finding English chars, and trying to match them against the
Chinese ones I supplied as a search query.

Actually, it’s not because there is a comparison between Chinese
and English characters. That shouldn’t cause an error. The error is
being thrown because val can’t be decoded by the StandardAnalyzer.
Again, you need to check that val is correctly encoded and that you have
your locale set correctly. The only times tokenizing happens are when
you add documents to the index and when you run a query through the
query parser. Apart from that, all operations on strings are done at
the byte level. I hope that makes sense.

So alright, I can use the suggested analyzer. But my question is: is
there a way to use an analyzer that would work with both character
types (English and Asian), simply returning no matches in that case
instead of giving me an error?

Thanks a ton for any help.

The answer to this question is that it already should work correctly.
Just make sure the locale is set correctly when the search method is
called and that whatever you pass as a query to the search method is
correctly encoded according to the locale.
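For example, something along these lines (a sketch only; the exact locale string is system-dependent, and the unescapeHTML call is the one from earlier in the thread):

ENV['LANG'] = 'en_US.UTF-8'              # make sure Ferret sees a UTF-8 locale
query = CGI.unescapeHTML(cgi['query'])   # and that the query itself is UTF-8
index.search_each(query) { |doc, score| puts "#{doc} - #{score}" }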

Cheers,
Dave

Reply to myself: yes:

ENV['LANG'] = 'en_US.utf8'

Did the job.

Thanks!

Does it take anything other than simply:

$KCODE = 'u'

right at the beginning of the script?

I have that in place already.

(it’s CGI we’re talking about)