"ö" causes find_by_contents not to return


#1

I’ve installed ferret 0.10.9 together with the latest acts_as_ferret
using Windows XP and indexed a location database (geonames.org) with
Location.rebuild_index. The data is in utf-8.

Now calling Location.find_by_contents “ö” does not return a result,
causes a lot of CPU load, and finally exits with an error “index.rb:702:
in ‘parse’: failed to allocate memory (NoMemoryError)”. Seems a problem
in ‘process_query’.

Similar results sometimes occur for other German umlauts…


#2

On 3/19/07, Star B. wrote:

I’ve installed ferret 0.10.9 together with the latest acts_as_ferret
using Windows XP and indexed a location database (geonames.org) with
Location.rebuild_index. The data is in utf-8.

Now calling Location.find_by_contents “ö” does not return a result,
causes a lot of CPU load, and finally exits with an error “index.rb:702:
in ‘parse’: failed to allocate memory (NoMemoryError)”. Seems a problem
in ‘process_query’.

Similar results sometimes occur for other German umlauts…

Unfortunately Ferret doesn’t come with UTF-8 support in Windows as the
win32 runtime environment doesn’t seem to support UTF-8. You will
therefore need to write your own analyzer on Windows if you want to
support UTF-8 searches.

Hopefully the NoMemoryError will be fixed in the next win32 gem I
release.


#3

David B. wrote:

Unfortunately Ferret doesn’t come with UTF-8 support in Windows as the
win32 runtime environment doesn’t seem to support UTF-8. You will
therefore need to write your own analyzer on Windows if you want to
support UTF-8 searches.

Hello Star B.,

If you’re planning to write your own UTF-8 analyzer, consider the
unpack/pack duo:

utf8_encoded_string_from_db.unpack("U*").pack("C*")
@index << {:content => utf8_encoded_string_from_db}
@index.search_each('content:Behörde') {|id, score| do_sth}

I haven’t tried this with acts_as_ferret, but in plain Ruby it worked in my case.
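
For anyone following along, here is a minimal sketch of what the unpack/pack pair does, in plain Ruby with no Ferret involved. The helper name utf8_to_single_bytes is made up for illustration, and the range check is an added assumption that is not part of the snippet above. unpack("U*") decodes the UTF-8 string into code points, and pack("C*") writes each one back as a single byte, so the result is effectively ISO-8859-1 and can only represent code points up to 255.

```ruby
# Hypothetical helper, for illustration only: turn UTF-8 text into
# one byte per character. Returns nil when that is not possible.
def utf8_to_single_bytes(utf8_string)
  codepoints = utf8_string.unpack("U*")            # UTF-8 -> array of code points
  return nil if codepoints.any? { |cp| cp > 0xFF } # does not fit into one byte
  codepoints.pack("C*")                            # each code point -> one byte
end

word = "Beh\u00F6rde"   # "Behörde"; \u escapes in string literals need Ruby 1.9 or later
converted = utf8_to_single_bytes(word)

puts word.bytesize        # 8, because "ö" takes two bytes in UTF-8
puts converted.bytesize   # 7, because "ö" is now the single byte 0xF6
p utf8_to_single_bytes("\u304A\u308C")   # nil, Japanese is outside the one-byte range
```

So the trick may help with German umlauts, but it is not a general UTF-8 solution: text outside the Latin-1 range would still need an analyzer that understands multi-byte characters.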


#4

I tried this with a UTF-8 encoded string (Japanese):

"\u304A\u308C\u3068\u9B5A".unpack("U*").pack("C*")

Which gives me this in return:

"u304Au308Cu3068u9B5A"

And that’s not what I want stored in my index, right?

Now I’m pretty sure I’m doing something dumb :) Hopefully someone can
clarify.

Thanks.