Forum: Ferret "ö" causes find_by_contents not to return

Star B. (Guest)
on 2007-03-18 22:11
I've installed ferret 0.10.9 together with the latest acts_as_ferret
using Windows XP and indexed a location database (geonames.org) with
Location.rebuild_index. The data is in utf-8.

Now calling Location.find_by_contents "ö" does not return a result,
causes a lot of CPU load, and finally exits with the error "index.rb:702:
in 'parse': failed to allocate memory (NoMemoryError)". It seems to be a
problem in 'process_query'.

I sometimes get similar results for other German umlauts...
David B. (Guest)
on 2007-03-20 04:44
(Received via mailing list)
On 3/19/07, Star B. <removed_email_address@domain.invalid> wrote:
> I've installed ferret 0.10.9 together with the latest acts_as_ferret
> using Windows XP and indexed a location database (geonames.org) with
> Location.rebuild_index. The data is in utf-8.
>
> Now calling Location.find_by_contents "ö" does not return a result,
> causes a lot of CPU load, and finally exits with the error "index.rb:702:
> in 'parse': failed to allocate memory (NoMemoryError)". It seems to be a
> problem in 'process_query'.
>
> I sometimes get similar results for other German umlauts...

Unfortunately Ferret doesn't come with UTF-8 support in Windows as the
win32 runtime environment doesn't seem to support UTF-8. You will
therefore need to write your own analyzer on Windows if you want to
support UTF-8 searches.

Hopefully the NoMemoryError will be fixed in the next win32 gem I
release.
Thomas S. (Guest)
on 2007-03-21 15:18
David B. wrote:
>
> Unfortunately Ferret doesn't come with UTF-8 support in Windows as the
> win32 runtime environment doesn't seem to support UTF-8. You will
> therefore need to write your own analyzer on Windows if you want to
> support UTF-8 searches.
>

Hello Star B.,

If you're planning to write your own UTF-8 analyzer, consider the
unpack/pack duo:

# Re-encode the UTF-8 string as single bytes (only works for code points below 256)
single_byte_string = utf8_encoded_string_from_db.unpack("U*").pack("C*")
@index << {:content => single_byte_string}
@index.search_each('content:Behörde') {|id, score| do_sth }

I haven't tried this with acts_as_ferret, but in plain Ruby it worked in my case.
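
For anyone wondering what the unpack/pack duo actually does to the bytes, here is a minimal sketch in plain Ruby. It assumes Ruby 1.9 or later (for the "\u" escape and force_encoding), and the helper name utf8_to_single_bytes is made up for illustration; it is not part of Ferret or acts_as_ferret.

```ruby
# Hypothetical helper, not part of Ferret or acts_as_ferret.
def utf8_to_single_bytes(str)
  # unpack("U*") decodes the UTF-8 string into an array of code points;
  # pack("C*") writes each code point back as one unsigned 8-bit byte.
  str.unpack("U*").pack("C*")
end

utf8 = "Beh\u00F6rde"                 # "Behörde", 8 bytes in UTF-8
converted = utf8_to_single_bytes(utf8) # 7 bytes, one per character

# The umlaut (U+00F6) is now the single byte 0xF6, which is its
# ISO-8859-1 (Latin-1) encoding.
round_trip = converted.dup.force_encoding("ISO-8859-1").encode("UTF-8")
```

So the trick amounts to a UTF-8 to Latin-1 conversion, which is presumably why it works for German text. The query string would need to go through the same conversion before searching.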
Julio Cesar O. (Guest)
on 2007-03-22 02:10
(Received via mailing list)
I tried this with a UTF-8 encoded string (Japanese):

"\u304A\u308C\u3068\u9B5A".unpack("U*").pack("C*")

Which gives me this in return:

"u304Au308Cu3068u9B5A"

And that's not what I want stored in my index, right?

Now I'm pretty sure I'm doing something dumb :-) Hopefully someone can
clarify.

Thanks.
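
A possible explanation, sketched here under the assumption of Ruby 1.9 or later: Ruby 1.8 has no "\u" escape, so the literal above is just the ASCII text "u304Au308Cu3068u9B5A", and unpack/pack hands it back unchanged. With real Japanese characters the trick breaks in a different way, because pack("C*") keeps only the low 8 bits of each code point.

```ruby
# In Ruby 1.9+ "\u304A" is a real escape, so this is four Japanese characters.
japanese = "\u304A\u308C\u3068\u9B5A"

code_points = japanese.unpack("U*")
# All four code points are far above 255 ...
above_latin1 = code_points.all? { |cp| cp > 255 }

# ... so pack("C*") silently truncates each one to its low byte.
truncated_bytes = code_points.pack("C*").bytes
low_bytes = code_points.map { |cp| cp & 0xFF }

# Plain ASCII text (what Ruby 1.8 sees for the literal) round-trips as-is.
ascii = 'u304Au308Cu3068u9B5A'
ascii_round_trip = ascii.unpack("U*").pack("C*")
```

Either way, the unpack/pack approach only seems usable for text whose characters all fit in Latin-1; Japanese would need a different analyzer.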