How to handle non-ascii characters

nicksnels · January 26, 2006, 10:11pm

Hi,

For the last couple of days I’ve been trying to index some txt files. Once
they are indexed, I usually check the contents of the Ferret index with Luke.
But every time I tried to open the index I got a ‘read past EOF’ error. I
narrowed it down to the way Ferret handles non-ASCII characters. I have
one txt file with the content ‘a o b c’ and one with ‘é è ç à’. If I index
the first one I can read the index perfectly, but when I index the second
one I get the EOF error. The error occurs with the standard and whitespace
analyzers; the stop analyzer just ignores these characters. How can I solve
this so that Ferret handles these ‘special’ characters correctly? Thanks.

Kind regards,

Nick

nicksnels · January 27, 2006, 4:10am

Hi Nick,

Sorry, but this is due to an incompatibility between the Ferret and Lucene
index formats. It’s complicated, but basically Ferret counts string lengths
in bytes while Lucene sometimes uses the number of characters. I do plan to
fix this in the future, but it could be a month or two. Hope you can wait
that long.
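
To illustrate the mismatch, here is a minimal Python sketch. It uses a toy length-prefixed format, not the actual Ferret or Lucene on-disk layout, and the names `write_term` and `read_term_as_chars` are hypothetical. The writer stores a byte count; the reader assumes the same number is a character count, so it runs off the end of the data as soon as a term contains multi-byte UTF-8 characters.

```python
# Toy illustration only: NOT the real Ferret or Lucene index format.
text = "\u00e9 \u00e8 \u00e7 \u00e0"  # the four accented letters from the test file

char_count = len(text)                  # number of Unicode characters
byte_count = len(text.encode("utf-8"))  # number of bytes once UTF-8 encoded


def write_term(term: str) -> bytes:
    """Hypothetical writer: one-byte length prefix that counts BYTES."""
    data = term.encode("utf-8")
    return bytes([len(data)]) + data


def read_term_as_chars(buf: bytes) -> str:
    """Hypothetical reader: treats the same prefix as a CHARACTER count."""
    count = buf[0]
    pos = 1
    chars = []
    for _ in range(count):
        if pos >= len(buf):
            raise EOFError("read past EOF")
        lead = buf[pos]
        # Width of a UTF-8 sequence, derived from its lead byte.
        if lead < 0x80:
            width = 1
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        else:
            width = 4
        chars.append(buf[pos:pos + width].decode("utf-8"))
        pos += width
    return "".join(chars)
```

With pure ASCII such as 'a o b c', the byte and character counts agree and `read_term_as_chars(write_term("a o b c"))` round-trips. For the accented string the two counts differ, and the reader raises `EOFError` because it keeps looking for characters after the data has ended.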

Cheers,
Dave

nicksnels · January 27, 2006, 11:29am

Hi David,

Good to hear that it will be fixed in the near future. For me personally,
it doesn’t matter if it takes a month or two. I have tons of other
stuff I have to add before it is finished. Will this be around the same
time that cFerret will be ready for prime time?

Kind regards,

Nick

nicksnels · January 27, 2006, 12:22pm

Hopefully cFerret will be finished before then. I just have to finish
implementing span queries and threading, and then I’ll be ready to
start adding the Ruby bindings. The fix to make the Ferret and Lucene
indexes compatible will hopefully involve a patch to Lucene rather
than a fix to Ferret, but I may have difficulty getting it accepted. I
realize the lack of index compatibility with Lucene is a showstopper for
many people, so it’s definitely high priority and I’ll get it done one way
or another.

Dave