[0.10.0] Index#add_document bug with strange value?

I think I’ve found where my problem (during a big import) comes from.
Why does this silly (really silly :)) example crash?

http://pastie.caboo.se/10357

    /usr/lib/ruby/site_ruby/1.8/ferret/index.rb:211:in `add_document': IO Error occured at <except.c>:79 in xraise (IOError)
    Error occured in fs_store.c:225 - fso_flush_i
    flushing src of length -2

    from /usr/lib/ruby/site_ruby/1.8/ferret/index.rb:211:in `<<'
    from /usr/lib/ruby/1.8/monitor.rb:229:in `synchronize'
    from /usr/lib/ruby/site_ruby/1.8/ferret/index.rb:186:in `<<'
    from test.rb:13
    from test.rb:8
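
(The failing script itself is only in the pastie above; what follows is a hypothetical reconstruction based on the backtrace and the discussion below — it simply pushes random binary bytes through Index#<< with the default analyzer. The field name and index path are made up.)

    # test.rb -- hypothetical reconstruction, not the original pastie
    require 'rubygems'
    require 'ferret'

    # the default Index uses the locale-sensitive StandardAnalyzer
    index = Ferret::Index::Index.new(:path => '/tmp/crash_test')

    10.times do
      # random bytes stand in for the badly encoded file contents
      garbage = (1..100).map { rand(256).chr }.join
      index << {:content => garbage}
    end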

Hi Florent,

This is something that I still need to work on. The locale-sensitive
analyzers aren’t as robust as they could be. Try using the
AsciiStandardAnalyzer instead. Or better yet, don’t index binary data.
You can store binary data, but indexing it doesn’t usually make a lot
of sense, at least not without a custom analyzer. Having said that, I
will try to fix this.
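
For example (a minimal sketch, assuming Ferret 0.10’s Index, FieldInfos and AsciiStandardAnalyzer APIs; the field names and file path here are made up):

    require 'rubygems'
    require 'ferret'

    # use the ASCII analyzer instead of the locale-sensitive default
    analyzer = Ferret::Analysis::AsciiStandardAnalyzer.new

    # mark the binary :data field as stored but not indexed
    field_infos = Ferret::Index::FieldInfos.new
    field_infos.add_field(:data, :store => :yes, :index => :no)

    index = Ferret::Index::Index.new(:path => '/tmp/my_index',
                                     :analyzer => analyzer,
                                     :field_infos => field_infos)

    # hypothetical binary payload
    binary_blob = File.open('/path/to/binary_file', 'rb') { |f| f.read }
    index << {:title => 'a plain text title', :data => binary_blob}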

Cheers,
Dave

I totally agree with you; it’s just an example. The real data is a file
with encoding bugs, and the result is the same.

Thanks for your answer.

Just an update on this issue. I’ve now made the StandardAnalyzer more
robust so it won’t crash as easily (hopefully not at all) on bad
data. While fixing this I also changed the StandardTokenizer so that
it now tokenizes negative numbers, i.e. it will parse
“-23” as “-23” instead of just “23”.
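
A quick way to check the new behaviour (a sketch, assuming Ferret’s token_stream API; the field name and sample text are made up):

    require 'rubygems'
    require 'ferret'

    analyzer = Ferret::Analysis::StandardAnalyzer.new
    stream = analyzer.token_stream(:content, "it dropped to -23 overnight")

    # print each token the analyzer produces
    while token = stream.next
      puts token.text
    end
    # with this fix the stream should yield "-23" rather than "23"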

Cheers,
Dave

Cool! And as usual, great job, Dave!