Ruby Forum Ferret > Handling Carriage Returns

Posted by S D (Guest)
on 28.04.2008 09:05
(Received via mailing list)
It's my understanding that the tokens in a token_stream consist of text
along with start/stop positions that represent the byte positions of the
text within the corresponding document field. The documentation I've
been reading (i.e., O'Reilly - Ferret - page 67) suggests that these
byte positions represent positions within the entire field, but based on
my testing it appears that the byte positions are with respect to the
line that contains the corresponding text within the field. I read my
fields following Brian McCallister:

      index.add_document :file => path,
                         :content => file.readlines


Hence, if I have a file that contains carriage returns, the token
positions will be reset with each new line. For example, the following
file contents (File A)
          this is a sentence
will result in a token for the text "sentence" with start position equal
to 10 (assume "this" starts in position 0), while a file with a carriage
return
          this is a
          sentence
will result in a token for the text "sentence" with start position equal
to 0. I get the same results for my custom tokenizer as well as
StandardTokenizer. The above does not seem consistent with the
documentation, but more importantly, it seems that global positions are
more useful than line-based positions (e.g., for highlighting).
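
To make the difference concrete, here is a minimal sketch in plain Ruby
that does not use Ferret at all. The byte_offsets helper is hypothetical
and only splits on whitespace; it is meant to illustrate whole-string
versus per-line offsets, not to reproduce what StandardTokenizer does.

```ruby
# Hypothetical helper (not part of Ferret): returns [text, start, stop]
# byte offsets for every whitespace-delimited word in str.
def byte_offsets(str)
  offsets = []
  str.scan(/\S+/) do |word|
    # Bytes preceding the current match give its start offset.
    start = Regexp.last_match.pre_match.bytesize
    offsets << [word, start, start + word.bytesize]
  end
  offsets
end

text = "this is a\nsentence"

# Offsets relative to the whole string ("global" positions).
whole = byte_offsets(text)

# Offsets computed separately for each line, so they restart at 0.
per_line = text.lines.flat_map { |line| byte_offsets(line) }

p whole.last     # ["sentence", 10, 18]
p per_line.last  # ["sentence", 0, 8]
```

The second result matches what I am seeing: as soon as the text is
handled one line at a time, "sentence" starts at 0 again.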

Digging a little deeper, it seems that the tokenizer's initialize method
is called each time the token_stream method of the containing analyzer
is called:

class CustomAnalyzer
  # Returns a fresh tokenizer for the string passed in.
  def token_stream(field, str)
    StandardTokenizer.new(str)
  end
end

Am I missing something here? Are the start/stop byte positions intended
to be with respect to the line? Is there a way for token_stream to only
be called once for an entire string sequence (even if carriage returns
are contained)?

Thanks,
John
Posted by Jens Krämer (jkraemer)
on 28.04.2008 12:37
(Received via mailing list)
Hi,

File.readlines returns an array, which I think is the root cause of the
problem. Using File.read instead should solve it.
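
As a rough illustration of the difference (plain Ruby only, with a
temporary file standing in for your document; the "ferret_demo" name is
just a placeholder):

```ruby
require "tempfile"

# Write a two-line sample document to a temporary file.
file = Tempfile.new("ferret_demo")
file.write("this is a\nsentence")
file.close

as_lines  = File.readlines(file.path)  # Array with one String per line
as_string = File.read(file.path)       # a single String, newline included

p as_lines   # ["this is a\n", "sentence"]
p as_string  # "this is a\nsentence"

file.unlink

# Applied to the call from the original post, the change would be
# (not run here, since it needs an open Ferret index):
#   index.add_document :file => path,
#                      :content => File.read(path)
```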

Cheers,
Jens

Posted by S D (Guest)
on 30.04.2008 07:54
(Received via mailing list)
That was it. Stupid mistake on my part.

Thanks!
John