On Jun 21, 2006, at 3:32 AM, David B. wrote:
be implementing the highlighter in C rather than in Ruby so I’ll be
interested to see how you go with it.
The main difference in the API is that you won’t specify the store,
index and term_vector parameters per document field any more. This
option will still be available but the behaviour will be slightly
different. I’ll go into more detail later.
How close is what you’re going to be doing to the Lucene contrib
FWIW, the KinoSearch Highlighter uses similar techniques for adding
tags and encoding, but the excerpt selection is pretty different. No
TokenStream required, it uses a heat map. Right now it requires that
the field have term vectors stored with positions and offsets, but it
could be adapted to generate the vectors by re-analyzing.
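To make the heat-map idea concrete, here is a minimal sketch of excerpt selection driven by term-vector offsets. It is not the KinoSearch code: the function name `best_excerpt_start`, the `(start_offset, end_offset, weight)` hit format, and the per-character scoring are all assumptions chosen for illustration.

```python
def best_excerpt_start(hits, text_length, excerpt_length=200):
    """Return the start offset of the hottest window in the field text.

    hits: list of (start_offset, end_offset, weight) tuples, one per
          matched term occurrence (hypothetical format, as would be
          derived from term vectors stored with offsets).
    """
    if not hits or text_length <= 0:
        return 0

    # Heat map: one slot per character, raised wherever a hit lies.
    heat = [0.0] * text_length
    for start, end, weight in hits:
        for i in range(max(start, 0), min(end, text_length)):
            heat[i] += weight

    window = min(excerpt_length, text_length)

    # Slide a fixed-width window across the map, keeping a running total.
    current = sum(heat[:window])
    best_total, best_start = current, 0
    for start in range(1, text_length - window + 1):
        current += heat[start + window - 1] - heat[start - 1]
        if current > best_total:
            best_total, best_start = current, start
    return best_start
```

The window with the greatest total heat wins, so a cluster of nearby hits beats an isolated one. A real highlighter would presumably also snap the excerpt to word or sentence boundaries, which this sketch leaves out.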
The principal advantage it has over the Lucene Highlighter is that it
handles phrases properly:
http://xrl.us/nm2z (Link to www.lucenebook.com)
http://xrl.us/nm25 (Link to www.rectangular.com)
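As a rough illustration of what handling phrases properly can mean when positions are stored: highlight a term only where the whole phrase occurs at consecutive token positions. The sketch below assumes a hypothetical `term_positions` mapping of term to `(position, start_offset, end_offset)` tuples; it is not taken from either highlighter.

```python
def phrase_hits(term_positions, phrase):
    """Return (start_offset, end_offset) spans for each full phrase match.

    term_positions: dict mapping term -> list of
                    (position, start_offset, end_offset) tuples
                    (hypothetical stand-in for stored term vectors).
    phrase: list of terms in query order.
    """
    if not phrase or any(term not in term_positions for term in phrase):
        return []

    # Index each term's occurrences by token position for quick lookup.
    by_pos = [
        {pos: (start, end) for pos, start, end in term_positions[term]}
        for term in phrase
    ]

    spans = []
    for pos, (start, _end) in sorted(by_pos[0].items()):
        # A match needs term i at position pos + i for every i.
        if all(pos + i in by_pos[i] for i in range(len(phrase))):
            last_end = by_pos[-1][pos + len(phrase) - 1][1]
            spans.append((start, last_end))
    return spans
```

With the text "new york is not york new" and the phrase "new york", only the first two tokens form a span; the later, reversed occurrences are ignored instead of being highlighted term by term.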
Whatever algorithm we choose for Lucy, I hope it will meet that
Highlighter.pm isn’t that long (384 lines including docs) and if I
didn’t have any serious deadlines bearing down, doing a Ruby version
would be a great exercise for me. If you or Marcus want to check it
out, the new version’s only in subversion:
http://xrl.us/nm28 (Link to www.rectangular.com)