How to get the words of a query

Jean-Christophe_M · August 28, 2006, 12:11am

Hi,

Using aaf (acts_as_ferret) to search pages, I wanted to present excerpts
from the texts even when more than one term was used in the search.
I got some results, despite the difficulty caused by Unicode in Ruby.
The last problem I face is getting the query words, without the
logical operator characters, if any.
Is there a clean way to get them?

–
Jean-Christophe M.

Jean-Christophe_M · August 28, 2006, 12:24pm

On Mon, Aug 28, 2006 at 12:09:27AM +0200, Jean-Christophe M. wrote:

Hi,

Using aaf to search pages, I wanted to present excerpts from the texts
even when more than one term was used in the search.
I got some results, despite the difficulty caused by Unicode in Ruby.
The last problem I face is getting the query words, without the
logical operator characters, if any.
Is there a clean way to get them?

In Ferret 0.10 there's a highlight method in the Searcher class. Maybe
that does what you want?

Jens

http://ferret.davebalmain.com/api/classes/Ferret/Search/Searcher.html#M000223

–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

Jean-Christophe_M · August 28, 2006, 2:28pm

Hi,

On 28 Aug 06, at 12:22, Jens K. wrote:

In Ferret 0.10 there's a highlight method in the Searcher class. Maybe
that does what you want?

http://ferret.davebalmain.com/api/classes/Ferret/Search/Searcher.html#M000223

Seems good; it will be perfect if your truncation respects multi-byte
characters. My Ruby helper does; see how it works on the Symétrie site
(it currently highlights only the first occurrence of each word).
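
For readers who want to see what a multi-byte-safe excerpt helper could look like, here is a minimal sketch in plain Ruby. It is not the helper mentioned above, and `excerpt_with_highlight`, its `radius` parameter, and the `<strong>` tags are all hypothetical names chosen for illustration. It assumes a Ruby version where strings are indexed by character (1.9 or later) and where `downcase` handles non-ASCII letters (2.4 or later); it also assumes downcasing does not change the string's length, which holds for most but not all scripts.

```ruby
# Hypothetical helper, not part of Ferret or aaf.
# Builds an excerpt of up to `radius` characters on each side of the first
# occurrence of `word`, wrapping that occurrence in <strong> tags.
def excerpt_with_highlight(text, word, radius: 20)
  # String#index and String#[] work on characters, not bytes, so a
  # multi-byte character is never cut in half.
  pos = text.downcase.index(word.downcase)
  return nil if pos.nil?

  start  = [pos - radius, 0].max
  finish = [pos + word.length + radius, text.length].min

  before = text[start...pos]
  match  = text[pos, word.length]
  after  = text[(pos + word.length)...finish]

  # Mark truncated ends with an ellipsis.
  prefix = start.zero? ? "" : "..."
  suffix = finish == text.length ? "" : "..."
  "#{prefix}#{before}<strong>#{match}</strong>#{after}#{suffix}"
end
```

Like the helper described above, this only handles the first occurrence of a single word; calling it once per query word would give one excerpt per term.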

I cannot find the whole code in
http://ferret.davebalmain.com/trac/browser/tags/REL-0.10.1/ext/r_search.c
to check this, though (and even if I could, I'm not sure I could check a
Unicode implementation in C :/).

I may wait for aaf to be updated for 0.10.1 before testing.

Jean-Christophe M.

Symétrie, édition de musique et services multimédia
30 rue Jean-Baptiste Say
69001 LYON (FRANCE)
tél +33 (0)478 29 52 14
fax +33 (0)478 30 01 11
web www.symetrie.com

Jean-Christophe_M · August 28, 2006, 11:54pm

On Mon, Aug 28, 2006 at 02:11:26PM +0200, Jean-Christophe M. wrote:

My Ruby helper does; see how it works on the Symétrie site
(it currently highlights only the first occurrence of each word).

I cannot find the whole code in
http://ferret.davebalmain.com/trac/browser/tags/REL-0.10.1/ext/r_search.c
to check this, though (and even if I could, I'm not sure I could check a
Unicode implementation in C :/).

I may wait for aaf to be updated for 0.10.1 before testing.

The current trunk of aaf is supposed to be 0.10.x compatible, so feel
free to try it out.

For now, though, you'd have to create your own Searcher to use the
highlighting, because aaf doesn't give you access to a Searcher instance.
Exposing one might be an interesting feature.

Jens


Jean-Christophe_M · September 1, 2006, 5:10pm

On 8/28/06, Jean-Christophe M. wrote:

My Ruby helper does; see how it works on the Symétrie site
(it currently highlights only the first occurrence of each word).

Hi Jean-Christophe,

Are you saying the highlight doesn’t respect multi-byte characters? If
so, could you give an example? The highlighter uses the byte
boundaries returned by the analyzer during indexing so I can’t see any
reason multi-byte characters wouldn’t be respected.
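
To illustrate why highlighting on analyzer-supplied byte boundaries is safe for multi-byte text, here is a small sketch in plain Ruby. It does not use Ferret; the toy tokenizer, the `Token` struct, and `highlight_by_bytes` are hypothetical stand-ins for the analyzer and highlighter, under the assumption that each token carries the byte offsets of its start and end.

```ruby
# Toy stand-in for an analyzer: each token records where it starts and
# ends in the text, measured in bytes (not characters).
Token = Struct.new(:text, :byte_start, :byte_end)

def tokenize_with_byte_offsets(text)
  tokens = []
  text.scan(/[\p{L}\p{N}]+/) do |word|
    # Everything before the current match, measured in bytes.
    byte_start = Regexp.last_match.pre_match.bytesize
    tokens << Token.new(word, byte_start, byte_start + word.bytesize)
  end
  tokens
end

# Wraps every token whose lowercased text is in `terms`, slicing the
# original text only at token byte boundaries.
def highlight_by_bytes(text, terms, pre: "<b>", post: "</b>")
  out = +""
  cursor = 0
  tokenize_with_byte_offsets(text).each do |tok|
    next unless terms.include?(tok.text.downcase)

    out << text.byteslice(cursor, tok.byte_start - cursor)
    out << pre << text.byteslice(tok.byte_start, tok.byte_end - tok.byte_start) << post
    cursor = tok.byte_end
  end
  out << text.byteslice(cursor, text.bytesize - cursor)
end
```

Because every slice starts and ends on a token boundary, no multi-byte character can be split, whatever its encoded length.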

Also, it’s quite a bit more advanced than your version (and the
version in Lucene contrib for that matter). It highlights only the
terms that match the query. So if you search for the phrase “red
truck” the terms “red” and “truck” will only be highlighted if they
appear together. If you search for “red truck”~1 then the phrase “red
fire truck” will be highlighted. It also uses a pretty clever
algorithm to find the excerpts with the most matching information.
It’s still quite experimental though so I need people to try it out
and send in their suggestions.
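
The phrase behaviour described here can be sketched in a few lines of plain Ruby. This is not Ferret's algorithm: it is a simplified, hypothetical model in which the phrase terms must appear in order and the slop is the total number of extra tokens allowed between them (real slop handling in Lucene-style engines also tolerates some reordering). The function names are made up for the example.

```ruby
# Returns the ranges of token indexes where `phrase_terms` occur in order
# with at most `slop` extra tokens in between (in total).
def sloppy_phrase_matches(tokens, phrase_terms, slop = 0)
  matches = []
  tokens.each_index do |start|
    next unless tokens[start] == phrase_terms.first

    pos = start
    extra = 0
    ok = phrase_terms.drop(1).all? do |term|
      # Look for the next term within the remaining slop budget.
      offset = (1..(slop - extra + 1)).find { |d| tokens[pos + d] == term }
      next false unless offset

      extra += offset - 1
      pos += offset
      true
    end
    matches << (start..pos) if ok
  end
  matches
end

# Wraps each matching span of whitespace-separated words in <b> tags.
def highlight_phrase(text, phrase, slop = 0)
  tokens = text.split
  hits = sloppy_phrase_matches(tokens.map(&:downcase), phrase.downcase.split, slop)
  hits.each do |range|
    tokens[range.first] = "<b>" + tokens[range.first]
    tokens[range.last] = tokens[range.last] + "</b>"
  end
  tokens.join(" ")
end
```

With a slop of 0 only the adjacent pair "red truck" is highlighted; with a slop of 1 the span "red fire truck" is highlighted as well, and isolated occurrences of "red" or "truck" are left alone in both cases.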

Cheers,
Dave

Jean-Christophe_M · August 29, 2006, 12:37am

Hi,

Thanks for your reply.

On 28 Aug 06, at 23:52, Jens K. wrote:

The current trunk of aaf is supposed to be 0.10.x compatible, feel free
to try it out.

Ah, I’ll try.

But for now you’d have to create your own Searcher to use the
highlighting, because aaf doesn’t give you access to a Searcher
instance to use. But this might be an interesting feature.

If I wanted to use the parsed query words, is there a way to get them
through aaf?
I currently use a hack:
@query = params[:query].chars.gsub(/[^\w\s]/, ' ').strip.downcase

(chars comes from unicode_hacks)
If I don’t filter out characters like ‘&’, my server goes down (memory
error in Mongrel).
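
As a point of comparison, the same kind of cleanup can be sketched in current Ruby without unicode_hacks. This is only a rough approximation, not a real query parser: `query_words` and `OPERATOR_WORDS` are hypothetical names, and the list of operator keywords (uppercase AND, OR, NOT) is an assumption about Lucene-style query syntax. Field prefixes such as `title:foo` would come out as two separate words.

```ruby
# Assumed operator keywords; adjust to the query syntax actually in use.
OPERATOR_WORDS = %w[AND OR NOT].freeze

# Hypothetical helper: extracts plain, lowercased search words from a raw
# query string, dropping operator characters and operator keywords.
def query_words(raw_query)
  raw_query
    .gsub(/[^\p{L}\p{N}\s]/, " ") # keep letters and digits of any script
    .split
    .reject { |word| OPERATOR_WORDS.include?(word) }
    .map(&:downcase)
    .uniq
end
```

Using `\p{L}` and `\p{N}` instead of `\w` keeps accented letters intact, and characters such as `&` are replaced by spaces before the words are used for highlighting.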

Jean-Christophe M.


Jean-Christophe_M · September 2, 2006, 7:10pm

Hi,

On 1 Sep 06, at 17:09, David B. wrote:

reason multi-byte characters wouldn’t be respected.

No, it was a question: I was wondering whether it respected multi-byte
characters.
It’s good news that it can handle Unicode.

Also, it’s quite a bit more advanced than your version (and the
version in Lucene contrib for that matter). It highlights only the
terms that match the query. So if you search for the phrase “red
truck” the terms “red” and “truck” will only be highlighted if they
appear together. If you search for “red truck”~1 then the phrase “red
fire truck” will be highlighted. It also uses a pretty clever
algorithm to find the excerpts with the most matching information.
It’s still quite experimental though so I need people to try it out
and send in their suggestions.

OK, I’ll try it. Until now I was using my own Ruby highlighter.

Jean-Christophe M.
