Questions about Searching

tdellaringa · January 20, 2006, 3:19pm

Hi,

I have some questions about searching with Ferret. I have a user
index with first_name, last_name and full_name (which is just first
plus last with a space).

Here are a couple of questions:

1. If I store the fields tokenized, it appears as though queries are
   case-insensitive. However, for untokenized fields, the query is
   case-sensitive. How can I make the untokenized searches
   case-insensitive?

2. If I have a field with whitespace in it, how can I search for the
   whitespace using wildcard searches? For instance, if the full_name I
   am searching for is “John D.”, how can I build a query for that? I
   have tried numerous combinations; here are a couple I tried:
   full_name:"#{query}"* <-- This will match every field in the index
   full_name:"#{query}*" <-- This matches nothing

3. When I store the fields as untokenized, exact matches no longer seem
   to work for me. For instance, this query worked for tokenized
   first_name, but does not for untokenized first_name:
   first_name:John

   But this query will return results:
   first_name:Joh?

4. Is there a better way to search for the first and last name
   combination than storing another index with them concatenated?

Thanks,

Tom

tdellaringa · January 20, 2006, 5:14pm

On Jan 20, 2006, at 8:39 AM, Tom D. wrote:

Here are a couple of questions:

If I store the fields tokenized, it appears as though queries are
case-insensitive. However, for untokenized, the query is
case-sensitive. How can I make the untokenized searches
case-insensitive?

By lowercasing the text you index and lowercasing the text in the
query. Search matches are always case-sensitive, but tokenized fields
generally get lowercased along the way, and the query parser also
lowercases terms (generally with the same analyzer).
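
To make that concrete, here is a minimal sketch in plain Ruby rather than Ferret. A Hash stands in for an untokenized field, and NameIndex with its normalize, add_name and find_name methods are hypothetical names made up for illustration, not Ferret API. The only point is that the same lowercasing is applied on both the indexing side and the query side:

```ruby
# Toy stand-in for an untokenized field: each stored value is one exact term.
# This is NOT Ferret code; NameIndex and its methods are hypothetical.
class NameIndex
  def initialize
    @terms = Hash.new { |hash, key| hash[key] = [] }
  end

  # Apply the same normalization at index time and at query time.
  def normalize(text)
    text.strip.downcase
  end

  def add_name(doc_id, name)
    @terms[normalize(name)] << doc_id
  end

  # Exact-term lookup. The comparison itself is still case-sensitive;
  # it simply operates on lowercased text on both sides.
  def find_name(name)
    @terms.fetch(normalize(name), [])
  end
end

index = NameIndex.new
index.add_name(1, "John")
index.add_name(2, "JOHN")
index.add_name(3, "Jane")
```

With this setup, index.find_name("john") and index.find_name("JoHn") both return the ids 1 and 2, because every value was folded to lowercase before it was stored or looked up.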

If I have a field with whitespace in it, how can I search for the
whitespace using wildcard searches? For instance, if the full_name I
am searching for is “John D.”, how can I build a query for that? I
have tried numerous combinations; here are a couple I tried:
full_name:"#{query}"* <-- This will match every field in the index
full_name:"#{query}*" <-- This matches nothing

I strongly suspect the issue is the field being analyzed during query
parsing. I’m not sure what facilities Ferret has for doing this
exactly off the top of my head, but in Java Lucene there is a
PerFieldAnalyzerWrapper that helps with this. The space would be
problematic, as well as the double quotes in how you have created
it. You may need to create a WildcardQuery via the API rather than
using the parser.
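
As a rough illustration of why building the wildcard query directly sidesteps the problem: a wildcard pattern is matched against whole terms, and an untokenized field stores the entire value, space included, as a single term. The sketch below is plain Ruby, not Ferret; wildcard_to_regexp is a hypothetical helper written only for this example.

```ruby
# Hypothetical helper, not part of Ferret: translate a wildcard pattern
# into an anchored Regexp ('*' = any run of characters, '?' = exactly one).
def wildcard_to_regexp(pattern)
  parts = pattern.each_char.map do |char|
    case char
    when "*" then ".*"
    when "?" then "."
    else Regexp.escape(char)
    end
  end
  Regexp.new("\\A#{parts.join}\\z")
end

# Terms as an untokenized field would hold them: one term per value,
# whitespace and punctuation included.
terms = ["John D.", "John Smith", "Johnny Doe"]

pattern = wildcard_to_regexp("John D*")
matches = terms.select { |term| term.match?(pattern) }
```

Here matches contains only "John D.", since the space is just another character in the term. The same pattern written as "john d*" finds nothing, which is the case issue from the first question showing up again.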

When I store the fields as untokenized, exact matches seem to not
work for me anymore. For instance, this query worked for tokenized
first_name, but does not for untokenized first_name:
first_name:John

But this query will return results:
first_name:Joh?

This again has to do with the case and analyzer issue. You are
using a parser that does analysis of the text. Try using the parser
to create a Query and see what it consists of (.to_s?).
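
One reading that would be consistent with what you observed, sketched in plain Ruby: assume (and this is only an assumption about the parser, not something confirmed here) that plain terms are lowercased by the analyzer while wildcard terms are passed through untouched. Every name in this toy model is hypothetical; none of it is Ferret API.

```ruby
# Toy model only; none of this is Ferret API.
STORED_TERMS = ["John", "Jane"].freeze # untokenized, stored exactly as typed

# Assumption: the parser runs plain terms through a lowercasing analyzer...
def parse_plain_term(text)
  text.downcase
end

# ...but leaves wildcard terms untouched (also an assumption).
def parse_wildcard_term(text)
  text
end

# Exact, case-sensitive term match.
def term_hits(term)
  STORED_TERMS.select { |stored| stored == term }
end

# Case-sensitive wildcard match: '?' is one character, '*' is any run.
def wildcard_hits(pattern)
  body = Regexp.escape(pattern).gsub("\\?", ".").gsub("\\*", ".*")
  regexp = Regexp.new("\\A#{body}\\z")
  STORED_TERMS.select { |stored| stored.match?(regexp) }
end

plain_hits = term_hits(parse_plain_term("John"))
fuzzy_hits = wildcard_hits(parse_wildcard_term("Joh?"))
```

Under those assumptions plain_hits is empty, because the query term became "john" while the stored term is "John", whereas fuzzy_hits contains "John" because the capital J survived. Printing the parsed query is the way to check whether this is what is actually happening.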

Is there a better way to search for the first and last name
combination than storing another index with them concatenated?

It really all depends on what your searching needs are. What does
the user interface for searching demand?

Erik

tdellaringa · January 20, 2006, 5:35pm

Thanks Erik. Very informative. I suspect the QueryParser either has
some bugs or is not designed to handle this scenario. I will try
manually building the specific types of queries via the API.

It really all depends on what your searching needs are. What does
the user interface for searching demand?

For the full name searches, I just wanted wild card matches on the
right hand side of the query. For instance, any of these should
result in john doe being found:
J, Jo, Joh, John, John D, etc.

Tom

tdellaringa · January 24, 2006, 2:05pm

Thanks Erik. Nice article. I was able to get the wildcard search to
work including whitespace by manually creating the query as follows:

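qp = Ferret::QueryParser.new
# Build the wildcard query directly instead of parsing a query string
query = qp.get_wild_query('full_name', "#{partial}*")
INDEX.search_each(query) do |doc, score|
  puts "Document #{doc} found with a score of #{score}"
end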

where #{partial} is the partial portion of the full name.

Thanks for your responses.

Tom

tdellaringa · January 20, 2006, 7:54pm

On Jan 20, 2006, at 10:56 AM, Tom D. wrote:

Thanks Erik. Very informative. I suspect the QueryParser either has
some bugs or is not designed to handle this scenario. I will try
manually building the specific types of queries via the API.

There are many tricky scenarios here, because the query parser has to
treat whitespace and special characters as separators and operators,
and because of the analyzer (and when it gets applied) during parsing.

So I don’t think there are any bugs, per se, in this case.

My article at java.net covers this (in the context of Java) in some
of its glory and frustration I think:

<http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html>

It really all depends on what your searching needs are. What does
the user interface for searching demand?

For the full name searches, I just wanted wild card matches on the
right hand side of the query. For instance, any of these should
result in john doe being found:
J, Jo, Joh, John, John D, etc.

The simplest thing to do in this case is what you’re doing for
indexing… combine a field with “firstname lastname” as untokenized,
though lowercased. Then build a WildcardQuery for “piece*” - though
this isn’t going to be possible with the whitespace involved when
using the parser, I don’t think (unless you can escape it somehow).
Be sure to lowercase the query also.
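
Stripped of any search library, the idea looks roughly like the following plain-Ruby sketch. It is not Ferret code: PEOPLE, FULL_NAMES and full_name_prefix_search are hypothetical names for illustration, and String#start_with? stands in for the “piece*” wildcard.

```ruby
# Plain-Ruby illustration, not Ferret API; all names here are hypothetical.
PEOPLE = [
  { first_name: "John", last_name: "Doe" },
  { first_name: "Jane", last_name: "Dorsey" },
  { first_name: "Joe",  last_name: "Bloggs" }
].freeze

# Index time: one untokenized, lowercased "first last" value per person.
FULL_NAMES = PEOPLE.map do |person|
  "#{person[:first_name]} #{person[:last_name]}".downcase
end.freeze

# Query time: lowercase the input too, then match it as a prefix ("piece*").
def full_name_prefix_search(piece)
  prefix = piece.downcase
  FULL_NAMES.select { |name| name.start_with?(prefix) }
end
```

Each of "J", "Jo", "Joh", "John" and "John D" then finds "john doe", and the space in "John D" needs no special handling because it is never split into separate terms.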

Erik