Questions about Searching


#1

Hi,

I have some questions about searching with Ferret. I have a user
index with first_name, last_name and full_name (which is just first
plus last with a space).

Here are a couple of questions:

  1. If I store the fields tokenized, it appears as though queries are
    case-insensitive. However, for untokenized, the query is
    case-sensitive. How can I make the untokenized searches
    case-insensitive?

  2. If I have a field with whitespace in it, how can I search for the
    whitespace using wildcard searches? For instance, if the full_name I
    am searching for is “John D.”, how can I build a query for that? I
    have tried numerous combinations; here are a couple I tried:
    full_name:"#{query}"* <-- This will match every field in the index
    full_name:"#{query}*" <-- This matches nothing

  3. When I store the fields as untokenized, exact matches seem to not
    work for me anymore. For instance, this query worked for tokenized
    first_name, but does not for untokenized first_name:
    first_name:John

    But this query will return results:
    first_name:Joh?

  4. Is there a better way to search for the first and last name
    combination than storing another index with them concatenated?

Thanks,

Tom


#2

On Jan 20, 2006, at 8:39 AM, Tom D. wrote:

Here are a couple of questions:

  1. If I store the fields tokenized, it appears as though queries are
    case-insensitive. However, for untokenized, the query is
    case-sensitive. How can I make the untokenized searches
    case-insensitive?

By lowercasing the text you index and lowercasing the text in the
query. Search matches are always case-sensitive, but tokenized fields
generally get lowercased along the way, and the query parser
lowercases terms as well (generally with the same analyzer).
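
A minimal sketch of that advice in plain Ruby, using a toy in-memory hash in place of a real untokenized field. The `ExactFieldIndex` class and `normalize` helper are made up for illustration and are not part of Ferret; the point is only that the same normalization has to run on both the indexing side and the query side.

```ruby
# Toy stand-in for an untokenized field; this is NOT Ferret's API.
# Each stored value is a single exact term, so matching is string equality.
def normalize(text)
  text.strip.downcase
end

class ExactFieldIndex
  def initialize
    @docs_by_term = Hash.new { |hash, term| hash[term] = [] }
  end

  # Lowercase once at index time...
  def add(doc_id, value)
    @docs_by_term[normalize(value)] << doc_id
  end

  # ...and again at query time, so both sides agree on the term.
  def lookup(value)
    @docs_by_term.fetch(normalize(value), [])
  end
end

index = ExactFieldIndex.new
index.add(1, "John")
index.add(2, "JOHN")
index.add(3, "Jane")

index.lookup("john") # finds documents 1 and 2, whatever case was indexed
```

If only one side is normalized, the exact match fails again, which is the behaviour described for the untokenized field.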

  2. If I have a field with whitespace in it, how can I search for the
    whitespace using wildcard searches? For instance, if the full_name I
    am searching for is “John D.”, how can I build a query for that? I
    have tried numerous combinations; here are a couple I tried:
    full_name:"#{query}"* <-- This will match every field in the index
    full_name:"#{query}*" <-- This matches nothing

I strongly suspect the issue is the field being analyzed during query
parsing. Off the top of my head, I’m not sure exactly what facilities
Ferret has for controlling this, but in Java Lucene there is a
PerFieldAnalyzerWrapper that helps with it. The space would be
problematic, as would the double quotes in the way you have built the
query. You may need to create a WildcardQuery via the API rather than
using the parser.

  3. When I store the fields as untokenized, exact matches seem to not
    work for me anymore. For instance, this query worked for tokenized
    first_name, but does not for untokenized first_name:
    first_name:John

    But this query will return results:
    first_name:Joh?

This again has to do with the case and analyzer issue. You are
using a parser that does analysis of the text. Try using the parser
to create a Query and see what it consists of (.to_s?).
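
One possible reading of that behaviour, sketched in plain Ruby. It assumes a parser that lowercases plain terms through the analyzer but leaves wildcard patterns exactly as typed; that assumption fits the results reported above but has not been checked against Ferret's source. The helper names `parse_term`, `term_regexp` and `matches?` are hypothetical.

```ruby
# Hypothetical model of a query parser, NOT Ferret's implementation.
# Assumption: plain terms pass through a lowercasing analyzer,
# while terms containing wildcards are left exactly as typed.
def parse_term(text)
  text.match?(/[*?]/) ? text : text.downcase
end

# Translate a wildcard pattern into an anchored regular expression:
# "?" matches one character and "*" matches any run of characters.
def term_regexp(pattern)
  body = Regexp.escape(pattern).gsub('\?', '.').gsub('\*', '.*')
  Regexp.new("\\A#{body}\\z")
end

def matches?(stored_term, query_text)
  term_regexp(parse_term(query_text)).match?(stored_term)
end

stored = "John"          # untokenized, so indexed with its capital J

matches?(stored, "John") # false: the query term was lowercased to "john"
matches?(stored, "Joh?") # true: the pattern kept its capital J
matches?("john", "John") # true: a lowercased (tokenized) term matches
```

Printing the parsed query, as suggested above, is the way to check whether the real parser behaves like this model.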

  4. Is there a better way to search for the first and last name
    combination than storing another index with them concatenated?

It really all depends on what your searching needs are. What does
the user interface for searching demand?

Erik

#3

Thanks Erik. Very informative. I suspect the QueryParser either has
some bugs or is not designed to handle this scenario. I will try
manually building the specific types of queries via the API.

It really all depends on what your searching needs are. What does
the user interface for searching demand?

For the full name searches, I just wanted wild card matches on the
right hand side of the query. For instance, any of these should
result in john doe being found:
J, Jo, Joh, John, John D, etc.

Tom


#4

Thanks Erik. Nice article. I was able to get the wildcard search to
work including whitespace by manually creating the query as follows:

qp = Ferret::QueryParser.new
query = qp.get_wild_query('full_name', "#{partial}*")
INDEX.search_each(query) do |doc, score|
  # handle each matching document and its score here
end

where #{partial} is the partial portion of the full name.

Thanks for your responses.

Tom


#5

On Jan 20, 2006, at 10:56 AM, Tom D. wrote:

Thanks Erik. Very informative. I suspect the QueryParser either has
some bugs or is not designed to handle this scenario. I will try
manually building the specific types of queries via the API.

There are many tricky scenarios with the query parser: whitespace and
special characters have to be handled as separators and operators, and
then there is the analyzer (and when it gets applied).

So I don’t think there are any bugs, per se, in this case.

My article at java.net covers this (in the context of Java) in some
of its glory and frustration I think:

<http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html>
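
To make the separator problem concrete, here is a toy clause splitter in plain Ruby. It is not Ferret's or Lucene's real grammar; it only assumes that unquoted whitespace separates clauses and that a clause without a `field:` prefix falls back to a default field. The `split_clauses` helper and the `content` default field name are invented for the example.

```ruby
# Toy clause splitter, NOT the real query parser grammar.
# Assumption: unquoted whitespace separates clauses, and a clause
# without a "field:" prefix falls back to a default field.
DEFAULT_FIELD = "content" # hypothetical default field name

def split_clauses(query_string)
  query_string.split.map do |clause|
    field, separator, text = clause.partition(":")
    separator.empty? ? [DEFAULT_FIELD, field] : [field, text]
  end
end

split_clauses("full_name:John D*")
# The name is cut in two: [["full_name", "John"], ["content", "D*"]]
```

Under that assumption the second half of the name never reaches the full_name field at all, which is why building the query object directly avoids the problem.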

It really all depends on what your searching needs are. What does
the user interface for searching demand?

For the full name searches, I just wanted wild card matches on the
right hand side of the query. For instance, any of these should
result in john doe being found:
J, Jo, Joh, John, John D, etc.

The simplest thing to do in this case is what you’re already doing for
indexing: combine “firstname lastname” into a single untokenized,
lowercased field. Then build a WildcardQuery for “piece*”, though I
don’t think that will be possible through the parser with the
whitespace involved (unless you can escape it somehow).
Be sure to lowercase the query also.
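
A rough plain-Ruby sketch of what that trailing-wildcard search amounts to. An array of lowercased strings stands in for the untokenized full_name terms; the sample names and the `prefix_search` helper are made up, and none of this is Ferret's API.

```ruby
# Toy stand-in for an untokenized, lowercased full_name field.
# NOT Ferret's API: a sorted array of terms plays the role of the term index.
FULL_NAMES = ["John Doe", "Jane Roe", "Johnny Depp"].map(&:downcase).sort.freeze

# Equivalent of a trailing-wildcard query "piece*": keep every term that
# starts with the lowercased piece. Whitespace needs no special handling
# because the piece is never parsed, only compared.
def prefix_search(terms, piece)
  prefix = piece.downcase
  terms.select { |term| term.start_with?(prefix) }
end

prefix_search(FULL_NAMES, "J")      # all three names
prefix_search(FULL_NAMES, "John")   # "john doe" and "johnny depp"
prefix_search(FULL_NAMES, "John D") # only "john doe"
```

Each of J, Jo, Joh, John and John D finds john doe here, because the space is just another character in the stored term.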

Erik