Removing special/syntax characters

Is there any somewhat standard way to remove or otherwise handle
special or syntax characters from a user’s search, such as a colon?

I was thinking maybe there was something akin to
Ferret::Analysis::FULL_ENGLISH_STOP_WORDS, like
Ferret::Analysis::FERRET_SYNTAX_CHARS, but no such luck.

How are other folks dealing with filtering user input?

John

Excerpts from John B.'s message of Wed Jan 17 13:24:48 -0800 2007:

Is there any somewhat standard way to remove or otherwise handle
special or syntax characters from a user’s search, such as a colon?

If you want to allow them the full syntax, just use QueryParser#parse
(and handle the QueryParseException). If you want to disallow anything
special, you could split on whitespace and turn each token into a
TermQuery, then throw them all into a BooleanQuery.

Anything in between (e.g. allow phrase queries, but disallow everything
else) will be more complicated. But I can’t think of many good reasons
to disallow the full syntax in the first place.
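
For the "in between" case, a minimal sketch of the tokenizing step might look like the following. It is plain Ruby and does not call Ferret at all; `classify_tokens` is a hypothetical helper name, and mapping the resulting pairs onto TermQuery or phrase queries inside a BooleanQuery is left out on purpose.

```ruby
# Hypothetical helper, not part of Ferret: split a raw search string into
# plain terms and double-quoted phrases so each can be handled separately.
def classify_tokens(raw)
  raw.scan(/"([^"]*)"|(\S+)/).map do |phrase, word|
    if phrase
      words = phrase.split
      # Skip empty phrases such as "" rather than emitting an empty query.
      words.empty? ? nil : [:phrase, words]
    else
      [:term, word]
    end
  end.compact
end
```

An unbalanced quote simply stays attached to its word as part of a term, so any stricter handling would have to be decided separately.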

On 2007-01-17, at 10:24, John B. wrote:

Is there any somewhat standard way to remove or otherwise handle
special or syntax characters from a user’s search, such as a colon?

I was thinking maybe there was something akin to
Ferret::Analysis::FULL_ENGLISH_STOP_WORDS, like
Ferret::Analysis::FERRET_SYNTAX_CHARS, but no such luck.

How are other folks dealing with filtering user input?

Hey John,

I guess a predefined constant would be a nice addition. I do it
manually:

if not defined?(FERRET_SPECIAL_CHARS)
  FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
                           /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]
end

Ben

Excerpts from John B.'s message of Wed Jan 17 16:14:47 -0800 2007:

Unfortunately, one of the things that the client has asked for is

one two three

to be transformed to

*one* *two* *three*

Ok. Then I don’t think you really need to worry about escaping anything.
You can split on whitespace, and wrap each token in a WildcardQuery,
prefixed and suffixed with a star. Unless you’re supporting phrase
queries surrounded by quotes, in which case “split on whitespace”
becomes something more complicated. Or unless you want to disallow
wildcards from the user, in which case you’ll need to escape * and ?.

And also to be able to transparently search FOR the special characters
themselves. Which means I will actually not be filtering, but escaping
the special characters. (I’m assuming Ferret has some facility for
searching for special characters, although I admit I haven’t looked
into it much yet).

Yep, as long as your tokenizer doesn’t discard them, you’re fine.

Basically if you’re avoiding QueryParser and building Query objects
directly from the strings, then none of these characters have special
semantics (except for * and ? with WildcardQuery).
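
As a rough illustration of the split-and-wrap idea, here is a plain Ruby sketch that only builds the pattern strings. `wildcard_patterns` is a hypothetical name. Because this thread does not confirm how a WildcardQuery treats an escaped * or ?, the sketch drops user-supplied wildcards instead of escaping them, which is a simplification and not the escaping described above.

```ruby
# Hypothetical helper: turn "one two three" into wildcard patterns.
# User-typed * and ? are removed (a simplification; see the note above).
def wildcard_patterns(raw)
  raw.split
     .map { |token| token.delete('*?') }
     .reject(&:empty?)
     .map { |token| "*#{token}*" }
end
```

Each resulting pattern could then be wrapped in a WildcardQuery and the queries combined, as suggested in the reply.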

On Jan 17, 2007, at 5:26 PM, Benjamin K. wrote:

I guess a predefined constant would be a nice addition. I do it
manually:

if not defined?(FERRET_SPECIAL_CHARS)
  FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
                           /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]
end

Thanks Benjamin!

On Jan 17, 2007, at 6:46 PM, William M. wrote:

If you want to allow them the full syntax, just use QueryParser#parse
(and handle the QueryParseException). If you want to disallow anything
special, you could split on whitespace and turn each token into a
TermQuery, then throw them all into a BooleanQuery.

Anything in between (e.g. allow phrase queries, but disallow everything
else) will be more complicated. But I can’t think of many good reasons
to disallow the full syntax in the first place.

William-

I agree. If it was up to me, I would allow the full syntax.
Unfortunately, one of the things that the client has asked for is

one two three

to be transformed to

*one* *two* *three*

And also to be able to transparently search FOR the special
characters themselves. Which means I will actually not be filtering,
but escaping the special characters. (I’m assuming Ferret has some
facility for searching for special characters, although I admit I
haven’t looked into it much yet).

Cheers,
John
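
A sketch of the escaping approach (as opposed to filtering) is below. It assumes that the query parser treats a backslash before a special character as an escape, which is not confirmed in this thread, so check it against the QueryParser documentation. The character list is taken from the one posted in this thread plus the backslash itself, and `QUERY_SPECIAL_CHARS` and `escape_query` are hypothetical names.

```ruby
# Assumed list of syntax characters (from the thread, plus backslash).
QUERY_SPECIAL_CHARS = ':()[]!+"~^-|<>=*?\\'.chars
QUERY_SPECIAL_PATTERN = Regexp.union(QUERY_SPECIAL_CHARS)

# Hypothetical helper: prefix every special character with a backslash
# so it is searched for literally (assuming backslash escaping works).
def escape_query(raw)
  raw.gsub(QUERY_SPECIAL_PATTERN) { |ch| "\\#{ch}" }
end
```

Whether the escaped characters can actually be found still depends on the tokenizer keeping them at index time.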

On Jan 17, 2007, at 8:23 PM, William M. wrote:

You can split on whitespace, and wrap each token in a WildcardQuery,
prefixed and suffixed with a star. Unless you’re supporting phrase
queries surrounded by quotes, in which case “split on whitespace”
becomes something more complicated. Or unless you want to disallow
wildcards from the user, in which case you’ll need to escape * and ?.

Yes, I want to do all of the above :)

Thanks for all the tips William, I’m going to look into this in the
future when I make a more refined solution.

In the meantime, I am just going to strip out all special/syntax
chars from the queries, which I believe will have the behavior I desire.

I want a search for

one-two

to pull up results with

one two
one-two
onetwo

John
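
One way to read that requirement is as a query-side expansion of a hyphenated term into its three spellings. The sketch below is only an illustration of that reading; `hyphen_variants` is a hypothetical helper, and whether the variants actually match depends on how the index was analyzed.

```ruby
# Hypothetical helper: expand "one-two" into the spellings listed above.
def hyphen_variants(term)
  parts = term.split('-').reject(&:empty?)
  return [term] if parts.length < 2
  [parts.join(' '), parts.join('-'), parts.join].uniq
end
```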

Excerpts from John B.'s message of Fri Jan 19 15:57:35 -0800 2007:

On Jan 17, 2007, at 5:26 PM, Benjamin K. wrote:

FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
  /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]

  1. Should $ be in the list?

There’s a list at
http://ferret.davebalmain.com/api/classes/Ferret/QueryParser.html
and $ doesn’t seem to be on it. (Neither do & or .)

  2. Here is the solution I came up with (nothing mind-shattering, but
    I thought some folks on the list might appreciate seeing it):

query = (query.split('') - (FERRET_SPECIAL_CHARS - CONFIG[:allowed_ferret_syntax])).join()

Doesn’t this also eliminate escaped versions of the special characters?
(Might not be a problem, depending on the specifics of the corpus.)

On Jan 17, 2007, at 5:26 PM, Benjamin K. wrote:

FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
  /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]

  1. Should $ be in the list?

  2. Here is the solution I came up with (nothing mind-shattering, but
    I thought some folks on the list might appreciate seeing it):

query = (query.split('') - (FERRET_SPECIAL_CHARS - CONFIG[:allowed_ferret_syntax])).join()

CONFIG[:allowed_ferret_syntax] contains the characters we are
allowing, right now only double quote.

Unless I am missing something, we are now successfully allowing no
ferret syntax other than phrases. Whoo hoo!

John
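
One caveat with the array-difference approach: Ruby's Array#- compares elements by equality, and a one-character String is never equal to a Regexp, so the subtraction only removes anything if the special characters are stored as strings. A minimal sketch under that assumption follows. `SPECIAL_CHARS`, `ALLOWED_SYNTAX` and `strip_special` are hypothetical names standing in for FERRET_SPECIAL_CHARS, CONFIG[:allowed_ferret_syntax] and the one-liner above.

```ruby
# Special characters as plain strings (same set as the list in this thread).
SPECIAL_CHARS = ':()[]!+"~^-|<>=*?.&'.chars
# Characters we still allow; here only the double quote, for phrases.
ALLOWED_SYNTAX = ['"']

# Hypothetical helper: drop every special character that is not allowed.
def strip_special(query, allowed = ALLOWED_SYNTAX)
  (query.chars - (SPECIAL_CHARS - allowed)).join
end
```

Note that this turns one-two into onetwo; substituting a space instead of deleting would be a different design choice.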