Ferret using UTF-8

Hey all,

I went through the docs on Ferret’s page, plus a quick search through
the mailing list (thread titles), and I couldn’t find any info on how to
have Ferret store its data using UTF-8.

In the scenario where I would use it, nothing is stored outside Ferret
(such as in external databases). So I’m only interested in knowing how
Ferret itself would handle it.

The reason I ask is that I’m deploying a search engine for an
application that will probably be searching for text content in
Japanese/Chinese as well as English. I mention it in case someone
has done this before and knows of any pitfalls.

Thanks in advance.

On 7/12/06, Julio Cesar O. wrote:

The reason I ask is that I’m deploying a search engine for an
application that will probably be searching for text content in
Japanese/Chinese as well as English. I mention it in case someone
has done this before and knows of any pitfalls.

Thanks in advance.

The core of ferret is character encoding agnostic. It treats all
strings as an array of bytes so it doesn’t matter what you put in. You
could store JPEGs in the index if you wanted to.

The analysis section of Ferret is another matter. There are two sets
of analyzers: the ASCII analyzers (AsciiWhiteSpaceAnalyzer,
AsciiStandardAnalyzer), which are the most robust (no encoding errors
raised), and the other analyzers (WhiteSpaceAnalyzer,
StandardAnalyzer), which are based on whichever locale you have set. So
if your operating system’s locale is set to UTF-8, then that is how
the analyzer will treat any strings you pass through it.
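
To make the byte-versus-character distinction above concrete, here is a minimal sketch in plain Ruby. It is not Ferret's code and does not call Ferret at all; `ascii_tokens` and `utf8_tokens` are hypothetical helpers that only mimic the general idea of an ASCII-only tokenizer versus an encoding-aware one.

```ruby
# Illustration only: plain Ruby, not Ferret's internals or API.

# "Japanese" written in Japanese (three characters), a space, then "text".
text = "\u65E5\u672C\u8A9E text"

# An encoding-agnostic store only ever sees bytes.
byte_count = text.b.bytesize   # 3 chars * 3 bytes + 1 space + 4 letters = 14
char_count = text.length       # 3 + 1 + 4 = 8 characters when read as UTF-8

# Hypothetical ASCII-only tokenizer: works on raw bytes and keeps runs of
# ASCII letters. It never raises an encoding error, but it silently drops
# anything that is not ASCII.
def ascii_tokens(str)
  str.b.scan(/[A-Za-z]+/)
end

# Hypothetical encoding-aware tokenizer: splits on whitespace and keeps
# multi-byte characters intact, which only works if the string's encoding
# is known.
def utf8_tokens(str)
  str.split(/\s+/)
end

p ascii_tokens(text)   # only the ASCII word survives
p utf8_tokens(text)    # both the Japanese word and "text" survive
```

Note that whitespace splitting is still a poor fit for Japanese or Chinese, where words are usually not separated by spaces, so a real setup would probably need an analyzer designed for those languages.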

David B. wrote:

The core of ferret is character encoding agnostic. It treats all
strings as an array of bytes so it doesn’t matter what you put in. You
could store JPEGs in the index if you wanted to.

On which subject, I happen to have chucked some bmp files into my index,
and was really quite amazed to see them being returned on search
results. Not only that, but the results were accurate.

For example, if I have a bmp which contains the word “Sheep” (when
viewed as an image) and I search the index for “Sheep” - the bmp is
returned.

I am adding documents using the standard analyser and file.readlines to
add the contents.

If I open the bmp in a text editor and search for “Sheep” - that word is
not contained within the file.

So how come ferret can read the bmp?

Cheers,
Steven

On Wed, 2006-07-12 at 17:23 +0200, steven shingler wrote:

So how come ferret can read the bmp?

OK please ignore what must rank as the stupidest question for some time.

“Sheep” was in the file path, and the path is one of the Ferret document
fields.

For a minute there, I was excited. :)

And David was probably scared that ferret had become conscious. :)

Pedro.

So how come ferret can read the bmp?

OK please ignore what must rank as the stupidest question for some time.

“Sheep” was in the file path, and the path is one of the Ferret document
fields.

For a minute there, I was excited. :)

Cheers,
Steven
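
For anyone who hits the same surprise, the mix-up can be reproduced with a toy index. This is a rough sketch in plain Ruby, not Ferret's API: `ToyIndex`, the field names `:path` and `:content`, and the fake bitmap bytes are all made up for illustration. Reporting which field matched makes it obvious that the hit comes from the path, not from the image data.

```ruby
# Toy field-aware index in plain Ruby; hypothetical, not Ferret's API.
class ToyIndex
  def initialize
    @postings = Hash.new { |h, k| h[k] = [] }  # [field, term] => doc ids
    @docs = []
  end

  # Index every field of the document, treating values as raw bytes.
  def add(doc)
    id = @docs.length
    @docs << doc
    doc.each do |field, value|
      value.b.scan(/[A-Za-z0-9]+/).each do |term|
        @postings[[field, term.downcase]] << id
      end
    end
    id
  end

  # Search all fields and report which field produced each hit.
  def search(term)
    hits = []
    @postings.each do |(field, t), ids|
      next unless t == term.downcase
      ids.each { |id| hits << [id, field] }
    end
    hits.uniq
  end
end

index = ToyIndex.new
# Made-up bytes standing in for a bitmap; they do not contain "Sheep".
bmp_bytes = "BM\x36\x00\x0C\x00".b + "\xFF\x00\x7F".b * 4
index.add(path: "images/Sheep.bmp", content: bmp_bytes)

hits = index.search("sheep")
p hits   # the only hit is in the :path field
```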

Cool. For a minute I wondered whether I should ask if the file was maybe
named ‘sheep’, but then decided that this might offend you ;)

Great one!

Nonetheless, I’ve got a question on this subject too. Does anyone have
experience with a task like this: a search engine that doesn’t use
words as query objects but an uploaded image? Is there something like
this already available on the net? A little Google research of mine
didn’t yield any results. It should also be able to find resized
versions of the same image. Background: images that aren’t authorized
by the copyright owner but won’t be found by Google Images or the like
because they were renamed.

Cheers,
Jan

On Wed, 2006-07-12 at 17:30 +0200, Jan P. wrote:

Nonetheless, I’ve got a question on this subject too. Does anyone have
experience with a task like this: a search engine that doesn’t use
words as query objects but an uploaded image? Is there something like
this already available on the net? A little Google research of mine
didn’t yield any results. It should also be able to find resized
versions of the same image. Background: images that aren’t authorized
by the copyright owner but won’t be found by Google Images or the like
because they were renamed.

See this

Never tried it myself but looks like what you meant. It’s a desktop app
though.

Pedro.

PS: Sorry for the other empty email. I pressed send by mistake before I
was done.

On Wed, 2006-07-12 at 17:30 +0200, Jan P. wrote:

images of the same kind. Background: Images that aren’t authorized by
the copyright owner but won’t be found by google images or the like
because they were renamed.

See this:

Something like this but on the net is what I’m searching for.

Thanks for the pointer!

Jan

On 7/13/06, steven shingler wrote:

So how come ferret can read the bmp?

OK please ignore what must rank as the stupidest question for some time.

“Sheep” was in the file path, and the path is one of the Ferret document
fields.

For a minute there, I was excited. :)

This functionality isn’t due until version Ferret-4.0.