Thinking of using aaf- looking for advice

Hi-

I’m technical lead at Lingr (http://www.lingr.com), a chatroom-based
social networking site. We’ve currently got several million user
utterances stored in MySQL, and we’re looking to build local search
functionality. I’ve played around with aaf and I really like it, but I
have some questions.

  1. Is anyone out there using aaf to index a corpus of this size? If
    so, how has your scaling experience been?

  2. We would be running one central aaf server instance, talking to it
    over drb from our many application servers. We add tens of thousands of
    utterances per day- anyone out there indexing this many items on a daily
    basis over drb? If so, how has your experience been in terms of
    stability?

  3. All of our utterance data is in UTF8, but we don’t know what
    language a particular utterance is in. It’s common to have both latin
    and non-latin text even in the same room. How can I index both types of
    strings effectively within the same model field index?

  4. Any suggestions on how to build the initial index in an offline way?
    I suspect it will probably take many hours to build the initial index.

  5. I suspect we will have to disable_ferret(:always) on our utterance
    model, then update the index manually on some periodic basis (cron job,
    backgroundrb worker, etc.). The reason for this is that we don’t want
    to introduce any delay into the process of storing a new utterance,
    which occurs in realtime during a chat session. Anyone have experience
    doing this? (Rough sketch of what I mean below.)
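
To make question 5 concrete, this is roughly what I have in mind (a
sketch only; the field name is made up and I’m going from my reading of
the aaf docs):

    # app/models/utterance.rb
    class Utterance < ActiveRecord::Base
      acts_as_ferret :fields => [:body]   # hypothetical field name
    end

    # config/environment.rb: turn off aaf's automatic after-save
    # indexing; a periodic job would push new records into the index
    Utterance.disable_ferret(:always)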

Any advice is appreciated!

Best Regards,

Danny B.

On 4/20/07, Danny B. [email protected] wrote:

  1. Is anyone out there using aaf to index a corpus of this size? If
    so, how has your scaling experience been?

Yes. I have several models with more than 4M rows, all indexed with AAF.
My experience has been that AAF is very stable. Most of my challenges
have been with ferret upgrades breaking the index format.

  2. We would be running one central aaf server instance, talking to it
    over drb from our many application servers. We add tens of thousands of
    utterances per day- anyone out there indexing this many items on a daily
    basis over drb? If so, how has your experience been in terms of
    stability?

Yes. Rock solid.

  3. All of our utterance data is in UTF8, but we don’t know what
    language a particular utterance is in. It’s common to have both latin
    and non-latin text even in the same room. How can I index both types of
    strings effectively within the same model field index?

Why not just use UTF8?

  4. Any suggestions on how to build the initial index in an offline way?
    I suspect it will probably take many hours to build the initial index.

Jens has talked about developing a better rebuild_index for AAF that
does this.

However, if your search system isn’t online (i.e., the feature isn’t
enabled in the front end), why would you need anything special? The
AAF DRb server can serve requests while you’re running a rebuild (as
long as you don’t use the current rebuild_index method).

  5. I suspect we will have to disable_ferret(:always) on our utterance
    model, then update the index manually on some periodic basis (cron job,
    backgroundrb worker, etc.). The reason for this is that we don’t want
    to introduce any delay into the process of storing a new utterance,
    which occurs in realtime during a chat session. Anyone have experience
    doing this?

It’s pretty fast. The only time you’d see a slowdown is when you
encounter a lock in the DRb server.

-ryan

Ryan K. wrote:

Yes. I have several models with more than 4M rows, all indexed with AAF.
My experience has been that AAF is very stable. Most of my challenges
have been with ferret upgrades breaking the index format.

Yes. Rock solid.

Great to know- thanks very much!

  3. All of our utterance data is in UTF8, but we don’t know what
    language a particular utterance is in. It’s common to have both latin
    and non-latin text even in the same room. How can I index both types of
    strings effectively within the same model field index?

Why not just use UTF8?

Sorry, I should have been more clear- what I was referring to was not
storage, but rather tokenization. My understanding is that many people
use a simple Regex-based one-token-per-character tokenizer for non-Latin
languages, but, since our languages are mixed, I wasn’t sure what type
of approach to tokenization would be best. Clearly we can’t use that
one-token-per-character analyzer on latin text, right?
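
To be concrete, the one-token-per-character approach I’m referring to
would look roughly like this, if I understand the Ferret API correctly
(a sketch, not tested):

    require 'ferret'

    # sketch: emit one token per character, with no lowercasing.
    # the /u flag assumes UTF-8 input (Ruby 1.8). tolerable for CJK
    # text, but useless for Latin words - which is exactly my problem.
    cjk_analyzer = Ferret::Analysis::RegExpAnalyzer.new(/./u, false)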

However, if your search system isn’t online (i.e., the feature isn’t
enabled in the front end), why would you need anything special? The
AAF DRb server can serve requests while you’re running a rebuild (as
long as you don’t use the current rebuild_index method).

Perhaps I’m remembering incorrectly, but my recollection was that, the
first time I created a new record for a model that uses aaf, the whole
instance blocked while aaf was creating the index. Did I remember that
wrong?

If that is the way it works, then clearly I need to start the
rebuild from outside of the application, before any users can create new
model objects.

Further, are you saying that model creations during the rebuild won’t
block (I guess they realize that a rebuild is already happening and just
return immediately)?

  5. I suspect we will have to disable_ferret(:always) on our utterance
    model, then update the index manually on some periodic basis (cron job,
    backgroundrb worker, etc.). The reason for this is that we don’t want
    to introduce any delay into the process of storing a new utterance,
    which occurs in realtime during a chat session. Anyone have experience
    doing this?

It’s pretty fast. The only time you’d see a slowdown is when you
encounter a lock in the DRb server.

And what would cause that? Do normal model creates cause a lock?

Thanks so much for your info so far and for any further advice you can
give me.

Best Regards,

Danny

On Mon, Apr 23, 2007 at 03:52:51AM +0200, Danny B. wrote: […]

Sorry, I should have been more clear- what I was referring to was not
storage, but rather tokenization. My understanding is that many
people use a simple Regex-based one-token-per-character tokenizer for
non-Latin languages, but, since our languages are mixed, I wasn’t sure
what type of approach to tokenization would be best. Clearly we can’t
use that one-token-per-character analyzer on latin text, right?

Right :)

Some heuristics to get an idea which language you’re dealing with might
be a good way to select a proper analyzing algorithm.

The Nutch search engine (Java, Lucene-based) seems to have something
like that, possibly we could port this:
http://wiki.apache.org/nutch/MultiLingualSupport
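
Until something like that gets ported, even a crude codepoint check
could serve as a first heuristic. A sketch (the ranges below are just an
example, by no means a complete detector):

    # guess whether a string contains CJK characters.
    # unpack('U*') decodes the string as UTF-8 codepoints (Ruby 1.8).
    def cjk?(str)
      str.unpack('U*').any? do |cp|
        (0x2E80..0x9FFF).include?(cp) ||  # CJK radicals, kana, ideographs
        (0xAC00..0xD7AF).include?(cp)     # Hangul syllables
      end
    end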

However, if your search system isn’t online (i.e., the feature isn’t
enabled in the front end), why would you need anything special? The
AAF DRb server can serve requests while you’re running a rebuild (as
long as you don’t use the current rebuild_index method).

Perhaps I’m remembering incorrectly, but my recollection was that, the
first time I created a new record for a model that uses aaf, the whole
instance blocked while aaf was creating the index. Did I remember that
wrong?

No, that’s correct. You can force a rebuild by calling
Model.rebuild_index from the console.
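
For the initial offline build you could run that from outside the app,
e.g. (assuming your model is called Utterance):

    RAILS_ENV=production script/runner 'Utterance.rebuild_index'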

If that is the way it works, then clearly I need to start the
rebuild from outside of the application, before any users can create new
model objects.

Further, are you saying that model creations during the rebuild won’t
block (I guess they realize that a rebuild is already happening and just
return immediately)?

Unfortunately the DRb server doesn’t realize this, yet. As Ryan wrote, I
plan to rework the re-indexing stuff in the near future; most likely
there will then be some kind of index rotation and a queue remembering
model updates that occurred while a rebuild is going on.

And what would cause that? Do normal model creates cause a lock?

Index updates are synchronized, as there may only be one thread writing
to the index at a time. In case immediate indexing of new or updated
records is not needed, I see no problem in doing this later from cron or
backgroundrb based on some flag or timestamp.
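
In code that could be as simple as this (a sketch; the indexed flag
column is an assumption, and please check the exact name of aaf’s
per-record update method for your version):

    # periodic task (cron / backgroundrb): index records saved while
    # automatic indexing was disabled, then flag them as done.
    Utterance.find(:all, :conditions => ['indexed = ?', false],
                   :limit => 1000).each do |utterance|
      utterance.ferret_update   # aaf's per-record index update (verify name)
      utterance.update_attribute(:indexed, true)
    end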

Ferret is fast, but you also have to take into account the DRb round
trip time, so this really could make sense for a chat application.

cheers,
Jens



On Mon, Apr 23, 2007 at 04:42:50PM +0200, Danny B. wrote:

Or perhaps you were thinking of a single analyzer that does heuristics
itself and decides how to tokenize based on the input string?

Yeah, that’s what I was thinking of.

That uber-analyzer could determine the language/type of language used in
a document and then delegate to a specialized analyzer. Same would have
to be done for query analysis - here (because of small text size) it
would be good if a hint could be supplied by the application (i.e. user
profile, ui language used).
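
A minimal sketch of such a delegating analyzer (it reuses the cjk?
heuristic from my last mail; note that a whole string goes to one
sub-analyzer, so truly mixed-script utterances still end up on one
side):

    require 'ferret'

    class PerScriptAnalyzer < Ferret::Analysis::Analyzer
      def initialize
        @latin = Ferret::Analysis::StandardAnalyzer.new
        # one token per character for CJK text, no lowercasing
        @cjk = Ferret::Analysis::RegExpAnalyzer.new(/./u, false)
      end

      # ferret calls this for every field; delegate based on the script
      def token_stream(field, str)
        (cjk?(str) ? @cjk : @latin).token_stream(field, str)
      end
    end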

[…]

Since we can’t turn on automatic indexing (at least until the index is
up to date), how do we get the index up to date?

I’d just remember the time I started rebuild_index, and after it’s
finished, index all records that have been created afterwards, not doing
a rebuild_index but reading/indexing them one by one.

Or even better, let your background indexer (which will later handle the
regular index updates) do this: mark all records older than the rebuild
timestamp as ‘already indexed’, then start the indexer so it only has to
handle the not-yet-indexed records.
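
In code, that second variant might look like this (again assuming the
indexed flag column):

    started_at = Time.now.utc
    Utterance.rebuild_index
    # everything created before the rebuild started is now in the index;
    # flag it so the background indexer only picks up the stragglers
    Utterance.update_all(['indexed = ?', true],
                         ['created_at < ?', started_at])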

regards,
Jens



Some heuristics to get an idea which language you’re dealing with might
be a good way to select a proper analyzing algorithm.

I think that’s probably the only way to do this effectively, but how can
I specify a particular analyzer on a per-model-instance basis? I only
recall seeing that aaf allows analyzer specification on a per-model
basis.

Or perhaps you were thinking of a single analyzer that does heuristics
itself and decides how to tokenize based on the input string?

No, that’s correct. You can force a rebuild by calling
Model.rebuild_index from the console.

OK, fair enough.

Further, are you saying that model creations during the rebuild won’t
block (I guess they realize that a rebuild is already happening and just
return immediately)?

Unfortunately the DRb server doesn’t realize this, yet. As Ryan wrote, I
plan to rework the re-indexing stuff in the near future; most likely
there will then be some kind of index rotation and a queue remembering
model updates that occurred while a rebuild is going on.

So how would you suggest that we ever get the index “caught up”? The
first rebuild_index will probably take many hours, and, while that’s
building, thousands of new model instances will be created. Since we
can’t turn on automatic indexing (at least until the index is up to
date), how do we get the index up to date?

Index updates are synchronized, as there may only be one thread writing
to the index at a time. In case immediate indexing of new or updated
records is not needed, I see no problem in doing this later from cron or
backgroundrb based on some flag or timestamp.

Ferret is fast, but you also have to take into account the DRb round
trip time, so this really could make sense for a chat application.

Yeah, it sounds like we would have to do it periodically, and tell users
to expect a few minutes of index latency. That’s fine- it certainly
beats the typical latency that we already get from the Google crawlers,
which are currently our only local search functionality.

Best Regards,

Danny