I’ve recently been working on a search plugin for Rails. It’s going
well, but I wondered if there was a better way to store the index
than my current method.
The top level of the index is basically a hash with terms as keys,
and a hash of records (and lots of other vectors) as values. I’ll
make it an array of record IDs to simplify the example below.
"foo" => [123, 456, 789]
"bar" => [321, 654, 987]
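In Ruby terms, the simplified structure looks something like this (a sketch: the real index keeps richer per-record data, reduced here to plain ID arrays):

```ruby
# Simplified top-level index: term => array of matching record IDs.
# In the real plugin each value is a hash of records plus other
# vectors; plain ID arrays keep the example small.
index = {
  "foo" => [123, 456, 789],
  "bar" => [321, 654, 987]
}

# Looking up a term returns the matching record IDs
index["foo"]  # => [123, 456, 789]
```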
This index is then split up into shards or partitions (to borrow some
DB lingo) and stored on the filesystem via marshal. When I want to
load a particular term, only the shard containing that term is
loaded. The size of the shards can be adjusted, which affects
performance in various ways. Small shards give faster queries, but
slower indexing due to the number that need to be loaded.
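The shard-per-query approach can be sketched like this (hypothetical names and layout, not the plugin's actual code; a stable hash of the term picks the shard file, and only that file is unmarshalled to answer a lookup):

```ruby
require 'digest/md5'
require 'fileutils'

# Hypothetical sketch of marshal-based sharding.
NUM_SHARDS = 16  # adjustable: many small shards vs few large ones

# Stable hash of the term picks its shard
def shard_for(term)
  Digest::MD5.hexdigest(term).to_i(16) % NUM_SHARDS
end

def shard_path(term)
  File.join("index", "shard_#{shard_for(term)}.marshal")
end

# Load only the one shard that can contain the term
def load_shard(term)
  path = shard_path(term)
  File.exist?(path) ? Marshal.load(File.binread(path)) : {}
end

def save_shard(term, shard)
  FileUtils.mkdir_p(File.dirname(shard_path(term)))
  File.binwrite(shard_path(term), Marshal.dump(shard))
end

def lookup(term)
  load_shard(term)[term] || []
end
```

With a layout like this, indexing a batch of documents touches up to NUM_SHARDS files, while a single-term query loads exactly one, which is the query-speed vs indexing-speed trade-off described above.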
OK, so that’s how I’m doing it now, but I wondered if anyone had a
better suggestion. Desirable features include…
- Random read + write
- Fast, even if it's only due to the random access: it saves loading a whole shard of terms I'm not likely to use.
- Compression: currently the dumped index is at least as large as the database itself.
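On the compression point, one cheap thing to try (an assumption on my part, not something the plugin does) is wrapping the Marshal dump in Zlib; ID lists tend to deflate well:

```ruby
require 'zlib'

# Hypothetical sketch: compress marshalled shards with Zlib.
def dump_compressed(shard)
  Zlib::Deflate.deflate(Marshal.dump(shard))
end

def load_compressed(data)
  Marshal.load(Zlib::Inflate.inflate(data))
end

shard = { "foo" => (1..1000).to_a }
raw        = Marshal.dump(shard)
compressed = dump_compressed(shard)
# compressed.bytesize is typically well under raw.bytesize
```

The trade-off is a little CPU on every shard load/store in exchange for smaller files, which may or may not be a win depending on how often shards are rewritten.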
Any suggestions are welcome.