Parallel indexing with unique id?

bryan123 · March 25, 2008, 6:30am

Hello all,
Is it possible to use parallel indexing and still ensure unique documents in the merged index? Using the canned example, I’m ending up with non-unique entries. It’s just adding them all together even though I’ve defined a unique :key.

How can I tell the IndexWriter to keep my uniqueness constraints?

For example, imagine that I have two indexes of a phone book:

“index_one” contains a unique set of names A-through-P (let’s say the key is their phone number).

“index_two” contains a unique set of names K-through-Z.

When I merge them, I would hope to get a unique index of A-through-Z, but I’m getting double entries where they overlap, K-through-P.

Here’s some code to demonstrate. My :id field is a long-ish unique alphanumeric string. In the example below, “one” and “two” are actually identical copies, each containing about 60,000 docs. I was hoping to get a combined index containing the same 60,000 docs, but ended up with 120,000.

Any help will be greatly appreciated. Thanks!

####################

require 'ferret'
include Ferret::Analysis
include Ferret::Index

one = "Documents/bucket/index_1"
two = "Documents/bucket/index_2"
merged = "Documents/bucket/merged_index"

pfa = PerFieldAnalyzer.new(LetterAnalyzer.new)
pfa[:id] = WhiteSpaceAnalyzer.new

field_infos = FieldInfos.new(:term_vector => :no)
field_infos.add_field(:id, :index => :untokenized)

# index_one is opened on path "one", index_two on path "two"
index_one = Ferret::I.new(
  :key => :id,
  :max_buffer_memory => 0x8000000,
  :merge_factor => 5,
  :path => one,
  :analyzer => pfa,
  :field_infos => field_infos)

index_two = Ferret::I.new(
  :key => :id,
  :max_buffer_memory => 0x8000000,
  :merge_factor => 5,
  :path => two,
  :analyzer => pfa,
  :field_infos => field_infos)

readers = []
readers << IndexReader.new(one)
readers << IndexReader.new(two)

puts "size of index_one = "+index_one.size.to_s
puts "size of index_two = "+index_two.size.to_s

index_writer = IndexWriter.new(:path => merged)
index_writer.add_readers(readers)
index_writer.close()
readers.each{ |reader| reader.close() }

i = Ferret::I.new(:path => merged)

puts "size before optimize = "+i.size.to_s
i.optimize
puts "size after optimize = "+i.size.to_s

bryan123 · March 25, 2008, 9:13am

Hi!

On Mon, Mar 24, 2008 at 11:29:14PM -0600, R. Bryan Hughes wrote:

> Hello all,
> Is it possible to use parallel indexing and still ensure unique documents in
> the merged index? Using the canned example, I’m ending up with non-unique
> entries. It’s just adding them all together even though I’ve defined a unique
> :key.
>
> How can I tell the IndexWriter to keep my uniqueness constraints?

You can’t. The :key option is only interpreted by Ferret’s Index class,
which will delete any already existing records with the same key field
value before adding a new record.
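
To make the difference concrete, here is a minimal sketch in plain Ruby. It does not call Ferret at all: `merge_by_appending` and `KeyedIndex` are hypothetical names that only model the two behaviours described above, and the phone-book data is made up. The first models a reader-level merge that copies every document across unchanged; the second models a keyed add that deletes any existing record with the same key before storing the new one.

```ruby
# Plain-Ruby model of the two behaviours; no Ferret calls are used here.

# Models a reader-level merge: documents are appended as they are,
# so a key present in both sources ends up in the result twice.
def merge_by_appending(*sources)
  sources.flatten(1)
end

# Models an index opened with a :key option: adding a document first
# removes any stored document with the same key value, then stores it.
class KeyedIndex
  def initialize(key)
    @key = key
    @docs = {} # key value => document
  end

  def <<(doc)
    @docs.delete(doc[@key]) # drop the old record, if there is one
    @docs[doc[@key]] = doc
    self
  end

  def size
    @docs.size
  end

  def docs
    @docs.values
  end
end

# Hypothetical phone-book data: names A..P and K..Z with a made-up :id.
index_one = ('A'..'P').map { |c| { :id => "id-#{c}", :name => c } }
index_two = ('K'..'Z').map { |c| { :id => "id-#{c}", :name => c } }

appended = merge_by_appending(index_one, index_two)

keyed = KeyedIndex.new(:id)
(index_one + index_two).each { |doc| keyed << doc }

puts "appended size = #{appended.size}"
puts "keyed size    = #{keyed.size}"
```

With this data the appended merge holds 32 entries (K through P twice) and the keyed one holds 26. In Ferret terms, this suggests that getting a unique merged index means adding each document through an Index opened with :key => :id rather than merging readers, at the cost of re-adding every document.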

Cheers,
Jens

--
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

bryan123 · March 26, 2008, 4:23pm

Thanks! You saved me lots of time.