I’m a Ferret newbie, so hopefully I’m missing something simple.
I am using Ferret to index data about 36,000 products from a MySQL
database. The index has one document for each product, with these
important fields:
id: the id (unique) of the product record in the database
content: a concatenation of several bits of information from the product
and associated records
I have a few tools to manage my index. First, my bulk indexer creates a
new index from scratch with every product in the database. Second, my
updater can delete a single product from the index and re-insert it.
Finally, given an id, I can dump all the stored fields in the index for
the associated document (to help hunt down this problem, I started
storing the content field; normally it would not be stored).
If I run this query against a newly created index:
content:"blood pressure"
I get 4 hits. But I know there are more expected results. I can easily
find an example product that should be returned but is not. If I look
this product up in the index by id, and dump the results, I can see that
the content field has the correct data stored (and “blood pressure”
appears in this field more than once). But for some reason it isn’t
returned from the above query.
Oddly, if I then update this product (i.e. delete the document and
re-insert it using the update tool), my query suddenly begins including
this product, and I get 5 hits instead of 4. I have been able to repeat
this for several “missing” products.
I have re-run my bulk indexer several times with identical results. The
bulk indexer is a little complex for the sake of performance, and I
suspected that it was somehow broken, so I modified it so that it simply
calls the updater code once for each product. This method was slower,
and the resulting index was clearly different. My same “blood pressure”
query now returned 14 hits instead of 4, but it was still missing many.
I was again able to make missing products start working by removing them
from the index and reinserting them.
If I dump one of these mysterious documents before and after updating,
according to diff they are identical. It is as though the product data
is stored in the index correctly, but the actual index (the index in
the index, if you will…) is borked in some way.
I have been able to reproduce these exact results in two configurations:
1: Ruby 1.8.4 / Ferret 0.10.6 / Mac OS X 10.4.7 on Intel
2: Ruby 1.8.4 / Ferret 0.10.8 / Debian Linux on Intel
In one case, after a full index the normal way, it seemed to return 5
results for my query instead of 4. It is possible I made a mistake, or
perhaps the exact number of results is semi-random per run of the
indexer.
If anybody can help me understand what is going on, I would be very
appreciative.
Thanks,
Geoff
PS: Here is some relevant code in case it helps. If you need more,
please ask, but this should be everything that matters. If necessary, I
can try to produce a simple test case that reproduces the problem…
— bulk indexer —
# create an empty index...
fi = Ferret::Index::FieldInfos.new(:term_vector => :no)
fi.add_field(:id, :index => :untokenized, :term_vector => :no, :store => :yes)
fi.add_field(:content, :index => :yes, :term_vector => :no, :store => :no)
fi.create_index("search-index-new")

# open it...
index = Ferret::Index::Index.new(:path => 'search-index-new',
                                 :analyzer => Ferret::Analysis::AsciiStandardAnalyzer.new)

# get the products...
start = Time.new
puts 'loading product data'
offset = 0
batch_size = 100
loop do
  prods = Vandelay::Product.find(:all, :limit => batch_size, :offset => offset,
                                 :include => [:descriptions, :categories,
                                              {:skus => :supplieritems}])
  offset += batch_size
  break if prods.size == 0
  populate_index(index, prods)
end

# optimize it...
puts 'optimizing index...'
index.optimize
index.close

# and finally copy it into place
FileUtils.remove_dir('search-index')
FileUtils.move('search-index-new', 'search-index')
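The batching in the loop above can be sketched in isolation. This is a toy stand-in, not the real code: an Array plays the role of the MySQL product table, and `fetch_batch` is a hypothetical helper mimicking `find(:all, :limit => ..., :offset => ...)`:

```ruby
# Toy model of the bulk indexer's batching loop. An Array stands in for
# the product table; fetch_batch mimics find(:limit, :offset).
def fetch_batch(rows, offset, batch_size)
  rows[offset, batch_size] || []   # past the end of the table, return an empty batch
end

rows = (1..250).to_a   # pretend product ids
batch_size = 100
offset = 0
seen = []
loop do
  batch = fetch_batch(rows, offset, batch_size)
  offset += batch_size
  break if batch.size == 0   # same termination condition as the indexer
  seen.concat(batch)         # stand-in for populate_index(index, batch)
end
puts seen.size   # => 250; every row visited exactly once
```

The loop terminates on the first empty batch, so it makes one extra query past the end of the table but never skips or repeats a row (assuming the table doesn't change mid-run).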
— populate_index method —
def populate_index(index, products)
  # get the ids of every product for caching purposes...
  ids = products.collect {|p| p.id}

  # pre-cache all the keywords for the products
  kwcache = {}
  Vandelay::Keyword.find_by_sql(["select productId, term from
      product_keywords where productId in (?)", ids]).each {|kw|
    sym = kw.productId.to_sym
    kwcache[sym] = [] if !kwcache[sym]
    kwcache[sym] << kw.term
  }

  # pre-cache all the attribute values for the products
  attr_cache = {}
  Vandelay::ProductStringAttribute.find_by_sql(["select productId, name,
      value from product_stringattribute where productId in (?)", ids]).each {|a|
    sym = a.productId.to_sym
    attr_cache[sym] = [] if !attr_cache[sym]
    attr_cache[sym] << a
  }
  Vandelay::ProductBooleanAttribute.find_by_sql(["select productId, name,
      value from product_booleanattribute where productId in (?)", ids]).each {|a|
    sym = a.productId.to_sym
    attr_cache[sym] = [] if !attr_cache[sym]
    attr_cache[sym] << a
  }

  # now populate the index with data
  puts "indexing #{products.size} products..."
  products.each {|prod|
    index << prod.index_document(:keywords => kwcache,
                                 :attribute_values => attr_cache)
  }
end
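All three caching passes above build the same shape: a Hash mapping a product id to an Array of values. As a standalone sketch (with plain Structs standing in for the `find_by_sql` rows, so the names here are illustrative rather than the real schema), the same pattern can be written with a default-block Hash, which removes the `cache[sym] = [] if !cache[sym]` initialization step:

```ruby
# Rows as simple structs standing in for the find_by_sql results.
Row = Struct.new(:productId, :term)
rows = [Row.new('1', 'blood'), Row.new('1', 'pressure'), Row.new('2', 'monitor')]

# A Hash whose default block creates the Array on first access,
# so each row can be appended without an explicit existence check.
kwcache = Hash.new {|h, k| h[k] = []}
rows.each {|kw| kwcache[kw.productId.to_sym] << kw.term}

puts kwcache[:'1'].inspect   # terms grouped under product 1
```

The behavior is identical to the explicit-check version; it's purely a readability choice.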
— updater —
index = Ferret::Index::Index.new(:path => 'search-index',
                                 :analyzer => Ferret::Analysis::AsciiStandardAnalyzer.new)
index.delete(:id, product.id)
index << product.index_document
index.close
— Vandelay::Product::index_document method —
def index_document(caches = {})
  result = {}
  result[:id] = self.id
  result[:active] = self.isActive

  # add attributes
  if caches[:attribute_values] != nil
    build_attribute_cache(caches[:attribute_values][self.id.to_sym])
  end
  ALL_ATTRIBUTES.each { |sa|
    result["attr_#{sa.name}".to_sym] = self.attribute_value(sa)
  }

  # add content
  content = ''
  content << self.id.to_s << ' ' << self.name << ' '
  self.descriptions.each {|d| content << d.text << ' '}
  if caches[:keywords] != nil
    kwterms = caches[:keywords][self.id.to_sym]
  else
    kwterms = self.keywords.collect {|k| k.term}
  end
  kwterms.each {|k| content << k << ' '} if kwterms
  self.skus.each {|s| content << s.displayName << ' '}
  self.categories.each {|c| content << c.name << ' '}
  result[:content] = content
  return result
end
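One side note on building the content string in index_document: Ruby's String#<< treats an Integer argument as a character code rather than formatting it as digits, so a raw numeric id needs an explicit to_s before being appended:

```ruby
# String#<< with an Integer appends the character with that code,
# not the decimal digits; to_s is needed to append the digits.
s = ''
s << 65               # appends the character "A" (code 65)
s << ' ' << 65.to_s   # appends the digits "65"
puts s                # => "A 65"
```

Appending an Integer id directly would therefore put an unrelated character (or raise a range error on older Rubies for large ids) into the indexed content instead of the id itself.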