Some documents not found

gtcaz · September 26, 2006, 4:05am

I’m a Ferret newbie, so hopefully I’m missing something simple.

I am using Ferret to index data about 36,000 products from a MySQL
database. The index has one document for each product, with these
important fields:

id: the id (unique) of the product record in the database

content: a concatenation of several bits of information from the product
and associated records

I have a few tools to manage my index. First, my bulk indexer creates a
new index from scratch with every product in the database. Second, my
updater can delete a single product from the index and re-insert it.
Finally, given an id, I can dump all the stored fields in the index for
the associated document (to help hunt down this problem, I started
storing the content field; normally it would not be stored).

If I run this query against a newly created index:

content:"blood pressure"

I get 4 hits. But I know there are more expected results. I can easily
find an example product that should be returned but is not. If I look
this product up in the index by id, and dump the results, I can see that
the content field has the correct data stored (and “blood pressure”
appears in this field more than once). But for some reason it isn’t
returned from the above query.
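
The check described above (scan the stored content for the phrase, then compare against what the search returned) can be sketched in plain Ruby. This is only an illustration: `missing_hits`, the `stored` hash of id to stored content, and the sample ids are hypothetical names, not part of Ferret or of the tools mentioned in this post.

```ruby
# Hypothetical helper: given the stored content per product id and the ids
# a search actually returned, list the ids whose content contains the phrase
# but which the search did not return.
def missing_hits(stored, phrase, hit_ids)
  # Match the words of the phrase in order, separated by any whitespace,
  # ignoring case.
  words = phrase.split.map { |w| Regexp.escape(w) }
  pattern = /\b#{words.join('\s+')}\b/i
  stored.select { |id, content| content =~ pattern && !hit_ids.include?(id) }.keys.sort
end

# Made-up sample data standing in for dumped index documents.
stored = {
  "p1" => "Digital Blood Pressure monitor",
  "p2" => "blood  pressure cuff",
  "p3" => "pressure washer"
}
p missing_hits(stored, "blood pressure", ["p1"])  # => ["p2"]
```

A list like this is a convenient starting point for a reduced test case, since it names exactly the documents that are stored correctly but not found.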

Oddly, if I then update this product (i.e., delete the document and
re-insert it using the update tool), my query suddenly begins including
this product, giving 5 hits instead of 4. I have been able to repeat this
for several “missing” products.

I have re-run my bulk indexer several times with identical results. The
bulk indexer is a little complex for the sake of performance, and I
suspected that it was somehow broken, so I modified it so that it simply
calls the updater code once for each product. This method was slower,
and the resulting index was clearly different. My same “blood pressure”
query now returned 14 hits instead of 4, but it was still missing many.
I was again able to make missing products start working by removing them
from the index and reinserting them.

If I dump one of these mysterious documents before and after updating,
according to diff they are identical. It is as though the product data
is stored in the index correctly, but the actual index (the index in
the index, if you will…) is borked in some way.

I have been able to reproduce these exact results in two configurations:

1: Ruby 1.8.4 / Ferret 10.6 / Mac OS X 10.4.7 on Intel
2: Ruby 1.8.4 / Ferret 10.8 / Debian Linux on Intel

In one case, after a full index the normal way, it seemed to return 5
results for my query instead of 4. It is possible I made a mistake, or
perhaps the exact number of results is semi-random per run of the
indexer.

If anybody can help me understand what is going on, I would be very
appreciative.

Thanks,

Geoff

PS: Here is some relevant code in case it helps. If you need more,
please ask, but this should be everything that matters. If necessary, I
can try to produce a simple test case that reproduces the problem…

— bulk indexer —

# create an empty index...
fi = Ferret::Index::FieldInfos.new(:term_vector => :no)
fi.add_field(:id, :index => :untokenized, :term_vector => :no, :store => :yes)
fi.add_field(:content, :index => :yes, :term_vector => :no, :store => :no)
fi.create_index("search-index-new")

# open it...
index = Ferret::Index::Index.new(:path => 'search-index-new',
                                 :analyzer => Ferret::Analysis::AsciiStandardAnalyzer.new)

# get the products...
start = Time.new
puts 'loading product data'

offset = 0
batch_size = 100
loop do
  prods = Vandelay::Product.find(:all, :limit => batch_size, :offset => offset,
                                 :include => [:descriptions, :categories, {:skus => :supplieritems}])
  offset += batch_size
  break if prods.size == 0
  populate_index(index, prods)
end

# optimize it...
puts 'optimizing index...'
index.optimize

index.close

# and finally copy it into place
FileUtils.remove_dir('search-index')
FileUtils.move('search-index-new', 'search-index')

— populate_index method —

def populate_index(index, products)
  # get the ids of every product for caching purposes...
  ids = products.collect {|p| p.id}

  # pre-cache all the keywords for the products
  kwcache = {}
  Vandelay::Keyword.find_by_sql(["select productId, term from product_keywords where productId in (?)", ids]).each {|kw|
    sym = kw.productId.to_sym
    kwcache[sym] = [] if !kwcache[sym]
    kwcache[sym] << kw.term
  }

  # pre-cache all the attribute values for the products
  attr_cache = {}
  Vandelay::ProductStringAttribute.find_by_sql(["select productId, name, value from product_stringattribute where productId in (?)", ids]).each {|a|
    sym = a.productId.to_sym
    attr_cache[sym] = [] if !attr_cache[sym]
    attr_cache[sym] << a
  }
  Vandelay::ProductBooleanAttribute.find_by_sql(["select productId, name, value from product_booleanattribute where productId in (?)", ids]).each {|a|
    sym = a.productId.to_sym
    attr_cache[sym] = [] if !attr_cache[sym]
    attr_cache[sym] << a
  }

  # now populate the index with data
  puts "indexing #{products.size} products..."
  products.each {|prod|
    index << prod.index_document(:keywords => kwcache, :attribute_values => attr_cache)
  }
end

— updater —

index = Ferret::Index::Index.new(:path => 'search-index',
                                 :analyzer => Ferret::Analysis::AsciiStandardAnalyzer.new)
index.delete(:id, product.id)
index << product.index_document
index.close

— Vandelay::Product::index_document method —

def index_document(caches = {})
  result = {}
  result[:id] = self.id
  result[:active] = self.isActive

  # add attributes
  if caches[:attribute_values] != nil
    build_attribute_cache(caches[:attribute_values][self.id.to_sym])
  end
  ALL_ATTRIBUTES.each { |sa|
    result["attr_#{sa.name}".to_sym] = self.attribute_value(sa)
  }

  # add content
  content = ''
  content << self.id << ' ' << self.name << ' '
  self.descriptions.each {|d| content << d.text << ' '}

  if caches[:keywords] != nil
    kwterms = caches[:keywords][self.id.to_sym]
  else
    kwterms = self.keywords.collect {|k| k.term}
  end
  kwterms.each {|k| content << k << ' '} if kwterms

  self.skus.each {|s| content << s.displayName << ' '}

  self.categories.each {|c| content << c.name << ' '}
  result[:content] = content
  return result
end

gtcaz · September 26, 2006, 9:47am

On 9/26/06, Geoff C. wrote:

PS: Here is some relevant code in case it helps. If you need more,
please ask, but this should be everything that matters. If necessary, I
can try to produce a simple test case that reproduces the problem…

Hi Geoff,

If you could produce a simple test case then that would be great. I’ll
try and find the problem but it can be difficult when I can’t
reproduce the problem here.

Cheers,
Dave

gtcaz · September 26, 2006, 6:11pm

David B. wrote:

If you could produce a simple test case then that would be great. I’ll
try and find the problem but it can be difficult when I can’t
reproduce the problem here.

I’m trying but not having much luck. Maybe someone can help me
understand something that might shed some light on the problem. I can
search for blood pressure in three different ways:

+content:"blood pressure"
This method returns a limited number of results (7 right now) and misses
lots of products that have the exact words “blood pressure” in the
content field. It includes one product that does not have the exact
phrase “blood pressure” but does have the word “pressure” and then,
several words later, the word “blood”.

+content:"pressure blood"
This method returns just 2 results, neither of which has “pressure
blood” in their content. Both have “blood pressure” though.

+content:"blood" +content:"pressure"
This method returns 99 results, which as far as I can tell is every
product with “blood pressure” in the content, plus a few that have both
“blood” and “pressure”

So what is the “right” way to search multi-term phrases like this? I
suspect all my oddness centers on my lack of understanding of how this
should work. Ideally, I’m looking for an exact match on the phrase,
and I was going to play with adding some slop if Ferret supports it.
Note that my content field is tokenized. Does the analyzer on the
Index::Index object matter when searching, or should I be preprocessing
my search phrase in some way?
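
For comparison, here is a minimal plain-Ruby sketch of the semantics these three query styles are generally expected to have on a tokenized field: a quoted phrase matches only when the terms sit at consecutive positions in that order, while two required terms match in any order. It is a toy positional index, not Ferret's implementation; `tokenize`, `build_index`, `and_query`, `phrase_query`, and the sample documents are all made up for illustration, and the slop handling is a deliberate simplification.

```ruby
# Toy positional index: term => { doc_id => [positions] }.
# Illustration only; this is not how Ferret works internally.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

def build_index(docs)
  index = Hash.new { |h, term| h[term] = Hash.new { |d, id| d[id] = [] } }
  docs.each do |id, text|
    tokenize(text).each_with_index { |term, pos| index[term][id] << pos }
  end
  index
end

# Boolean AND: ids of documents containing every term, in any order.
def and_query(index, terms)
  terms.map { |t| index.key?(t) ? index[t].keys : [] }.inject { |a, b| a & b } || []
end

# Phrase match: terms must appear in order at consecutive positions.
# With slop > 0, each term may sit up to `slop` positions later than its
# exact slot (a simplification of real sloppy-phrase matching).
def phrase_query(index, terms, slop = 0)
  and_query(index, terms).select do |id|
    index[terms.first][id].any? do |start|
      terms.each_with_index.all? do |term, offset|
        index[term][id].any? { |pos| pos >= start + offset && pos <= start + offset + slop }
      end
    end
  end
end

docs = {
  1 => "Digital blood pressure monitor",
  2 => "Pressure cuff for taking blood samples",
  3 => "Blood glucose meter"
}
index = build_index(docs)

p and_query(index, %w[blood pressure])        # => [1, 2]
p phrase_query(index, %w[blood pressure])     # => [1]
p phrase_query(index, %w[pressure blood])     # => []
p phrase_query(index, %w[pressure blood], 3)  # => [2]
```

Under these semantics, a phrase search for "pressure blood" should not return documents that only contain "blood pressure", and an exact phrase search should not match "pressure" followed several words later by "blood" unless slop is allowed.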

Thanks!

Geoff

gtcaz · September 26, 2006, 7:09pm

On 9/26/06, David B. wrote:

reproduce the problem here.

Never mind, I managed to reproduce the problem. It was a bug after all
and a fix will be released in a moment. I just need to swap to Windows
and compile a win32 gem.

Thanks for letting me know about this, Geoff.
Cheers,
Dave

gtcaz · September 26, 2006, 11:37pm

David B. wrote:

Never mind, I managed to reproduce the problem. It was a bug after all
and a fix will be released in a moment. I just need to swap to Windows
and compile a win32 gem.

You so completely rock. I just tested 10.9 and it works like a champ.

Thank you for all your hard work,

Geoff