Comparing two documents in the index


#1

I want to compare two documents in the index (i.e. retrieve the cosine
similarity/score between the two documents’ term vectors). Is this possible
using the standard Ferret functionality?

Thanks in advance,

Jeroen B.


#2

On 5/27/06, Jeroen B. removed_email_address@domain.invalid wrote:

I want to compare two documents in the index (i.e. retrieve the cosine
similarity/score between the two documents’ term vectors). Is this possible
using the standard Ferret functionality?

Hi Jeroen,

No problem. Make sure you store term-vectors when you add the field.
That is;

doc.add_field(:field, "yada yada yada",
              Field::Store::NO,                # or YES
              Field::Index::TOKENIZED,   # or UNTOKENIZED
              Field::TermVector::YES)     # or anything else but NO

Then you can retrieve the term vector from an index reader like so;

term_vector = index_reader.get_term_vector(doc_num, :field)
terms = term_vector.terms # array of terms in :field in document
freqs = term_vector.freqs # array of corresponding frequencies
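
A minimal sketch of how the two parallel arrays can be combined, assuming only that terms and freqs line up index by index as described above. StubTermVector and term_frequencies are hypothetical names used so the snippet runs without an index; they are not part of Ferret.

```ruby
# Hypothetical stand-in for the object returned by get_term_vector;
# only the two accessors mentioned above are modelled.
StubTermVector = Struct.new(:terms, :freqs)

# Pair each term with its frequency, giving a term => frequency hash.
def term_frequencies(term_vector)
  Hash[term_vector.terms.zip(term_vector.freqs)]
end

tv = StubTermVector.new(%w[ferret index search], [3, 1, 2])
counts = term_frequencies(tv)
puts counts.inspect
```

A hash like this makes it easy to look up the frequency of a single term when comparing two documents.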

Hope that helps. Is that enough to get you going?

Cheers,
Dave


#3

David B. wrote:

doc.add_field(:field, "yada yada yada",
              Field::Store::NO,                # or YES
              Field::Index::TOKENIZED,   # or UNTOKENIZED
              Field::TermVector::YES)     # or anything else but NO

I got this far:
------ BEGIN CODE SNIPPET ------
# Read weblog data
weblogs = YAML::load(File.open("weblogs.yml"))

# Walk over weblogs and save all data.
print "--- Analyzing weblogs:\n"
weblogs.each do |weblog, id|
  content = ""
  print " * Indexing weblog #{weblog}/#{id} "

  # Load the appropriate file for parsing.
  weblogdata = YAML::load(File.open("./data/#{id}"))

  # post_id is named differently from the outer id so it cannot clobber it
  weblogdata[:posts].each do |post_id, post|
    # Clean up content by removing all UBB blocks. This will cut out some
    # content. I consider this loss a plus :)
    content = content + "\n\n" +
              post[:text].gsub(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
    #content.gsub!(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
  end

  # Create a new document
  doc = Document.new
  doc.add_field(:id, weblog, Field::Store::YES, Field::Index::TOKENIZED,
                Field::TermVector::NO)
  doc.add_field(:content, content, Field::Store::NO,
                Field::Index::TOKENIZED, Field::TermVector::YES)

  # And add to the index.
  index << doc
  index.flush

  print "done.\n"
end
------ END CODE SNIPPET ------

I index about 23,000 weblogs, with the weblog id as the document id and
the content stored with term vectors. Now I want to compare two weblogs. So
what you suggest is that I retrieve the term vectors for both documents and
calculate the dot product of the two vectors myself? Or is there a nice
Ferret way to do this?

Thanks in advance,

Jeroen B.


#4

On 5/27/06, Jeroen B. removed_email_address@domain.invalid wrote:

I index about 23,000 weblogs, with the weblog id as the document id and
the content stored with term vectors. Now I want to compare two weblogs. So
what you suggest is that I retrieve the term vectors for both documents and
calculate the dot product of the two vectors myself? Or is there a nice
Ferret way to do this?

Until now I haven’t really used the TermVectors so this probably isn’t
the best way to do it but here goes (this is very rough);

def cosine_similarity(index_reader, doc1, doc2)
  # matrix maps each term to [frequency in doc1, frequency in doc2]
  tv1 = index_reader.get_term_vector(doc1, :data)
  terms1 = tv1.terms
  freqs1 = tv1.freqs
  matrix = {}
  terms1.size.times {|i| matrix[terms1[i]] = [freqs1[i], 0]}

  tv2 = index_reader.get_term_vector(doc2, :data)
  terms2 = tv2.terms
  freqs2 = tv2.freqs
  terms2.size.times {|i| (matrix[terms2[i]] ||= [0])[1] = freqs2[i]}

  # terms that occur in only one document contribute 0 to the dot product
  dot_product = matrix.values.inject(0) {|dp, (a,b)| dp + a*b}
  lengths_product = Math.sqrt(freqs1.inject(0) {|sp, f| sp + f*f} *
                              freqs2.inject(0) {|sp, f| sp + f*f})
  dot_product / lengths_product
end

I’d be interested to hear how you go with this. If performance is poor
I can add something like this to the C code.
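
For trying the calculation without an index, here is a minimal sketch of the same cosine formula on plain term => frequency hashes. The name cosine_from_counts and the sample data are hypothetical, and it uses raw term frequencies only (no idf weighting), so the numbers will not match Ferret's own scores.

```ruby
# Cosine similarity between two term => frequency hashes.
# Returns 0.0 when either vector is empty or all zeros, so the
# division never produces NaN.
def cosine_from_counts(counts1, counts2)
  # Only terms present in both hashes add to the dot product.
  dot = counts1.inject(0) { |sum, (term, f)| sum + f * counts2.fetch(term, 0) }
  norm1 = Math.sqrt(counts1.values.inject(0) { |sum, f| sum + f * f })
  norm2 = Math.sqrt(counts2.values.inject(0) { |sum, f| sum + f * f })
  return 0.0 if norm1.zero? || norm2.zero?
  dot / (norm1 * norm2)
end

a = { "ferret" => 2, "index" => 1 }
b = { "ferret" => 2, "index" => 1 }
c = { "lucene" => 4 }
puts cosine_from_counts(a, b)  # identical vectors, so very close to 1
puts cosine_from_counts(a, c)  # no shared terms, so 0.0
```

The explicit zero check is the one behavioural difference from the method above, which would divide by zero for a document with an empty term vector.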

Hope this helps,
Dave


#5

David B. wrote:

Until now I haven’t really used the TermVectors so this probably isn’t
the best way to do it but here goes (this is very rough);

I’m going to try this out now. I’ll also try extracting all terms from
doc1’s term vector and using them as a query on doc2 (using a BooleanQuery).
They use this kind of method in “Lucene in Action” (somewhere around page 190
if I recall correctly).

Thanks for your quick responses; I’ll let you know how things work out.

Cheers,

Jeroen B.


#6

Yes, it is a “More Like This” query, but I only want the relevance score
for document B given document A as the query (so weblog:B AND
all_terms_from_A).
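
A rough sketch of how such a query string could be put together from a term vector's terms and freqs arrays. The helper name similarity_query, the max_terms cut-off and the Lucene-style field:term syntax are assumptions for illustration and have not been checked against Ferret's query parser; the field names id and content follow the indexing snippet earlier in the thread.

```ruby
# Build a Lucene-style query string that restricts the search to one
# document (target_id) and ORs together the most frequent terms of
# another. Assumes terms is non-empty and lines up with freqs.
def similarity_query(target_id, terms, freqs, max_terms = 25)
  top_terms = terms.zip(freqs)
                   .sort_by { |_term, freq| -freq }   # most frequent first
                   .first(max_terms)
                   .map { |term, _freq| "content:#{term}" }
  "id:#{target_id} AND (#{top_terms.join(' OR ')})"
end

puts similarity_query("B", %w[ferret index search], [3, 1, 2], 2)
# prints: id:B AND (content:ferret OR content:search)
```

Capping the number of terms keeps the resulting BooleanQuery small; with 23,000 weblogs the full term list of one document could otherwise be very long.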

I’ll look into it; thesis is due in 4 weeks so I’ve got loads of time :smiley:

Cheers,

Jeroen B.


#7

On Sun, May 28, 2006 at 07:36:25AM +0900, David B. wrote:

If it’s a “More Like This” query that you are trying to write, I
recommend you look at the Lucene code here;

http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup

or check out the port of this that lives in acts_as_ferret :slight_smile:

http://projects.jkraemer.net/acts_as_ferret/browser/trunk/plugin/acts_as_ferret/lib/acts_as_ferret.rb
from line 525 to around line 720.

It’s part of Lucene 2.0 now. I’ll be adding MoreLikeThis Queries in
the near future.

Dave, that’s a nice idea. Should I try to prepare a patch for this based
on what I did in acts_as_ferret? It would be Ruby-only, though. But as the
whole More Like This thing is more or less about building a BooleanQuery,
I think speed is no issue here.

Jens




#8

On 5/29/06, Jens K. removed_email_address@domain.invalid wrote:

On Sun, May 28, 2006 at 07:36:25AM +0900, David B. wrote:

It’s part of Lucene 2.0 now. I’ll be adding MoreLikeThis Queries in
the near future.

Dave, that’s a nice idea. Should I try to prepare a patch for this based
on what I did in acts_as_ferret? It would be Ruby-only, though. But as the
whole More Like This thing is more or less about building a BooleanQuery,
I think speed is no issue here.

Hi Jens,

That’d be great but not just yet. I may be making a few adjustments to
the API in the coming week. I’ll be sure to discuss possible changes
with you guys when the time comes.

Gotta run. Cheers,
Dave


#9

On 5/28/06, Jeroen B. removed_email_address@domain.invalid wrote:

David B. wrote:

Until now I haven’t really used the TermVectors so this probably isn’t
the best way to do it but here goes (this is very rough);

I’m going to try this out now. I’ll also try extracting all terms from
doc1’s term vector and using them as a query on doc2 (using a BooleanQuery).
They use this kind of method in “Lucene in Action” (somewhere around page 190
if I recall correctly).

If it’s a “More Like This” query that you are trying to write, I
recommend you look at the Lucene code here;

http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup

It’s part of Lucene 2.0 now. I’ll be adding MoreLikeThis Queries in
the near future.

Cheers,
Dave