Forum: Ferret Comparing two documents in the index

Jeroen Bulters (Guest)
on 2006-05-26 19:38
I want to compare two documents in the index (i.e. retrieve the cosine
similarity/score between the term vectors of two documents). Is this
possible using the standard Ferret functionality?

Thanks in advance,

Jeroen Bulters
David Balmain (Guest)
on 2006-05-27 01:56
(Received via mailing list)
On 5/27/06, Jeroen Bulters <jeroenbulters@gmail.com> wrote:
> I want to compare two documents in the index (i.e. retrieve the cosine
> similarity/score between the term vectors of two documents). Is this
> possible using the standard Ferret functionality?

Hi Jeroen,

No problem. Make sure you store term vectors when you add the field.
That is:

    doc.add_field(:field, "yada yada yada",
                  Field::Store::NO,          # or YES
                  Field::Index::TOKENIZED,   # or UNTOKENIZED
                  Field::TermVector::YES)    # or anything else but NO

Then you can retrieve the term vector from an index reader like so:

    term_vector = index_reader.get_term_vector(doc_num, :field)
    terms = term_vector.terms # array of terms in :field in document
    freqs = term_vector.freqs # array of corresponding frequencies
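
Since `terms` and `freqs` are described above as parallel arrays, it may be convenient to combine them into a single hash before doing any arithmetic on them. The following is only a sketch in plain Ruby: the sample arrays are made up and stand in for what the term vector would return, and `to_term_freqs` is a hypothetical helper name, not part of Ferret.

```ruby
# Hypothetical helper (not a Ferret method): pair each term with its frequency.
def to_term_freqs(terms, freqs)
  Hash[terms.zip(freqs)]
end

# Made-up sample data standing in for term_vector.terms / term_vector.freqs
terms = ["ferret", "index", "ruby"]
freqs = [3, 1, 2]

term_freqs = to_term_freqs(terms, freqs)
puts term_freqs["ferret"]   # frequency of "ferret" in the sample document
```

A hash keyed by term makes it easy to look up the frequency of the same term in a second document, which is what a similarity calculation needs.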

Hope that helps. Is that enough to get you going?

Cheers,
Dave
Jeroen Bulters (Guest)
on 2006-05-27 15:09
David Balmain wrote:

>     doc.add_field(:field, "yada yada yada",
>                   Field::Store::NO,                # or YES
>                   Field::Index::TOKENIZED,   # or UNTOKENIZED
>                   Field::TermVector::YES)     # or anything else but NO

I got this far:
------ BEGIN CODE SNIPPET ------
# Assumes yaml and ferret are already loaded and `index` is an open index.
# Read weblog data
weblogs = YAML.load_file("weblogs.yml")

# Walk over weblogs and save all data.
print "--- Analyzing weblogs:\n"
weblogs.each do |weblog, id|
  content = ""
  print " * Indexing weblog #{weblog}/#{id} "
  # Load the appropriate file for parsing.
  weblogdata = YAML.load_file("./data/#{id}")

  # post_id is unused; it is named differently from the outer `id`
  # so the two cannot be confused.
  weblogdata[:posts].each do |post_id, post|
    # Clean up the content by removing all UBB blocks. This will cut out
    # some content. I consider this loss a plus :D
    content << "\n\n" << post[:text].gsub(/\[[^\]]+\][^\[]+\[[^\]]+\]/, "")
  end

  # Create a new document
  doc = Document.new
  doc.add_field(:id, weblog, Field::Store::YES, Field::Index::TOKENIZED,
                Field::TermVector::NO)
  doc.add_field(:content, content, Field::Store::NO,
                Field::Index::TOKENIZED, Field::TermVector::YES)

  # And add to the index.
  index << doc
  index.flush

  print "done.\n"
end
------ END CODE SNIPPET ------

I index about 23,000 weblogs, with the weblog id as the document id and
the content stored with term vectors. Now I want to compare two weblogs.
So what you suggest is that I retrieve the term vectors for both documents
and calculate the dot product of the two vectors myself? Or is there a
nice Ferret way to do this?

Thanks in advance,

Jeroen Bulters
David Balmain (Guest)
on 2006-05-27 16:57
(Received via mailing list)
On 5/27/06, Jeroen Bulters <jeroenbulters@gmail.com> wrote:
> I index about 23,000 weblogs, with the weblog id as the document id and
> the content stored with term vectors. Now I want to compare two weblogs.
> So what you suggest is that I retrieve the term vectors for both documents
> and calculate the dot product of the two vectors myself? Or is there a
> nice Ferret way to do this?

Until now I haven't really used the TermVectors, so this probably isn't
the best way to do it, but here goes (this is very rough):

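    # Cosine similarity between the term vectors of two documents.
    # :data is the field name used here; substitute the field that was
    # indexed with term vectors.
    def cosine_similarity(index_reader, doc1, doc2)
      tv1 = index_reader.get_term_vector(doc1, :data)
      terms1 = tv1.terms
      freqs1 = tv1.freqs
      # matrix maps term => [freq in doc1, freq in doc2]
      matrix = {}
      terms1.each_with_index {|term, i| matrix[term] = [freqs1[i], 0]}

      tv2 = index_reader.get_term_vector(doc2, :data)
      terms2 = tv2.terms
      freqs2 = tv2.freqs
      terms2.each_with_index {|term, i| (matrix[term] ||= [0, 0])[1] = freqs2[i]}

      dot_product = matrix.values.inject(0) {|dp, (a, b)| dp + a*b}
      lengths_product = Math.sqrt(freqs1.inject(0) {|sp, f| sp + f*f} *
                                  freqs2.inject(0) {|sp, f| sp + f*f})
      # Guard against empty term vectors (division by zero).
      return 0.0 if lengths_product.zero?
      dot_product / lengths_product
    end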
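
The arithmetic itself can be tried without an index. The sketch below is a plain-Ruby version that works on term => frequency hashes rather than on Ferret term vectors; the `cosine` name and the sample hashes are made up for illustration, and raw term frequencies are used as weights, which is a simplifying assumption (no idf or length normalisation).

```ruby
# Cosine similarity between two term => frequency hashes.
# Returns 0.0 when either vector is empty, avoiding a division by zero.
def cosine(freqs_a, freqs_b)
  dot = freqs_a.inject(0) { |sum, (term, f)| sum + f * freqs_b.fetch(term, 0) }
  norm_a = Math.sqrt(freqs_a.values.inject(0) { |sum, f| sum + f * f })
  norm_b = Math.sqrt(freqs_b.values.inject(0) { |sum, f| sum + f * f })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Made-up sample documents
a = { "ferret" => 2, "ruby" => 1 }
b = { "ferret" => 2, "ruby" => 1 }
c = { "lucene" => 4 }

puts cosine(a, b)  # identical vectors: very close to 1.0
puts cosine(a, c)  # no shared terms: 0.0
```

Only terms present in both documents contribute to the dot product, so iterating over one hash and looking up the other is enough.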

I'd be interested to hear how you go with this. If performance is poor
I can add something like this to the C code.

Hope this helps,
Dave
Jeroen Bulters (Guest)
on 2006-05-27 17:40
David Balmain wrote:
> Until now I haven't really used the TermVectors so this probably isn't
> the best way to do it but here goes (this is very rough);

I'm going to try this out now. I'll also try extracting all terms from
doc1's term vector and using them as a query on doc2 (using a
BooleanQuery). They use this kind of method in "Lucene in Action"
(somewhere around page 190, if I recall correctly).

Thanks for your quick responses; I'll let you know how things work out.

Cheers,

Jeroen Bulters
David Balmain (Guest)
on 2006-05-28 00:39
(Received via mailing list)
On 5/28/06, Jeroen Bulters <jeroenbulters@gmail.com> wrote:
> David Balmain wrote:
> > Until now I haven't really used the TermVectors so this probably isn't
> > the best way to do it but here goes (this is very rough);
>
> I'm going to try this out now. I'll also try extracting all terms from
> doc1's term vector and using them as a query on doc2 (using a
> BooleanQuery). They use this kind of method in "Lucene in Action"
> (somewhere around page 190, if I recall correctly).

If it's a "More Like This" query that you are trying to write, I
recommend you look at the Lucene code here;

    http://svn.apache.org/viewvc/lucene/java/branches/...

It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
the near future.

Cheers,
Dave
Jeroen Bulters (Guest)
on 2006-05-28 14:37
Yes, it is a More Like This query, but I only want the relevance score
for document B given document A as the query (so weblog:B AND
all_terms_from_A).
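
As a rough illustration of that "weblog:B AND all_terms_from_A" idea, the sketch below builds such a query string in plain Ruby. Everything in it is hypothetical: the `similarity_query` helper, the sample data, the cut-off of the most frequent terms, and the assumption that the query parser accepts `field:value` with `AND`/`OR` and parentheses. Terms are not escaped, so special characters would need handling before real use.

```ruby
# Hypothetical helper: build a query string that restricts results to one
# document and ORs together the most frequent terms of another document.
# Assumes a query syntax with field:value, AND, OR and parentheses.
def similarity_query(target_id, term_freqs, id_field = "id", max_terms = 25)
  top_terms = term_freqs.sort_by { |term, freq| [-freq, term] }
                        .first(max_terms)
                        .map { |term, _freq| term }
  "#{id_field}:#{target_id} AND (#{top_terms.join(' OR ')})"
end

# Made-up term frequencies for document A
terms_from_a = { "ferret" => 3, "index" => 1, "ruby" => 2 }
puts similarity_query("B", terms_from_a, "weblog", 2)
# prints: weblog:B AND (ferret OR ruby)
```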

I'll look into it; thesis is due in 4 weeks so I've got loads of time :D

Cheers,

Jeroen Bulters
Jens Kraemer (Guest)
on 2006-05-29 09:34
(Received via mailing list)
On Sun, May 28, 2006 at 07:36:25AM +0900, David Balmain wrote:
> If it's a "More Like This" query that you are trying to write, I
> recommend you look at the Lucene code here;
>
> http://svn.apache.org/viewvc/lucene/java/branches/...

Or check out the port of this that lives in acts_as_ferret :-)

http://projects.jkraemer.net/acts_as_ferret/browse...
from line 525 to around line 720.


> It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> the near future.

Dave, that's a nice idea. Should I try to prepare a patch for this based
on what I did in acts_as_ferret? It would be Ruby-only, though. But as the
whole More Like This thing is more or less about building a BooleanQuery,
I think speed is no issue here.
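
To make the "building a BooleanQuery" step a little more concrete: before the query is assembled, a More Like This implementation has to decide which terms of the source document are worth including. The sketch below ranks terms by tf * idf in plain Ruby. The `interesting_terms` helper, the sample data, and the idf formula `1 + ln(N / (df + 1))` are assumptions made for illustration, not code taken from acts_as_ferret or Ferret.

```ruby
# Hypothetical helper: rank a document's terms by tf * idf and keep the best.
# doc_freqs maps term => number of documents containing it (made-up data here).
def interesting_terms(term_freqs, doc_freqs, num_docs, max_terms = 25)
  scored = term_freqs.map do |term, tf|
    df = doc_freqs.fetch(term, 0)
    idf = 1.0 + Math.log(num_docs.to_f / (df + 1))  # assumed idf weighting
    [term, tf * idf]
  end
  scored.sort_by { |term, score| [-score, term] }
        .first(max_terms)
        .map { |term, _score| term }
end

# Made-up sample: "the" is frequent in the document but appears nearly everywhere.
term_freqs = { "the" => 5, "ferret" => 3, "index" => 2 }
doc_freqs  = { "the" => 999, "ferret" => 9, "index" => 99 }
puts interesting_terms(term_freqs, doc_freqs, 1000, 2).inspect
# prints: ["ferret", "index"]
```

The selected terms would then become the optional clauses of the query, which is cheap compared with running the search itself.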

Jens


--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer@webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
David Balmain (Guest)
on 2006-05-29 09:56
(Received via mailing list)
On 5/29/06, Jens Kraemer <kraemer@webit.de> wrote:
><snip>
>
> On Sun, May 28, 2006 at 07:36:25AM +0900, David Balmain wrote:
> > It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> > the near future.
>
> Dave, that's a nice idea. Should I try to prepare a patch for this based
> on what I did in acts_as_ferret? It would be Ruby-only, though. But as the
> whole More Like This thing is more or less about building a BooleanQuery,
> I think speed is no issue here.

Hi Jens,

That'd be great but not just yet. I may be making a few adjustments to
the API in the coming week. I'll be sure to discuss possible changes
with you guys when the time comes.

Gotta run. Cheers,
Dave