I want to compare two documents in the index (i.e. retrieve the cosine
similarity/score between two documents’ term vectors). Is this possible
using the standard Ferret functionality?
Thanks in advance,
Jeroen B.
On 5/27/06, Jeroen B. wrote:
I want to compare two documents in the index (i.e. retrieve the cosine
similarity/score between two documents’ term vectors). Is this possible
using the standard Ferret functionality?
Hi Jeroen,
No problem. Make sure you store term vectors when you add the field.
That is:

    doc.add_field(:field, "yada yada yada",
                  Field::Store::NO,         # or YES
                  Field::Index::TOKENIZED,  # or UNTOKENIZED
                  Field::TermVector::YES)   # or anything else but NO

Then you can retrieve the term vector from an index reader like so:

    term_vector = index_reader.get_term_vector(doc_num, :field)
    terms = term_vector.terms  # array of terms in :field in document
    freqs = term_vector.freqs  # array of corresponding frequencies

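Since terms and freqs are parallel arrays, it can be handy to pair them up into a hash before doing any arithmetic on them. A minimal sketch, using hard-coded stand-in arrays rather than a real index (the sample values and the helper name term_frequencies are made up for illustration and are not part of Ferret):

```ruby
# Stand-ins for term_vector.terms and term_vector.freqs
# (illustrative values, not read from a real index).
terms = ["ferret", "index", "yada"]
freqs = [2, 1, 3]

# Pair each term with its frequency.
# term_frequencies is a hypothetical helper, not a Ferret method.
def term_frequencies(terms, freqs)
  result = {}
  terms.each_with_index { |term, i| result[term] = freqs[i] }
  result
end

tf = term_frequencies(terms, freqs)
puts tf["yada"]  # prints 3
```
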
Hope that helps. Is that enough to get you going?
Cheers,
Dave
David B. wrote:

    doc.add_field(:field, "yada yada yada",
                  Field::Store::NO,         # or YES
                  Field::Index::TOKENIZED,  # or UNTOKENIZED
                  Field::TermVector::YES)   # or anything else but NO

I got this far:
------ BEGIN CODE SNIPPET ------
require 'yaml'

weblogs = YAML.load_file("weblogs.yml")
print "--- Analyzing weblogs:\n"
weblogs.each do |weblog, id|
  content = ""
  print " * Indexing weblog #{weblog}/#{id} "
  weblogdata = YAML.load_file("./data/#{id}")
  weblogdata[:posts].each do |post_id, post|
    # Clean up the content by removing all UBB blocks. This will cut out
    # some content. I consider this loss a plus.
    content = content + "\n\n" +
              post[:text].gsub(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
    #content.gsub!(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
  end
  doc = Document.new
  doc.add_field(:id, weblog, Field::Store::YES, Field::Index::TOKENIZED,
                Field::TermVector::NO)
  doc.add_field(:content, content, Field::Store::NO,
                Field::Index::TOKENIZED, Field::TermVector::YES)
  index << doc
  index.flush
  print "done.\n"
end
------ END CODE SNIPPET ------
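To see what the UBB clean-up does on its own, here is a small stand-alone sketch. The pattern is assumed to be an opening [tag], the text up to the next "[", and a closing [tag]; the constant UBB_BLOCK, the helper strip_ubb and the sample strings are made up for illustration.

```ruby
# Assumed UBB pattern: an opening [tag], the text up to the next "[",
# and a closing [tag]. Both the tags and the text between them are removed.
UBB_BLOCK = /\[[^\]]+\][^\[]+\[[^\]]+\]/

# Hypothetical helper wrapping the gsub call.
def strip_ubb(text)
  text.gsub(UBB_BLOCK, "")
end

puts strip_ubb("Hello [b]bold[/b] world")  # prints "Hello  world" (two spaces)
```

Nested tags are only partly handled: for "[quote][b]x[/b][/quote]" the inner block is removed and "[quote][/quote]" is left behind.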
I index about 23000 weblogs, with the weblog id as the document id and
term vectors stored for the content. Now I want to compare two weblogs. So
what you suggest is that I retrieve the term vectors for both documents and
calculate the dot product of the two vectors myself? Or is there a nice
Ferret way to do this?
Thanks in advance,
Jeroen B.
On 5/27/06, Jeroen B. wrote:
I index about 23000 weblogs with their weblog id as the document id and
the content by termvector. Now I want to compare two weblogs. So what
you suggest is that I retrieve the term-vectors for both documents and
calculate the dot product of the two vectors myself; or is there a nice
Ferret-way to do this?
Until now I haven’t really used the term vectors, so this probably isn’t
the best way to do it, but here goes (this is very rough):

    def cosine_similarity(index_reader, doc1, doc2)
      # :data is the field name; use the field that was indexed with
      # term vectors (:content in the indexing snippet above)
      tv1 = index_reader.get_term_vector(doc1, :data)
      terms1 = tv1.terms
      freqs1 = tv1.freqs
      # matrix maps each term to [frequency in doc1, frequency in doc2]
      matrix = {}
      terms1.size.times {|i| matrix[terms1[i]] = [freqs1[i], 0]}
      tv2 = index_reader.get_term_vector(doc2, :data)
      terms2 = tv2.terms
      freqs2 = tv2.freqs
      # terms that only occur in doc2 get a frequency of 0 for doc1
      terms2.size.times {|i| (matrix[terms2[i]] ||= [0])[1] = freqs2[i]}
      dot_product = matrix.values.inject(0) {|dp, (a, b)| dp + a * b}
      lengths_product = Math.sqrt(freqs1.inject(0) {|sp, f| sp + f * f} *
                                  freqs2.inject(0) {|sp, f| sp + f * f})
      dot_product / lengths_product
    end

I’d be interested to hear how you go with this. If performance is poor
I can add something like this to the C code.
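For trying out the arithmetic without an index at hand, the same calculation can be sketched over plain term => frequency hashes. This is only a stand-alone illustration, not Ferret API: the method name cosine and the sample hashes are made up, and it uses raw term frequencies with no tf-idf weighting.

```ruby
# Cosine similarity between two term => frequency hashes.
# Returns 0.0 when either document has no terms.
def cosine(tf_a, tf_b)
  # Only terms present in both hashes contribute to the dot product.
  dot = tf_a.sum { |term, freq| freq * tf_b.fetch(term, 0) }
  norm_a = Math.sqrt(tf_a.values.sum { |f| f * f })
  norm_b = Math.sqrt(tf_b.values.sum { |f| f * f })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

doc_a = { "ruby" => 1, "ferret" => 1 }
doc_b = { "ruby" => 1, "lucene" => 1 }
puts cosine(doc_a, doc_b)  # roughly 0.5: one shared term out of two each
```

Documents with no terms in common score 0.0, and a document compared with itself scores (up to rounding) 1.0.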
Hope this helps,
Dave
David B. wrote:
Until now I haven’t really used the TermVectors so this probably isn’t
the best way to do it but here goes (this is very rough);
I’m going to try this out now. I’ll also try extracting all the terms from
doc1’s term vector and using them as a query against doc2 (using a
BooleanQuery). “Lucene in Action” uses this kind of method (somewhere
around page 190, if I recall correctly).
Thanks for your quick responses; I’ll let you know how things work out.
Cheers,
Jeroen B.
Yes, it is a “More Like This” query, but I only want the relevance score
for document B given document A as the query (so weblog:B AND
all_terms_from_A).
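Roughly, such a query could be put together as a string like this. It is only a sketch: the helper similarity_query is made up, the field names id and content are taken from the indexing snippet earlier in the thread, and it assumes the query parser accepts Lucene-style field:term clauses combined with AND, OR and parentheses. Terms containing special characters would need escaping, which is left out here.

```ruby
# Hypothetical helper: restrict the search to one target document and OR
# together the terms taken from the other document's term vector.
def similarity_query(target_id, terms, id_field = "id", content_field = "content")
  clauses = terms.map { |term| "#{content_field}:#{term}" }
  "#{id_field}:#{target_id} AND (#{clauses.join(' OR ')})"
end

puts similarity_query("B", ["ferret", "ruby"])
# prints: id:B AND (content:ferret OR content:ruby)
```

The score of the single hit for B would then be the relevance of B given A's terms as the query.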
I’ll look into it; thesis is due in 4 weeks so I’ve got loads of time
Cheers,
Jeroen B.
On Sun, May 28, 2006 at 07:36:25AM +0900, David B. wrote:
If it’s a “More Like This” query that you are trying to write, I
recommend you look at the Lucene code here:
http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup
Or you can check out the port of this that lives in acts_as_ferret:
http://projects.jkraemer.net/acts_as_ferret/browser/trunk/plugin/acts_as_ferret/lib/acts_as_ferret.rb
(from line 525 to around line 720).
It’s part of Lucene 2.0 now. I’ll be adding MoreLikeThis Queries in
the near future.
Dave, that’s a nice idea. Should I try to prepare a patch for this based
on what I did in acts_as_ferret? It would be Ruby-only, though. But as the
whole “more like this” thing is more or less about building a BooleanQuery,
I think speed is no issue here.
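To give an idea of the term-selection step that comes before building the BooleanQuery, here is a rough stand-alone sketch that ranks a document's terms by tf * idf and keeps the best few. It is not the acts_as_ferret or Lucene code: the helper interesting_terms, the idf formula and the sample numbers are all assumptions chosen for illustration.

```ruby
# Hypothetical helper: rank a document's terms by tf * idf and keep the
# top max_terms as candidates for the query.
#   term_freqs: term => frequency in the source document
#   doc_freqs:  term => number of documents containing the term
def interesting_terms(term_freqs, doc_freqs, num_docs, max_terms = 3)
  scored = term_freqs.map do |term, tf|
    df = doc_freqs.fetch(term, 1)
    idf = Math.log(num_docs.to_f / df) + 1.0  # illustrative idf variant
    [term, tf * idf]
  end
  # Highest score first; ties are broken alphabetically.
  scored.sort_by { |term, score| [-score, term] }.first(max_terms).map(&:first)
end

term_freqs = { "the" => 10, "ferret" => 3, "cosine" => 2, "blog" => 1 }
doc_freqs  = { "the" => 1000, "ferret" => 10, "cosine" => 5, "blog" => 500 }
p interesting_terms(term_freqs, doc_freqs, 1000)
# prints ["ferret", "cosine", "the"]
```

Each selected term would then become one optional clause of the BooleanQuery.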
Jens
–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66
On 5/29/06, Jens K. wrote:
On Sun, May 28, 2006 at 07:36:25AM +0900, David B. wrote:
It’s part of Lucene 2.0 now. I’ll be adding MoreLikeThis Queries in
the near future.
Dave, that’s a nice idea. Should I try to prepare a patch for this based
on what I did in acts_as_ferret? Would be ruby-only, though. But as the
whole more like this thing more or less is about building a BooleanQuery,
I think speed is no issue here.
Hi Jens,
That’d be great but not just yet. I may be making a few adjustments to
the API in the coming week. I’ll be sure to discuss possible changes
with you guys when the time comes.
Gotta run. Cheers,
Dave
On 5/28/06, Jeroen B. wrote:
David B. wrote:
Until now I haven’t really used the TermVectors so this probably isn’t
the best way to do it but here goes (this is very rough);
I’m going to try this out now. I’ll also try extracting all term vectors
from doc1 and using them as a query on doc2 (using a BooleanQuery). They
use this kind of method in “Lucene in Action” (somewhere around page 190
if I recall correctly).
If it’s a “More Like This” query that you are trying to write, I
recommend you look at the Lucene code here;
http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup
It’s part of Lucene 2.0 now. I’ll be adding MoreLikeThis Queries in
the near future.
Cheers,
Dave