Direct access to similarity, idf, tf


#1

Hello,

I need access to similarity, tf, idf and some other content from the
same level.

I’ve been studying the code for three days now and I’m almost sure that
there’s no way to do that “directly” from ruby code.

As I really need them, I was thinking about writing some ruby code
around the relevant ferret c code to get access to tf, idf and so on. Or
maybe writing a wrapper for the lucene-java code, because it seems to be
easier. But I do think I could help ferret development a bit by writing
the wrapper for ferret’s code.

So, I have some questions regarding these issues:

  1. Is there an implemented way to get access to tf, idf, …? (so am I
    wrong?)

  2. Which do you think is easier: writing a wrapper for lucene-java or
    for ferret?

  3. Will it be good for ferret to have this code I’m proposing written?

Thanks in advance

Ricardo Panaggio
ps: sorry if my English is difficult to read =/


#2

I am very interested in this if you find a way to do it

Charles


#3

Here’s a monkey patch that should do the trick:

require 'ferret'

class Ferret::Index::IndexReader

  TFIDF_THRESH = 0.0

  # Yields doc_id and [[term, tfidf], ...] for each document.
  # doc_id starts at 0.
  # Entries with tfidf <= thresh are dropped.
  def each_tfidf_vec(field=:id, thresh=TFIDF_THRESH, &block)

    doc_freq = {} # [term] => doc_freq
    terms(field).each { |term, df| doc_freq[term] = df }

    (0...num_docs).each do |doc_id|
      tv_terms = term_vector(doc_id, field).terms
      tf_norm = tv_terms.size # number of distinct terms in this document

      tfidf_vec = tv_terms.map do |tv_term|
        term = tv_term.text
        tf = tv_term.positions.size
        df = doc_freq[term]

        tfidf = (tf.to_f/tf_norm.to_f) * Math.log(num_docs.to_f/df.to_f)

        [term,tfidf] if tfidf > thresh
      end

      # remove nil values (tfidf <= thresh)
      tfidf_vec.compact!

      yield doc_id, tfidf_vec
    end

  end

end