Forum: Ferret Direct access to similarity, idf, tf

Ricardo Panaggio (panaggio)
on 2009-08-24 21:47
Hello,

I need access to similarity, tf, idf and some other values at the same
level.

I've been studying the code for three days now and I'm almost sure that
there's no way to do that "directly" from ruby code.

As I *really need* them, I am thinking about writing some Ruby code
around the relevant Ferret C code to get access to tf, idf and so on. Or
maybe writing a wrapper for the lucene-java code, because it seems to be
easier. But I do think I could help Ferret development a bit by writing
the wrapper for Ferret's code.

So, I have some questions regarding these issues:

1) Is there already a way to get access to tf, idf, ...? (That is, am I
wrong?)

2) Which do you think is easier: writing a wrapper for lucene-java or
for Ferret?

3) Would it be good for Ferret to have the code I'm proposing?

Thanks in advance

Ricardo Panaggio
ps: sorry if my English is difficult to read =/
Charles Charles (charlesmartin14)
on 2009-09-23 21:14
I am very interested in this, if you find a way to do it.

Charles
Charles Charles (charlesmartin14)
on 2010-09-27 21:40
Here's a monkey patch that should do the trick:

class Ferret::Index::IndexReader

  TFIDF_THRESH = 0.0

  # yields doc_id, [[term, tfidf], ...] for each document
  #  doc_id starts at 0
  #  drops terms whose tfidf is <= thresh
  #
  def each_tfidf_vec(field=:id, thresh=TFIDF_THRESH, &block)

    doc_freq = {} # [term] => doc_freq
    terms(field).each { |term, df| doc_freq[term] = df }

    (0...num_docs).each do |doc_id|
      tv_terms = term_vector(doc_id, field).terms
      tf_norm = tv_terms.size

      tfidf_vec = tv_terms.map do |tv_term|
        term = tv_term.text
        tf = tv_term.positions.size
        df = doc_freq[term]

        tfidf = (tf.to_f/tf_norm.to_f) * Math.log(num_docs.to_f/df.to_f)

        [term,tfidf] if tfidf > thresh
      end

      # remove nil values (tfidf <= thresh)
      tfidf_vec.compact!

      yield doc_id, tfidf_vec
    end
  end
end