Forum: Ferret Direct access to similarity, idf, tf

Ricardo Panaggio (panaggio)
on 2009-08-24 21:47

I need access to similarity, tf, idf and some other content from the
same level.

I've been studying the code for three days now and I'm almost sure that
there's no way to do that "directly" from ruby code.

As I *really need* them, I was thinking about writing some ruby code
that wraps the relevant ferret C code to get access to tf, idf and so on.
Or maybe writing a wrapper for the lucene-java code, because it seems to
be easier. But I do think I could help ferret development a bit by
writing the wrapper for ferret's code.

So, I have some questions regarding these issues:

1) Is there an implemented way to get access to tf, idf, ...? (so am I

2) Which do you think is easier: writing a wrapper for lucene-java or
for ferret?

3) Would it be good for ferret to have the code I'm proposing?

Thanks in advance

Ricardo Panaggio
ps: sorry if my English is difficult to read =/
Charles Charles (charlesmartin14)
on 2009-09-23 21:14
I am very interested in this if you find a way to do it

Charles Charles (charlesmartin14)
on 2010-09-27 21:40
Here's a monkey patch that should do the trick:

class Ferret::Index::IndexReader

  # The post never shows where TFIDF_THRESH is defined; 0.0 is a
  # placeholder so the patch loads. Pick whatever cutoff suits your data.
  TFIDF_THRESH = 0.0 unless const_defined?(:TFIDF_THRESH)

  # yields doc_id, [[term, tfidf], ...] for every document
  #  doc_id starts at 0
  #  tfidf values <= thresh are dropped
  def each_tfidf_vec(field=:id, thresh=TFIDF_THRESH, &block)

    doc_freq = {} # [term] => doc_freq
    terms(field).each { |term, df| doc_freq[term] = df }
    num_terms = doc_freq.size

    (0...num_docs).each do |doc_id|
      tv_terms = term_vector(doc_id, field).terms
      tf_norm = tv_terms.size # number of distinct terms in this doc

      tfidf_vec = tv_terms.map do |tv_term|
        term = tv_term.text
        tf = tv_term.positions.size
        df = doc_freq[term]

        tfidf = (tf.to_f/tf_norm.to_f) * Math.log(num_docs.to_f/df.to_f)

        [term,tfidf] if tfidf > thresh
      end.compact # remove nil values (tfidf <= thresh)

      yield doc_id, tfidf_vec
    end
  end
end
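
For anyone who wants to check the numbers without an index at hand, here is a rough pure-Ruby sketch of the same weighting: tf divided by the number of distinct terms in the document, times the natural log of num_docs/df. The method name `tfidf_vectors` and the sample documents are made up for illustration and are not part of Ferret; documents are assumed to be plain arrays of already-tokenized strings.

```ruby
# Pure-Ruby sketch of the tf-idf weighting used in the monkey patch.
# `tfidf_vectors` is a hypothetical helper, not a Ferret method.
# docs: array of documents, each an array of term strings.
def tfidf_vectors(docs, thresh = 0.0)
  num_docs = docs.size

  # Document frequency: in how many documents each term occurs.
  doc_freq = Hash.new(0)
  docs.each { |tokens| tokens.uniq.each { |term| doc_freq[term] += 1 } }

  docs.map do |tokens|
    # Raw term frequency within this document.
    counts = Hash.new(0)
    tokens.each { |term| counts[term] += 1 }

    # Same normalisation as the patch: number of distinct terms.
    tf_norm = counts.size

    vec = counts.map do |term, tf|
      tfidf = (tf.to_f / tf_norm) * Math.log(num_docs.to_f / doc_freq[term])
      [term, tfidf] if tfidf > thresh
    end
    vec.compact # drop terms at or below the threshold
  end
end

docs = [
  %w[ferret index search],
  %w[ferret ruby ruby],
  %w[lucene java index]
]

tfidf_vectors(docs).each_with_index do |vec, doc_id|
  puts "#{doc_id}: #{vec.inspect}"
end
```

Note that a term occurring in every document gets log(1) = 0, so with the default threshold of 0.0 it is dropped from every vector.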
