Hello,

I need access to similarity, tf, idf and some other data at the same level. I've been studying the code for three days now, and I'm almost sure there's no way to get them directly from Ruby code.

As I *really need* them, I was thinking about writing some Ruby code around the relevant Ferret C code to get access to tf, idf and so on. Or maybe writing a wrapper for the lucene-java code, because that seems to be easier. But I think I could help Ferret development a bit by writing the wrapper for Ferret's code.

So, I have some questions regarding these issues:

1) Is there an existing way to get access to tf, idf, ...? (So am I wrong?)
2) Which do you think is easier: writing a wrapper for lucene-java or for Ferret?
3) Would it be good for Ferret to have the code I'm proposing?

Thanks in advance,
Ricardo Panaggio

PS: Sorry if my English is difficult to read =/
on 2009-08-24 21:47
on 2010-09-27 21:40
Here's a monkey patch that should do the trick:
class Ferret::Index::IndexReader
  TFIDF_THRESH = 0.0

  # yields doc_id, [ [term, tfidf], ... ] for each document
  # doc_id starts at 0
  # tfidf values <= thresh are dropped
  #
  def each_tfidf_vec(field=:id, thresh=TFIDF_THRESH, &block)
    doc_freq = {} # [term] => doc_freq
    terms(field).each { |term, df| doc_freq[term] = df }
    (0...num_docs).each do |doc_id|
      tv_terms = term_vector(doc_id, field).terms
      tf_norm = tv_terms.size
      tfidf_vec = tv_terms.map do |tv_term|
        term = tv_term.text
        tf = tv_term.positions.size
        df = doc_freq[term]
        tfidf = (tf.to_f/tf_norm.to_f) * Math.log(num_docs.to_f/df.to_f)
        [term, tfidf] if tfidf > thresh
      end
      # remove nil values (tfidf <= thresh)
      tfidf_vec.compact!
      yield doc_id, tfidf_vec
    end
  end
end
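
To see what the patch computes without needing an index, here is a minimal pure-Ruby sketch of the same weighting on in-memory token lists. The method name `tfidf_vectors` and the sample documents are hypothetical, made up for illustration. It also assumes that a term vector lists each distinct term once, so `tf_norm` is taken as the number of distinct terms in the document. It is not Ferret's own scoring, only the formula used in the patch above.

```ruby
# Pure-Ruby sketch of the weighting used in the patch (no Ferret required).
# `tfidf_vectors` is a hypothetical helper, not part of any library.
# docs is an array of token arrays, one per document.
def tfidf_vectors(docs, thresh = 0.0)
  num_docs = docs.size

  # Document frequency: in how many documents each term occurs.
  doc_freq = Hash.new(0)
  docs.each { |tokens| tokens.uniq.each { |term| doc_freq[term] += 1 } }

  docs.map do |tokens|
    # Term frequency: occurrences of each term in this document.
    counts = Hash.new(0)
    tokens.each { |term| counts[term] += 1 }

    # Assumption: normalize by the number of distinct terms, as the patch
    # does with tv_terms.size.
    tf_norm = counts.size

    vec = counts.map do |term, tf|
      tfidf = (tf.to_f / tf_norm) * Math.log(num_docs.to_f / doc_freq[term])
      [term, tfidf] if tfidf > thresh
    end
    vec.compact # drop entries with tfidf <= thresh
  end
end

docs = [%w[ferret ruby search], %w[ruby ruby lucene], %w[ferret index]]
tfidf_vectors(docs).each_with_index do |vec, doc_id|
  puts "#{doc_id}: #{vec.inspect}"
end
```

Note that a term occurring in every document has idf = log(1) = 0, so with the default threshold of 0.0 it is dropped from every vector.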