Location of match?

johnnybutler7 · April 12, 2006, 9:39pm

Is it possible with Ferret to find the location of the matches in a
document? For example imagine I have 100 documents and I search with
the phrase “bob~0.5” and that returns 3 matching documents. How can I
then find all locations in a specific document where it matched
“bob~0.5”? What I need is something like an array that contains the
start index and length for each match within a given document. Does
this exist? Should I break up my matching document into subdocuments
and then search on those?

Also, for my application I will be searching for fairly large pieces
of text (many sentences long) and doing fuzzy matching. I suppose
what I am doing is very similar to trying to find matching phrases
within an essay to catch people plagiarizing (that’s not what I’m
doing at all, but it’s close enough in terms of methods).

Are both of these possible with Ferret? Is there another technology I
should look at for doing this? I will have a relatively small index
size (somewhere between 100 and 500), so I’m not really
concerned with speed issues.

Thanks so much for any help!

-John

johnnybutler7 · April 18, 2006, 4:54am

Hi John,

On 4/13/06, John B. wrote:

Is it possible with Ferret to find the location of the matches in a
document? For example imagine I have 100 documents and I search with
the phrase “bob~0.5” and that returns 3 matching documents. How can I
then find all locations in a specific document where it matched
“bob~0.5”? What I need is something like an array that contains the
start index and length for each match within a given document. Does
this exist? Should I break up my matching document into subdocuments
and then search on those?

A search result highlighter is coming in a future version of Ferret.
This will enable you to find the position of the match in a document.
I can’t say when.
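
Until a highlighter is available, one workaround is to scan the text of a matching document yourself. The following is only a minimal sketch in plain Ruby, not Ferret's API: the helper names `edit_distance` and `fuzzy_match_locations` are hypothetical, and the similarity measure (1 minus edit distance divided by the longer word's length) is an assumption that may not agree with how Ferret scores a query such as “bob~0.5”.

```ruby
# Levenshtein edit distance between two strings (two-row dynamic programming).
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Returns [[start_index, length], ...] for every word in `text` whose
# similarity to `term` is at least `min_similarity`.
# Assumption: similarity = 1 - distance / length of the longer word.
def fuzzy_match_locations(text, term, min_similarity = 0.5)
  locations = []
  needle = term.downcase
  text.scan(/\w+/) do |word|
    start = Regexp.last_match.begin(0) # character offset of this word
    longest = [word.length, needle.length].max
    similarity = 1.0 - edit_distance(word.downcase, needle).to_f / longest
    locations << [start, word.length] if similarity >= min_similarity
  end
  locations
end
```

Calling `fuzzy_match_locations(doc_text, "bob", 0.5)` on the stored text of each document returned by the search gives the start index and length of every word that clears the threshold. With only a few hundred documents, rescanning the hits this way should be cheap.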

At University we had an assignment to write a program that would find
similar documents with the purpose of catching people plagiarizing. I
used a running hash, something similar to this;

require 'ferret'

NUM_WORDS = 5

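# Returns the sorted hashes of every run of NUM_WORDS consecutive words
# in the file (a sliding window that carries over line breaks).
def hash_doc(filename)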
  stk = Ferret::Analysis::StandardTokenizer.new("")
  words = []
  hashes = []
  File.open(filename) do |f|
    f.each do |line|
      stk.text = line
      while tk = stk.next()
        words << tk.text
        if words.size == NUM_WORDS
          hashes << words.hash
          words.shift
        end
      end
    end
  end
  return hashes.sort!
end

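# Returns the number of hashes the two sorted lists share, divided by
# their average length. Note that it pops elements off both arrays.
def hash_cmp(hash1, hash2)
  same = 0
  size_avg = (hash1.size + hash2.size)/2.0
  h1 = hash1.pop
  h2 = hash2.pop
  # pop returns nil once a list is exhausted, so the final pair is still
  # compared before the loop stops.
  while (h1 and h2)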
    if (h2 == h1)
      same += 1
      h1 = hash1.pop
      h2 = hash2.pop
    else
      if (h1 > h2)
        h1 = hash1.pop
      else
        h2 = hash2.pop
      end
    end
  end
  return same.to_f/size_avg
end

puts hash_cmp(hash_doc(ARGV[0]), hash_doc(ARGV[1]))

I’m not sure if this would work better for you than using a really
long phrase query.
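
The same running-hash idea can also report where the shared phrases are, which is closer to the "start index and length" array asked for above. This is a rough sketch under a few assumptions: it tokenizes with a plain `\w+` regex rather than Ferret's `StandardTokenizer`, the helper names `shingle_positions` and `shared_spans` are made up for illustration, and it trusts `Array#hash` not to collide, so a real check should compare the words themselves before reporting a match.

```ruby
# Maps the hash of every run of `size` consecutive words in `text` to the
# list of [start_index, length] character spans where that run occurs.
def shingle_positions(text, size = 5)
  tokens = []
  text.scan(/\w+/) do |word|
    tokens << [word.downcase, Regexp.last_match.begin(0), word.length]
  end
  positions = Hash.new { |hash, key| hash[key] = [] }
  tokens.each_cons(size) do |window|
    start = window.first[1]
    finish = window.last[1] + window.last[2]
    positions[window.map(&:first).hash] << [start, finish - start]
  end
  positions
end

# Returns the sorted [start_index, length] spans of `text` whose word runs
# also appear somewhere in `other`.
def shared_spans(text, other, size = 5)
  other_hashes = shingle_positions(other, size)
  shingle_positions(text, size)
    .select { |hash, _spans| other_hashes.key?(hash) }
    .values.flatten(1).sort
end
```

Each span can be passed straight to `String#[]`, for example `text[start, length]`, to pull out the overlapping phrase. Overlapping spans are reported separately, so adjacent ones may need merging before display.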

Cheers,
Dave

johnnybutler7 · June 13, 2006, 4:29am

Hi David,

David B. wrote:

A search result highlighter is coming in a future version of Ferret.
This will enable you to find the position of the match in a document.
I can’t say when.

This would be awesome, and it’s what I’m looking for too. Has there been
any progress on this since your last post that we might be able to
take a look at?

Can we help in any way?

Cheers,

Marcus