Forum: Ferret Location of match?

John Butler (Guest)
on 2006-04-12 21:39
(Received via mailing list)
Is it possible with Ferret to find the location of the matches in a
document? For example imagine I have 100 documents and I search with
the phrase "bob~0.5" and that returns 3 matching documents. How can I
then find all locations in a specific document where it matched
"bob~0.5"? What I need is something like an array that contains the
start index and length for each match within a given document. Does
this exist? Should I break up my matching document into subdocuments
and then search on those?

Also for my application I will be searching for fairly large pieces
of text ( many sentences long ) and doing fuzzy matching. I suppose
what I am doing is very similar to trying to find matching phrases
within an essay to catch people plagiarizing (that's not what I'm
doing at all, but it's close enough in terms of methods).

Are both of these possible with Ferret? Is there another technology I
should look at for doing this? I will have a relatively small index
size ( somewhere between 100 and 500 ) and so I'm not really
concerned with speed issues.

Thanks so much for any help!

-John
David Balmain (Guest)
on 2006-04-18 04:54
(Received via mailing list)
Hi John,

On 4/13/06, John Butler <john@likealightbulb.com> wrote:
> Is it possible with Ferret to find the location of the matches in a
> document? For example imagine I have 100 documents and I search with
> the phrase "bob~0.5" and that returns 3 matching documents. How can I
> then find all locations in a specific document where it matched
> "bob~0.5"? What I need is something like an array that contains the
> start index and length for each match within a given document. Does
> this exist? Should I break up my matching document into subdocuments
> and then search on those?

A search result highlighter is coming in a future version of Ferret.
This will enable you to find the position of the match in a document.
I can't say when.
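
Until then, one way to get the start index and length of each match is to scan the document text directly once the search has told you which documents matched. The following is only a rough sketch in plain Ruby and does not use Ferret at all. The helper names (edit_distance, similarity, fuzzy_match_positions) are made up for illustration, and the similarity measure (1 - edit distance / length of the longer word) is an assumption, not necessarily how Ferret scores a fuzzy term like "bob~0.5".

```ruby
# Plain Ruby sketch, independent of Ferret: find [start, length] pairs for
# words that fuzzily match a term. All names here are hypothetical.

# Levenshtein edit distance between two strings (two-row dynamic programming).
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed similarity measure: 1 - distance / length of the longer string.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - edit_distance(a, b).to_f / longest
end

# Returns an array of [start_index, length] for each word in text whose
# similarity to term is at least min_similarity (case-insensitive).
def fuzzy_match_positions(text, term, min_similarity = 0.5)
  positions = []
  text.scan(/\w+/) do |word|
    start = Regexp.last_match.begin(0)
    if similarity(word.downcase, term.downcase) >= min_similarity
      positions << [start, word.length]
    end
  end
  positions
end
```

Calling fuzzy_match_positions(text, "bob", 0.5) on the stored text of each matching document gives one [start, length] pair per matching word. It only handles single-word terms; a phrase would need the same idea applied to runs of words.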

At university we had an assignment to write a program that would find
similar documents, with the purpose of catching people plagiarizing. I
used a running hash, something similar to this:

    require 'ferret'

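    # Number of consecutive words that are hashed together.
    NUM_WORDS = 5

    # Returns a sorted array holding one hash for every run of NUM_WORDS
    # consecutive words in the file.
    def hash_doc(filename)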
      stk = Ferret::Analysis::StandardTokenizer.new("")
      words = []
      hashes = []
      File.open(filename) do |f|
        f.each do |line|
          stk.text = line
          while tk = stk.next()
            words << tk.text
            if words.size == NUM_WORDS
              hashes << words.hash
              words.shift
            end
          end
        end
      end
      return hashes.sort!
    end

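    # Returns the number of hashes the two documents share, divided by the
    # average number of hashes per document. Both arrays must be sorted in
    # ascending order, as returned by hash_doc. Neither array is modified.
    def hash_cmp(hash1, hash2)
      size_avg = (hash1.size + hash2.size) / 2.0
      return 0.0 if size_avg.zero?
      same = 0
      i = hash1.size - 1
      j = hash2.size - 1
      # Walk both arrays from the largest hash downwards, like a merge.
      # Using indexes instead of pop makes sure the last pair is compared.
      while i >= 0 && j >= 0
        if hash1[i] == hash2[j]
          same += 1
          i -= 1
          j -= 1
        elsif hash1[i] > hash2[j]
          i -= 1
        else
          j -= 1
        end
      end
      return same / size_avg
    end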

    puts hash_cmp(hash_doc(ARGV[0]), hash_doc(ARGV[1]))

I'm not sure if this would work better for you than using a really
long phrase query.
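
If you also need to know where the shared passages are, the same running-window idea can carry character offsets along with the words. This is only a sketch under a few assumptions: it tokenizes with a plain /\w+/ regexp as a stand-in for Ferret's StandardTokenizer, it uses the word runs themselves as hash keys instead of their hash values (to sidestep collisions), and RUN_LENGTH, tokens_with_offsets, run_spans and shared_spans are names invented for the example.

```ruby
# Sketch: report where runs of RUN_LENGTH consecutive words from one text
# also occur in another. The /\w+/ tokenizer is an assumption standing in
# for Ferret's StandardTokenizer; all names here are hypothetical.
RUN_LENGTH = 5

# Returns [[word, start_offset, end_offset], ...] for each word in text.
def tokens_with_offsets(text)
  tokens = []
  text.scan(/\w+/) do |word|
    m = Regexp.last_match
    tokens << [word.downcase, m.begin(0), m.end(0)]
  end
  tokens
end

# Maps each run of RUN_LENGTH consecutive words to the [start, length]
# spans where that run occurs in text.
def run_spans(text)
  spans = Hash.new { |h, k| h[k] = [] }
  tokens_with_offsets(text).each_cons(RUN_LENGTH) do |run|
    key = run.map(&:first)
    start = run.first[1]
    spans[key] << [start, run.last[2] - start]
  end
  spans
end

# Returns the [start, length] spans in suspect whose word runs also appear
# anywhere in source, sorted by start offset.
def shared_spans(suspect, source)
  source_runs = run_spans(source)
  run_spans(suspect)
    .select { |run, _| source_runs.key?(run) }
    .values.flatten(1).sort
end
```

Each span returned by shared_spans(suspect, source) can be sliced straight out of the suspect text with suspect[start, length]. Neighbouring spans overlap when a shared passage is longer than RUN_LENGTH words, so you would probably want to merge them before showing them to a user. Note that this only catches exact word runs, not fuzzy ones.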

Cheers,
Dave
Marcus Crafter (Guest)
on 2006-06-13 04:29
Hi David,

David Balmain wrote:
> A search result highlighter is coming in a future version of Ferret.
> This will enable you to find the position of the match in a document.
> I can't say when.

This would be awesome, and it's what I'm looking for too. Has there been
any progress on this since your last post that we might be able to
take a look at?

Can we help in any way?

Cheers,

Marcus