Forum: Ruby Index and Query (#54)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
james (Guest)
on 2005-11-17 16:02
(Received via mailing list)
This was a fun quiz for me because I really don't know anything about
indexing
documents.  I learned a ton just by reading the varied solutions.

Before we get into the bit magic, check out this simple Hash style
solution by
Dale M.:

	class IndexHash
		def initialize( documents=nil )
			@index = Hash.new( [] )
			input( documents ) if documents
		end

		def input( documents )
			documents.each_pair do |symbol, contents|
				contents.split.each { |word| insert( symbol, word) }
			end
		end

		def insert( document_symbol, word )
			w = word.downcase
			unless @index[w].include?( document_symbol )
			  @index[w] += [ document_symbol ]
			end
		end

		def find( *strings )
			result = []
			strings.each do |string|
				string.split.each do |word|
					result += @index[ word.downcase ]
				end
			end
			result.uniq
		end

		def words
			@index.keys.sort
		end
	end

We see in initialize() that the index is just a Hash.  The input()
method is
also easy to understand, as it just adds each word of a document to the
Hash, as
an Array of symbolic document names where the word can be found (see
insert()).

The other side of the equation is find().  It takes an Array of Strings,
dices
those up into words, combs the index for each word, and adds all the
connected
document symbols to the result Array it returns.

David B. has done some timing of the submitted solutions:

	http://www.davebalmain.com/articles/2005/11/15/rub...

This isn't the fastest solution for searches, but it still helped me
understand
what we are shooting for here.

Now let's take it a step further and look at Bob S.'s bit
solution:

	#!/usr/local/bin/ruby

	# document indexing/searching class
	class Index

	  # default index file name
	  INDEX_FILE = 'index.dat'

	  # loads existing index file, if any
	  def initialize(index_file = INDEX_FILE)
	    @terms = {}
	    @index = {}
	    @index_file = index_file
	    if File.exists? @index_file
	      @terms, @index = Marshal.load(
	        File.open(@index_file, 'rb') {|f| f.read})
	    end
	  end

	  # ...

We immediately see that this code has two Hashes, one for the terms and
one for
the index.  We can also see that Marshal is used to save and load these
Hashes.

Now let's look at the methods that add to the index:

	  # ...

	  # sets the current document being indexed
	  def document=(name)
	    @document = name
	  end

	  # adds given term to the index under the current document
	  def <<(term)
	    raise "No document defined" unless defined? @document
	    unless @terms.include? term
	      @terms[term] = @terms.length
	    end
	    i = @terms[term]
	    @index[@document] ||= 0
	    @index[@document] |= 1 << i
	  end

	  # ...

The first method just sets the name of the document we are currently
dealing
with.  The second adds a single term to the index, under that document
name.

The first step in adding a term is to place it in the terms Hash.  The
key is
the term and the value is the number of pairs already in the Hash
(basically a
numerical index).  Why didn't Bob just use an Array here?  Because it
would slow
down lookups.  You would have to walk the Array to find the term in
question
each time you needed to know its index.

Once you have an index for the new term, it's time to record it in the
index
Hash, under the current document name.  Each document name is given a
single
Integer for a value.  The bit at the term index is then just flipped on
to
indicate the presence of a term.  This will make for some big numbers,
but
remember that Ruby will automatically switch to Bignum as needed.

Now we need the tools to get the info back out:

	  # ...

	  # finds documents containing all of the specified terms.
	  # if a block is given, each document is supplied to the
	  # block, and nil is returned. Otherwise, an array of
	  # documents is returned.
	  def find(*terms)
	    results = []
	    @index.each do |document, mask|
	      if terms.all? { |term| @terms[term] && mask[@terms[term]] != 0 }
	        block_given? ? yield(document) : results << document
	      end
	    end
	    block_given? ? nil : results
	  end

	  # dumps the entire index, showing each term and the documents
	  # containing that term
	  def dump
	    @terms.sort.each do |term, value|
	      puts "#{term}:"
	      @index.sort.each do |document, mask|
	        puts "  #{document}" if mask[@terms[term]] != 0
	      end
	    end
	  end

	  # saves the index data to disk
	  def save
	    File.open(@index_file, 'wb') do |f|
	      Marshal.dump([@terms, @index], f)
	    end
	  end

	end

	# ...

Again, find() is our primary search method.  It walks the document
listing,
checking for any document containing all the terms.  (That's different
from the
first solution we looked at which found documents containing any terms.)
A term
is found, simply by checking to see if a given bit is on.  Ruby makes
this easy
since the indexing method, [](), returns bits for Integers.  If you
provided a
block to find(), each document is passed when found.  Otherwise, find()
collects
and returns an Array of the documents.

Both dump() and save() are obvious and do exactly what the comments say
they do.

Here's the last bit of code:

	# ...
	if $0 == __FILE__
	  idx = Index.new
	  case ARGV.shift
	    when 'add'
	      ARGV.each do |fname|
	        idx.document = fname
	        IO.foreach(fname) do |line|
	          line.downcase.scan(/\w+/) { |term| idx << term }
	        end
	      end
	      idx.save
	    when 'find'
	      idx.find(*ARGV.collect { |s| s.downcase }) do |document|
	        puts document
	      end
	    when 'dump'
	      idx.dump
	    else
	      print <<-EOS
	Usage: #$0 add file [file...]       Adds files to index
	       #$0 find term [term...]      Lists files containing all term(s)
	       #$0 dump                     Dumps raw index data
	      EOS
	  end
	end

That's a clever interface in very little code.  It reads the first
argument to
see if you want to "add", "find", or "dump" (similar to svn, cvs, or
gem).  The
rest of the arguments are the files to "add" or the terms to "find".

The other interesting element at David B.'s result comparison page,
is the
chart of capabilities.  Think about how you might support queries like
"Apple
AND NOT fruit".  See David's Simple Ferret solution for a start on this
kind of
logic.

Then think about how you might add the ability to search for phrases
instead of
terms, like "Ruby P.ming Language".  This problem space is vast and
interesting to explore, I think.

My thanks to everyone who worked the quiz and to David B. for
results that
helped me know what to even look at.

Tomorrow we will explore James's favorite "trick taking game"...
This topic is locked and can not be replied to.