DictionaryMatcher (#103)

bbazzarrakk · November 25, 2006, 12:21am

The three rules of Ruby Q.:

Please do not post any solutions or spoiler discussion for this quiz
until
48 hours have passed from the time on this message.
Support Ruby Q. by submitting ideas as often as you can:

http://www.rubyquiz.com/

Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby T. follow the discussion. Please reply to the original quiz
message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Ken B.

From time to time someone asks on ruby-talk how they can write a regexp
of the
form:

/alligator|crocodile|bear|dinosaur|...|seven-thousandth-word/

It’s not hard to write such a regexp, but Ruby has in internal limit on
how big
the regular expression can be, so users find they can’t do this matching
function easily.

Implement a class DictionaryMatcher that determines whether any of the
strings
added to it are substrings of a string S. This should function as almost
a
drop-in replacement for a Regexp, therefore your implementation should
support
the following operations:

# creates a new empty matcher
dm=DictionaryMatcher.new

# adds strings to the matcher
dm << "string"
dm << "Ruby"

# determines whether a given word was one of those added to the matcher
dm.include?("Ruby")                 # => true
dm.include?("missing")              # => false
dm.include?("stringing you along")  # => false

# Regexp-like substing search
dm =~ "long string"            # => 5
dm =~ "rub you the wrong way"  # => nil

# will automatically work as a result of implementing
# DictionaryMatcher#=~ (see String#=~)
"long string" =~ dm  # => true

# implement the rest of the interface implemented by Regexps (well,

almost)
class DictionaryMatcher
alias_method :===, :=~
alias_method :match, :=~
end

If you can add additional features, like a case insensitivity option
when
creating a new DictionaryMatcher this is also very useful.

bbazzarrakk · November 25, 2006, 12:57am

Hi –

On Sat, 25 Nov 2006, Ruby Q. wrote:

function easily.
dm << “string”

If you can add additional features, like a case insensitivity option when
creating a new DictionaryMatcher this is also very useful.

What’s the ruling on priority and “greediness”? In other words,
given:

dm << “hi”
dm << “child”

what would:

dm =~ “children”

give? Would it be different if you added “child” first? Or is there
a rule about finding the longest match?

David

bbazzarrakk · November 25, 2006, 1:19am

On 11/24/06, Ruby Q. [email protected] wrote:

    # will automatically work as a result of implementing
    # DictionaryMatcher#=~ (see String#=~)
    "long string" =~ dm  # => true

On my machine (ruby 1.8.4), I don’t get this result. For me, ‘long
string’ =~ /string/ returns the same as /string/ =~ ‘long string’,
which is 5, not true.

Just a heads-up for anyone else using the provided code as the basis
for a test case.

Jamie

bbazzarrakk · November 25, 2006, 2:09am

Hi –

On Sat, 25 Nov 2006, Jamie M. wrote:

give? Would it be different if you added “child” first? Or is there
a rule about finding the longest match?

I’m doing the same thing a regexp does.

In this case, both /hi|child/ =~ ‘children’ and /child|hi/ =~
‘children’ return 0, so I’m returning the index of the first character
in the string that matches the uber-regexp we’re standing in for.

Actually, a better example of what I was pondering would be:

dm << “t”
dm << “ten”

dm =~ “tenth”

However, I guess as long as it’s just the starting offset that’s
needed, and not the ending offset (i.e., we don’t have to know how
long the match is), it won’t be an issue. My instinct was to want to
know which string did the matching, but for starting offset purposes
it doesn’t matter.

David

bbazzarrakk · November 25, 2006, 1:35am

On 11/24/06, [email protected] [email protected] wrote:

give? Would it be different if you added “child” first? Or is there
a rule about finding the longest match?

I’m doing the same thing a regexp does.

In this case, both /hi|child/ =~ ‘children’ and /child|hi/ =~
‘children’ return 0, so I’m returning the index of the first character
in the string that matches the uber-regexp we’re standing in for.

Jamie

bbazzarrakk · November 26, 2006, 1:22am

Jamie M. wrote:

for a test case.
One of the unwritten rules of Ruby Q. is to do whatever seems
appropriate with the interface, given the real interface.

You’ll notice, for example, that a Regexp returns the offset from =~, a
boolean from === (despite the documentation’s claims to the contrary),
and a MatchData object from #match.
The alias_method part of the interface simplifies this aspect of the
interface to DictionaryMatcher, although it defintiely makes more sense
to invent a MatchData object of some sort that can tell you what the
matched word was, and use that as the return value to match.

–Ken

bbazzarrakk · November 25, 2006, 10:36am

<dblack wobblini.net> writes:

Hi,

give?
I’d say that given the example:

dm << "string"
dm.include?("stringing you along")  # => false

your example would return nil, as the purpose of DictionaryMatcher seems
to be
to match whole words, and not derivatives

bbazzarrakk · November 27, 2006, 1:12am

On Sat, 25 Nov 2006 08:21:24 +0900, Ruby Q. wrote:

dm << “Ruby”

will automatically work as a result of implementing

creating a new DictionaryMatcher this is also very useful.
This DictionaryMatcher implementation implements a trie (prefix tree)
combined with Knuth-Morris-Pratt string matching on O(n) time. The funny
thing is I gave this solution as an answer to the same question on my
algorithms final last semester, and it was marked wrong. The professor
just didn’t understand what I was doing.

I violated the interface a little since knowing which words matched can
also be pretty important, and added the #scan method, because the
questioner who inspired me to write this quiz actually told me that he
wanted to count how many matches there were in the string.

require ‘enumerator’

The DictionaryMatcher class holds a collection of strings. It allows

lookups to

determine which strings are included, and it can be used similarly

to a +Regexp+ in substring matching.

class DictionaryMatcher
#Create a DictionaryMatcher with no words in it
def initialize
@internal=Node.new
end

#Add a word to the DictionaryMatcher
def add string
[email protected] string
parent_indexes=compute_failure_function string
array.zip(parent_indexes).each do |node,parentindex|
node.failure=array[parentindex] if parentindex
end
nil
end
alias_method :<<, :add

#Determine whether +string+ was previously added to the
#DictionaryMatcher.
def include? string
@internal.include? string
end

#Determine whether one of the words in the DictionaryMatcher is a
substring of
#+string+. Returns a DictionaryMatcher::MatchData object if found,
+nil+ if not
#found.
def match string
internal_match(string){|md| return md}
return nil
end

#Scans +string+ for all occurrances of strings in the
DictionaryMatcher.
#Overlapping matches are skipped (only the first one is yielded), and
#when some strings in the
#DictionaryMatcher are substrings of others, only the shortest match
at a given
#position is found.
def scan string
internal_match(string){|matchdata| yield matchdata}
nil
end

#Case equality. Similar to =~, but returns true or false.
def === string
not match(string).nil?
end

#Determines whether one of the words in the DictionaryMatcher is a
substring of
#+string+. Returns the index of the match if found, +nil+ if not
#found.
def =~ string
x=match(string)
x && x.index
end

#Contains the index matched, and the word matched
MatchData=Struct.new(:index,:match)

private
#Doing this globally for the whole word feels kludgy, but I didn’t
#want to figure out how to do this as a per-node function.
#Basically copied from Cormen, Leiserson, Rivest and Stein
#“Introduction to Algorithms” 2nd ed.
def compute_failure_function p
m=p.size
pi=[0,0]
k=0
2.upto m do |q|
k=pi[k] while k>0 and p[k] != p[q-1]
k=k+1 if p[k]==p[q-1]
pi[q]=k
end
pi
end

def internal_match string
node=@internal
e=Enumerable::Enumerator.new(string,:each_byte)
e.each_with_index do |b,index|
advance=false
until advance
nextnode=node.transitions[b]
if not nextnode
advance=true if node==node.failure #loops happen at the root
node=node.failure
elsif nextnode.endword?
yield
MatchData.new(index-node.depth,string[index-node.depth…index])
advance=true
node=@internal
else
advance=true
node=nextnode
end
end
end
end

class Node #:nodoc:
attr_accessor :transitions, :failure, :endword, :depth
alias_method :endword?, :endword
def initialize depth=0
@depth=depth
@transitions={}
@endword=false
end
def add string,offset=0, array=[]
first=string[offset]
if offset==string.size
@endword=true
array << self
return array
end
array << self
node=(@transitions[first] ||= Node.new(@depth+1))
return node.add(string,offset+1,array)
end
def include? string, offset=0
first=string[offset]
if offset==string.size
return @endword
elsif not @transitions.include?(first)
return false
else
return @transitions[first].include?(string,offset+1)
end
end
def inspect; “x”; end
end

end

bbazzarrakk · November 27, 2006, 5:24am

Author: Lou S. [email protected]

Date: Sun Nov 26 10:43:34 EST 2006

q103.rb - Solution to Rubyquiz 103 (DictionaryMatcher)

Implements DictionaryMatcher using a Trie. This version of

DictionaryMatcher only matches complete words, but it wouldn’t

be too hard to modify to match any substring.

class Trie
def initialize
@children = Hash.new {|h,k| h[k] = Trie.new}
end

attr_accessor :value

def []=(key, value)
insert(key, 0, value)
key
end

def
get(key,0)
end

def each &blk
_each &blk
end

include Enumerable

def keys
inject([]) {|keys,(k,v)| keys << k; keys}
end

def values
inject([]) {|vals,(k,v)| vals << v; vals}
end

def each_key &blk
keys.each &blk
end

def each_value &blk
values.each &blk
end

def inspect(indent=0)
buff = ‘’
i = ’ ’ * indent
buff << i + “value: #{value}\n” if value
return buff unless @children.size > 0
@children.each {|k,c|
buff << “#{i}#{k} =>\n”
buff << c.inspect(indent+2)
}
buff
end

protected

def _each(key=‘’, &blk)
blk.call(key,value) if key != ‘’ and value
@children.keys.sort.each do |k|
@children[k]._each(key + k,&blk)
end
end

def insert(key,offset,value)
if offset == key.length - 1
@children[key[offset,1]].value = value
else
@children[key[offset,1]].insert(key,offset+1,value)
end
end

def get(key,offset)
if offset == key.length - 1
@children[key[offset,1]].value
else
return nil unless @children.has_key?(key[offset,1])
@children[key[offset,1]].get(key,offset+1)
end
end
end

class DictionaryMatcher
def initialize(opts={})
@ignore_case = opts[:ignore_case]
@trie = Trie.new
end

def ignores_case?
@ignore_case
end

def <<(word)
word = word.downcase if ignores_case?
@trie[word] = true
end

def words
@trie.keys
end

def include?(word)
!@trie[(ignores_case?? word.downcase : word)].nil?
end

def =~(string)
words = string.split
positions = words.inject({}) { |h,w|
h[w] = string.index(w) unless h[w]; h
}
words.each do |word|
return positions[word] if
include?(ignores_case?? word.downcase : word)
end
return nil
end

alias_method :===, :=~
alias_method :match, :=~
end

bbazzarrakk · November 27, 2006, 5:42am

Well, I’m not sure I can add anything not covered by Adam’s solution
(forwarded in by James), but here goes.

I started with the naive solution:

class DictionaryMatcher < Array
def =~(string)
self.map{|e| Regexp.new(e) =~ string }.compact.min
end
end

I then fleshed it out with a case insensitive option, and more tests.

Jamie

class DictionaryMatcher < Array
alias_method :===, :=~
alias_method :match, :=~

def initialize(default = [], options = nil)
super(default)

unless options.nil? or options.is_a? Fixnum
  options = Regexp::IGNORECASE
end
@regexp_options = options

end

def =~(string)
self.map{|e| Regexp.new(e, @regexp_options) =~ string }.compact.min
end
end

class DictionaryMatcherTest < Test::Unit::TestCase
def test_acceptance
# creates a new empty matcher
dm=DictionaryMatcher.new

# adds strings to the matcher
dm << "string"
dm << "Ruby"

# determines whether a given word was one of those added to the

matcher
assert_equal true, dm.include?(“Ruby”) # => true
assert_equal false, dm.include?(“missing”) # => false
assert_equal false, dm.include?(“stringing you along”) # => false

# Regexp-like substing search
assert_equal 5,   dm =~ "long string"            # => 5
assert_equal nil, dm =~ "rub you the wrong way"  # => nil

# will automatically work as a result of implementing
# DictionaryMatcher#=~ (see String#=~)
assert_equal 5, "long string" =~ dm  # => true

end

def test_include_eh
dm = DictionaryMatcher.new([‘string’, ‘ruby’, ‘foo’])
assert_equal true, dm.include?(‘string’ )
assert_equal true, dm.include?(‘ruby’ )
assert_equal true, dm.include?(‘foo’ )
assert_equal false, dm.include?(‘stringa’)
assert_equal false, dm.include?(‘astring’)
end

def test_equals_tilde
dm = DictionaryMatcher.new([‘string’, ‘ruby’, ‘foo’])
assert_equal 0, dm =~ ‘string’
assert_equal 0, dm =~ ‘string two’
assert_equal 6, dm =~ ‘three string’
assert_equal 5, dm =~ ‘four string five’
assert_equal nil, dm =~ ‘strng’

assert_equal 0,  'string' =~ dm
assert_equal 2,  'a string b' =~ dm
assert_equal nil, 'strng' =~ dm

end

def test_case_sensitivity
dm = DictionaryMatcher.new([‘Foo’,‘bar’])
assert_equal 0, dm =~ ‘Foo’
assert_equal 0, dm =~ ‘bar’
assert_equal nil, dm =~ ‘foo’
assert_equal nil, dm =~ ‘Bar’
end

def test_case_insensitivity
dm = DictionaryMatcher.new([‘Foo’,‘bar’], true)
assert_equal 0, dm =~ ‘Foo’
assert_equal 0, dm =~ ‘bar’
assert_equal 0, dm =~ ‘foo’
assert_equal 0, dm =~ ‘Bar’
end

def test_greediness
dm = DictionaryMatcher.new([‘hi’,‘child’])
r = /hi|child/
assert_equal r =~ ‘children’, dm =~ ‘children’

dm = DictionaryMatcher.new(['child','hi'])
r = /child|hi/
assert_equal r =~ 'children', dm =~ 'children'

end

bbazzarrakk · November 27, 2006, 6:32am

On Mon, 27 Nov 2006 11:47:38 +0900, Edwin F. wrote:

trie is faster because the has table has to hash all characters in the
valid string is seen).
trie = trie[character code] # drop down 1 level
The search algorithm is surprisingly simple. The search space is the
end
the best bet.
doubles
the run time.

[CODE SNIPPED]

I have posted the code, Test::Unit tests, the test file, and dictionary
file on my web site here:

http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz

I hope I have not violated any copyrights/lefts by supplying the text
files; if so, let me know and I will remedy this.

Wow. My solution’s a real slowpoke compared to yours, even thought
mine’s
asymptotically faster: O(string.length) versus
O(string.length* max(keys.length))

I wonder what’s slowing mine down so much, since these are in many ways
the same.

                    user     system      total        real

Edwin F. – fill 0.810000 0.130000 0.940000 ( 0.939537)
Edwin F. – test 5.580000 0.630000 6.210000 ( 6.244736)
Ken B. – fill 4.550000 0.460000 5.010000 ( 5.010708)
Ken B. – test 25.210000 0.960000 26.170000 ( 27.519233)

(tested using the files you provided – I modified your interface just a
little to alias your #find_all_matches method to #scan)

I’d be happy to benchmark anybody else’s solution who implements #scan.

#!/usr/bin/env ruby
open(“practical-file-system-design.txt”) do |f|
FILEDATA=f.read
end

open(“words_en.txt”) do |f|
DICTIONARY=f.readlines.map{|x| x.chomp}
end

require ‘benchmark’
include Benchmark
#the following files contain various implementations, renamed so as not
to
#conflict with each other
require ‘trie’
require ‘finedm’

TESTCLASSES={“Ken B.” => KenDictionaryMatcher,
“Edwin F.” => FineDictionaryMatcher}

bm(TESTCLASSES.keys.collect{|x| x.length}.max + 8) do |benchmarker|
matcher=nil
TESTCLASSES.each do |name,klass|
benchmarker.report(“#{name} – fill”) do
matcher=klass.new
DICTIONARY.each {|x| matcher << x}
end
benchmarker.report(“#{name} – test”) do
matcher.scan(FILEDATA){}
end
end
end
END