This solution uses a digital trie to encode the strings to search.
A good background to this data structure can be found here:
Ternary Search Trees | Dr Dobb's,
which also discusses a better solution that I did not explore (ternary
search trees).
A trie has a search speed that is O(m), where m is the number of
characters in the string to find in the dictionary. A hash table is
O(1), which is theoretically faster than a trie, but in practice the
trie is faster because the has table has to hash all characters in the
string to create the hash key, whereas the trie can reject a string by
matching as little as 1 character.
The solution trades memory usage for speed. Every string in the
dictionary is organized into a hierarchy of character codes.
Essentially, the dictionary encodes a DFA (deterministic finite
automaton) where each character code is a transition from one node to
the next one. This provides very fast search speeds, especially when a
non-matching string is encountered. In other words, the trie can reject
incorrect strings very quickly (as soon as a non-matching prefix to a
valid string is seen).
To store a string in the dictionary, we start at the root (which is just
a hash
or array of the character codes of the first character of every string
in the
dictionary).
trie = root
for each character code in the string to store
trie[character code] ||= {} # or [], if arrays used
trie = trie[character code] # drop down 1 level
end
trie[0] = true # Mark end of word (code 0 will never be used)
It is necessary to mark the end of a word because you can get words that
are prefixes of other words (for example, can, canna, and cannabis).
This allows you to decide if you want to use a least-prefix or greedy
search. This code uses a least-prefix search that returns the shortest
matching string in the dictionary. This is due to the requirements of
the quiz. However, it should be easy enough to code a greedy search.
The search algorithm is surprisingly simple. The search space is the
text that
we want to search for matching strings. The code starts at the first
character of the search space and tries to find a match in the
dictionary anchored at that position. If it does not detect a match, it
moves the anchor to the next character and starts again.
trie = root
for each character code in the search space
break unless character code is in trie
trie = trie[character code]
end
found a valid string if trie[0] exists
Each character in the dictionary to be searched is an index into a hash.
Informal benchmarks on my system (Intel Core 2 Duo E6600 on Win XP 32
bit)
shows a memory usage of about 15MB when using a hash as the basic trie
storage, and 30+Mb using an array, when storing a dictionary of 24,001
words.
The speed seems to be about the same either way, so the hash solution is
the best bet.
The test code finds all words in the 24,001 word dictionary in a text of
about
600KB. On my system, this takes around 2 seconds. This does include a
possibly
incorrect optimization: once a word is matched, the anchor point is
moved to
after the word so that substrings within the word are not matched. This
may
or may not be what is desired, but removing this optimization almost
doubles
the run time.
The code itself is fairly short:
class DictionaryMatcher
attr_reader :word_count
def initialize
@trie = {}
@word_count = 0
end
def add_word(word)
@word_count += 1
container = @trie
word.each_byte do |b|
container[b] = {} unless container.has_key? b
container = container[b]
end
container[0] = true # Mark end of word
end
def include?(word)
container = @trie
word.each_byte do |b|
break unless container.has_key? b
container = container[b]
end
container[0]
end
def =~(text)
text_end = text.length - 1
pos = 0
while pos <= text_end do
container = @trie
pos.upto(text_end) do |i|
b = text[i]
break unless container.has_key? b
container = container[b]
end
return pos if container[0] # Match
pos += 1
end
nil
end
Return container of matches in text [[pos, len], …]
or call block if provided (returns [])
def find_all_matching(text, &block)
matches = []
block = lambda { |pos, len| matches << [pos, len] } unless block
pos = 0
text_end = text.length - 1
while pos <= text_end do
container = @trie
len = 0
pos.upto(text_end) do |i|
b = text[i]
break unless container.has_key?(b)
container = container[b]
len += 1
end
if container[0] # Match
block.call(pos, len)
pos += len # Skip over word
else
pos += 1
end
end
matches
end
implement much of the rest of the interface implemented by Regexps
alias_method :===, :=~
alias_method :match, :=~
alias_method :<<, :add_word
Add words from a file
def add_words(words_file)
IO.foreach(words_file) do |line|
add_word line.chomp
end
end
end
I have posted the code, Test::Unit tests, the test file, and dictionary
file on my web site here:
http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz
I hope I have not violated any copyrights/lefts by supplying the text
files; if so, let me know and I will remedy this.