Algorithm for fuzzy string matching

Hello all,

Here is an attempt at a naive fuzzy string matching algorithm. I would
appreciate comments on the code - which is not as simple as I would
like.

I have a couple of speed improvements in mind:

  1. replace EQUAL hash with array (possibly NArray).
  2. a path can be terminated before fully explored if it can be
    calculated that it will never complete (somewhere inside in_bounds?).

Cheers,

Martin

#!/usr/bin/env ruby

require ‘pp’

VERBOSE = false
TEST = true

IUPAC nucleotide pair ambiguity equivalents

EQUAL = {
:AA => true, :BU => true, :TH => true, :UY => true,
:TT => true, :CB => true, :UH => true, :SC => true,
:CC => true, :GB => true, :VA => true, :SG => true,
:GG => true, :TB => true, :VC => true, :CS => true,
:UU => true, :UB => true, :VG => true, :GS => true,
:NA => true, :DA => true, :AV => true, :WA => true,
:NT => true, :DG => true, :CV => true, :WT => true,
:NC => true, :DT => true, :GV => true, :WU => true,
:NG => true, :DU => true, :KG => true, :AW => true,
:NU => true, :AD => true, :KT => true, :TW => true,
:AN => true, :GD => true, :KU => true, :UW => true,
:TN => true, :TD => true, :GK => true, :RA => true,
:CN => true, :UD => true, :TK => true, :RG => true,
:GN => true, :HA => true, :UK => true, :AR => true,
:UN => true, :HC => true, :YC => true, :GR => true,
:NN => true, :HT => true, :YT => true, :MA => true,
:BC => true, :HU => true, :YU => true, :MC => true,
:BG => true, :AH => true, :CY => true, :AM => true,
:BT => true, :CH => true, :TY => true, :CM => true,
}

Class to match two nucleotide sequences allowing for ambiguity codes

and a given maximum number of matches, insertions, and deletions.

class Matcher
attr_accessor :seq1_index, :seq2_index, :matches, :mismatches,
:insertions, :deletions

def initialize(seq1, seq2, max_mismatches = 0, max_insertions = 0,
max_deletions = 0, seq1_index = 0, seq2_index = 0, matches = 0,
mismatches = 0, insertions = 0, deletions = 0)
@seq1 = seq1
@seq2 = seq2
@max_mismatches = max_mismatches
@max_insertions = max_insertions
@max_deletions = max_deletions
@seq1_index = seq1_index
@seq2_index = seq2_index
@matches = matches
@mismatches = mismatches
@insertions = insertions
@deletions = deletions
end

Method to explore all paths for matching two sequences allowing for

a given maximum number of mismatches, insertions and deletions.

def match
paths = []
paths << self

while not paths.empty?
  pp paths if VERBOSE

  new_paths = []

  paths.each do |path|
    if path.match?                               # ---- Match ----
      path_matches = path.dup
      path_matches.matches += 1

      return path_matches if path_matches.ok?

      path_matches.seq1_index += 1
      path_matches.seq2_index += 1

      new_paths << path_matches if path_matches.in_bounds?
    else                                         # ---- Mismatch

      path_mismatches = path.dup
      path_mismatches.mismatches += 1

      return path_mismatches if path_mismatches.ok?

      path_mismatches.seq1_index += 1
      path_mismatches.seq2_index += 1

      new_paths << path_mismatches if path_mismatches.in_bounds?
    end

    if path.insertions < @max_insertions         # ---- Insertion

      path_insertions = path.dup
      path_insertions.insertions += 1

      return path_insertions if path_insertions.ok?

      path_insertions.seq1_index += 1

      new_paths << path_insertions if path_insertions.in_bounds?
    end

    if path.deletions < @max_deletions           # ---- Deletion

      path_deletions = path.dup
      path_deletions.deletions += 1

      return path_deletions if path_deletions.ok?

      path_deletions.seq2_index += 1

      new_paths << path_deletions if path_deletions.in_bounds?
    end
  end

  paths = new_paths
end

end

Method to check if residues match.

def match?
if self.seq1_index < @seq1.length and self.seq2_index < @seq2.length
EQUAL[(@seq1[self.seq1_index].upcase +
@seq2[self.seq2_index].upcase).to_sym]
end
end

Method to check if a path is complete.

def ok?
if self.mismatches <= @max_mismatches and self.insertions <=
@max_insertions and self.deletions <= @max_deletions
if @seq1.length == self.matches + self.mismatches +
self.insertions and
@seq2.length == self.matches + self.mismatches + self.deletions
pp self if VERBOSE
true
end
end
end

Method to check if the path is within the search space.

def in_bounds?
if self.seq1_index <= @seq1.length and self.seq2_index <=
@seq2.length
true
else
false
end
end
end

if VERBOSE
m = Matcher.new(“atcg”, “atcgX”, mismatches=0, insertions=0,
deletions=1)
m.match
end

if TEST and FILE == $PROGRAM_NAME
require “test/unit”

class TestMatcher < Test::Unit::TestCase
def test_perfect_match_returns_ok
m = Matcher.new(“atcg”, “atcg”, mismatches=0, insertions=0,
deletions=0)
assert_not_nil(m.match)
end

def test_perfect_match_with_ambiguity_returns_ok
  m = Matcher.new("atcg", "NNNN", mismatches=0, insertions=0,

deletions=0)
assert_not_nil(m.match)
end

def test_one_mismatch_with_zero_allowed_returns_nil
  m = Matcher.new("atcg", "atGg", mismatches=0, insertions=0,

deletions=0)
assert_equal(nil, m.match)
end

def test_one_mismatch_with_one_allowed_returns_ok
  m = Matcher.new("atcg", "atGg", mismatches=1, insertions=0,

deletions=0)
assert_not_nil(m.match)
end

def test_two_mismatch_with_one_allowed_returns_nil
  m = Matcher.new("atcg", "GtGg", mismatches=1, insertions=0,

deletions=0)
assert_equal(nil, m.match)
end

def test_two_mismatch_with_two_allowed_returns_ok
  m = Matcher.new("atcg", "GtGg", mismatches=2, insertions=0,

deletions=0)
assert_not_nil(m.match)
end

def test_one_insertion_with_zero_allowed_returns_nil
  m = Matcher.new("atGcg", "atcg", mismatches=0, insertions=0,

deletions=0)
assert_equal(nil, m.match)
end

def test_one_insertion_with_one_allowed_returns_ok
  m = Matcher.new("atGcg", "atcg", mismatches=0, insertions=1,

deletions=0)
assert_not_nil(m.match)
end

def test_two_insertion_with_one_allowed_returns_nil
  m = Matcher.new("atGcgC", "atcg", mismatches=0, insertions=1,

deletions=0)
assert_equal(nil, m.match)
end

def test_two_insertion_with_two_allowed_returns_ok
  m = Matcher.new("atGcgC", "atcg", mismatches=0, insertions=2,

deletions=0)
assert_not_nil(m.match)
end

def test_one_deletion_with_zero_allowed_returns_nil
  m = Matcher.new("atcg", "atcAg", mismatches=0, insertions=0,

deletions=0)
assert_equal(nil, m.match)
end

def test_one_deletion_with_one_allowed_returns_ok
  m = Matcher.new("atcg", "atcAg", mismatches=0, insertions=0,

deletions=1)
assert_not_nil(m.match)
end

def test_two_deletion_with_one_allowed_returns_nil
  m = Matcher.new("atcg", "TatcAg", mismatches=0, insertions=0,

deletions=1)
assert_equal(nil, m.match)
end

def test_two_deletion_with_two_allowed_returns_ok
  m = Matcher.new("atcg", "AatcTg", mismatches=0, insertions=0,

deletions=2)
assert_not_nil(m.match)
end

def test_one_mismatch_one_insertions_one_deletion_returns_ok
  m = Matcher.new("TaGcg", "atcgA", mismatches=1, insertions=1,

deletions=1)
assert_not_nil(m.match)
end
end
end

On Mar 23, 2011, at 06:45 , Martin H. wrote:

Hello all,

Here is an attempt at a naive fuzzy string matching algorithm. I would
appreciate comments on the code - which is not as simple as I would
like.

I have a couple of speed improvements in mind:

  1. replace EQUAL hash with array (possibly NArray).

That would generally be slower, depending on use. One thing you might
want to do is clean up the code is match:

  EQUAL[(@seq1[self.seq1_index].upcase +
         @seq2[self.seq2_index].upcase).to_sym]

could be:

  EQUAL[:"#{@seq1[self.seq1_index]}#{@seq2[self.seq2_index]}"]

You’ll have to change your initializer to upcase the input:

@seq1           = seq1.upcase
@seq2           = seq2.upcase

Looks like you’d get a better speed up using strings as your hash keys:

of iterations = 1000000

                      user     system      total        real

null_time 0.130000 0.000000 0.130000 ( 0.131828)
upcase + to_sym 21.320000 0.020000 21.340000 ( 21.366015)
interpolated sym 14.760000 0.010000 14.770000 ( 14.796624)
interpolated str 11.290000 0.020000 11.310000 ( 11.449326)

  1. a path can be terminated before fully explored if it can be
    calculated that it will never complete (somewhere inside in_bounds?).

Dunno… but you have a good test suite, so it is probably pretty easy
to write a test and fix it.

Minor cleanup:

EQUAL = Hash[%w(AA BU TH UY TT CB UH SC CC GB VA SG GG TB VC CS UU UB
VG GS NA DA AV WA NT DG CV WT NC DT GV WU NG DU KG AW
NU AD KT TW AN GD KU UW TN TD GK RA CN UD TK RG GN HA
UK AR UN HC YC GR NN HT YT MA BC HU YU MC BG AH CY AM
BT CH TY CM).map { |pair| [pair.to_sym, true] }]