I wrote the method below by copying the algorithm from http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance (and matrix is a really simple 2d array implementation). But the problem is that it slows wat down as the string size gets bigger. At string length of about 150 it takes 1s, at 500 10s. Is there any way to recode this to get better performance without rewriting it in C, and would rewrting it in C even help or is this just a slow algorithm? def self.distance(string1, string2) string1 = string1.unpack('C*') string2 = string2.unpack('C*') s1n = string1.length s2n = string2.length m = DLDiff::Matrix.new(s1n+1, s2n+1) cost = 0 (0..s1n).each {|i| m[i,0] = i} (1..s2n).each {|j| m[0,j] = j} (1..s1n).each do |i| (1..s2n).each do |j| cost = string1[i] == string2[j] ? 0 : 1 m[i, j] = [ m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1] + cost ].min m[i, j] = [ m[i,j], m[i-2,j-2] + cost].min if(i > 1 && j > 1 && string1[i] == string2[j-1] && string1[i-1] == string2[j]) end end m[s1n, s2n] end class Matrix def initialize(columns, rows) ac = Array.new(columns, 0) @am = Array.new(rows, 0) @am = @am.collect{|r| ac.dup} end def [](c, r) @am[r][c] end def []=(c,r,value) @am[r][c] = value end def inspect @am.collect{|a| a.inspect}.join("\n") end def to_s @am.to_s end end

on 2007-05-18 03:37

on 2007-05-18 07:40

Jeremy <jwells@servalsystems.co.uk> writes: > I wrote the method below by copying the algorithm from > http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance (and matrix > is a really simple 2d array implementation). But the problem is that > it slows wat down as the string size gets bigger. At string length of > about 150 it takes 1s, at 500 10s. Is there any way to recode this to > get better performance without rewriting it in C, and would rewrting > it in C even help or is this just a slow algorithm? Well, the algorithm itself is O(n*m), where n and m are the size of the strings involved, so on large strings it's going to get slow. I was able to shave about 40% off the time for your method, and fix a bug. The bug was caused because the wikipedia article indexes the strings starting from 1, but indexes the array starting form 0. In ruby both start at 0, of course. To fix this, I added a fake element at the start of both strings. Anyway, here's my slightly faster version: def self.distance(string1, string2) string1 = string1.unpack('C*') string2 = string2.unpack('C*') s1n = string1.length s2n = string2.length string1.unshift -1 string2.unshift -1 m = Array.new(s1n+1){Array.new(s2n+1,0)} cost = 0 a = nil; b = nil (0..s1n).each {|i| m[i][0] = i} (1..s2n).each {|j| m[0][j] = j} (1..s1n).each do |i| (1..s2n).each do |j| cost = string1[i] == string2[j] ? 0 : 1 a = m[i-1][j] + 1 a = b if ((b = m[i][j-1]+1) < a) a = b if ((b = m[i-1][j-1]+cost) < a) if(i > 1 && j > 1 && string1[i] == string2[j-1] && string1[i-1] == string2[j]) then a = b if ((b = m[i-2][j-2] + 1) < a) end m[i][j] = a end end m[s1n][s2n] end

on 2007-05-18 10:42

Daniel Martin wrote: > Well, the algorithm itself is O(n*m), where n and m are the size of > the strings involved, so on large strings it's going to get slow. > > I was able to shave about 40% off the time for your method, and fix a > bug. > > Thanks, thats really useful. I guess that now the data i'm comparing has grown it would be better to use a diff type method than this.