40 million levenshtein distances for two long strings

john_old · May 15, 2008, 3:26am

I am trying to discover similar files to reduce redundancy on a large
project. The ‘Text’ gem works well for this, but even short strings
take a long time. Large strings - like 20k HTML files - take an
amazing amount of time. My script looks like this:

require ‘rubygems’
require ‘text’

a = file_one
b = file_two

puts Text::Levenshtein.distance(a, b)

It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn’t possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?

john_old · May 15, 2008, 4:40am

On Thu, May 15, 2008 at 10:25 AM, John [email protected] wrote:

puts Text::Levenshtein.distance(a, b)

It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn’t possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?

Take a look at the source of the Text gem, the algorithm for
Text::Levenshtein::distance is not too hard to read or long, maybe you
can just modify that to suit your needs?

^ manveru

john_old · May 15, 2008, 8:13am

On May 14, 2008, at 7:25 PM, John wrote:

It would be nice to be able to short-circuit the comparison when the
distance crossed a max value

diff file|head -$max

??

a @ http://codeforpeople.com/

john_old · May 15, 2008, 1:25pm

I have to do a lot of comparisons, about 40 million.

40 million or 400 million?.. :}

I’ve once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

$ ll lev[12]
-rw-r–r-- 1 erik erik 19275 May 15 12:10 lev1
-rw-r–r-- 1 erik erik 19275 May 15 12:12 lev2

$ cat lev.rb
require “ev/levenshtein”

s1 = File.read(“lev1”)
s2 = File.read(“lev2”)

p EV.levenshtein(s1, s2)

$ time ruby lev.rb
12

real 0m3.105s
user 0m2.843s
sys 0m0.281s

john_old · May 15, 2008, 2:37pm

-------- Original-Nachricht --------

Datum: Thu, 15 May 2008 20:24:42 +0900
Von: Erik V. [email protected]
An: [email protected]
Betreff: Re: 40 million levenshtein distances for two long strings

I have to do a lot of comparisons, about 40 million.

40 million or 400 million?.. :}

I’ve once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

Hi Erik,

Could you post your implementation here ?

Thank you,

Axel

john_old · May 16, 2008, 12:00pm

On May 15, 7:29 pm, John [email protected] wrote:

…48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I’m with Axel, I think more than just a couple
people would be interested in that slice of C.

The wikipedia article on Levenshtein distance has a reasonable C
implementation - using Ruby inline on it is fairly trivial.

I used the Wikipedia version from Ruby for some sanity checks on
results for my MSc thesis, but unfortunately I can’t find my cleaned
up code right now (at work - I can look for it later today if you
like, and unless someone else has posted a full version by then)

Vidar

john_old · May 16, 2008, 4:41pm

I’ve requested a new RubyForge project. Until it’s available,
I’ll paste the code on PasteBin.

Comments and suggestions are welcome. To be honest, its’ my
first and only C extensions 'till now…

I’ve never released a C extension as a GEM. Anybody willing to
help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

Ruby library: http://pastebin.com/m36427195
C extension: http://pastebin.com/m5bdf8ceb
extconf.rb: http://pastebin.com/m4c4c40b5
Unit Tests: http://pastebin.com/m6e956e5e

john_old · May 16, 2008, 4:51pm

The Ruby version of Levenshtein#distance didn’t pass the
tests… :}

Here’s a new version.

gegroet,
Erik V. - http://www.erikveen.dds.nl/

Ruby Library: http://pastebin.com/d17d0a92b

john_old · May 16, 2008, 10:40pm

Let us know when your project is approved! Good work sir.

john_old · May 15, 2008, 8:30pm

…48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I’m with Axel, I think more than just a couple
people would be interested in that slice of C.

john_old · May 24, 2008, 5:23pm

I’ve released the Levenshtein module as a gem. The most
expensive part of the algorithm is written in C. But before
entering the most expensive part of the algorithm, it tries to
optimize as much as possible (in Ruby land).

There’s a TAR.GZ archive and a GEM for Linux (tested), OS X
(untested) and Cygwin (tested). I wasn’t able to compile a
version that worked with the One-Click Ruby Installer. I’m not
familiar with Windows, nor with its development environment.
Anyone willing to help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

The Levenshtein distance is a metric for measuring the amount
of difference between two sequences (i.e., the so called edit
distance). The Levenshtein distance between two strings is
given by the minimum number of operations needed to transform
one string into the other, where an operation is an insertion,
deletion, or substitution of a single character.

More information about the Levenshtein distance algorithm:
Levenshtein distance - Wikipedia .

More information about the library:
levenshtein (0.2.2) .

john_old · May 25, 2008, 12:30pm

John wrote:

I am trying to discover similar files to reduce redundancy on a large
project.

Edit distance isn’t the only way to describe (dis)similarity, you know!
Moreover, what you’re proposing is a really brute-force approach.

Defining similarity is not a trivial exercise. Two identical documents
should have a similarity of one, no matter how measured, but dissimilar
documents can have very different similarity figures depending on how
you measure them.

I suggest you look at the information retrieval literature (search
engines etc).