Ruby Forum Ruby > 40 million levenshtein distances for two long strings

Posted by John (Guest)
on 15.05.2008 03:26
(Received via mailing list)
I am trying to discover similar files to reduce redundancy on a large
project. The 'Text' gem works well for this, but even short strings
take a long time. Large strings - like 20k HTML files - take an
amazing amount of time. My script looks like this:

require 'rubygems'
require 'text'

a = file_one
b = file_two

puts Text::Levenshtein.distance(a, b)


It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn't possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?
Posted by Michael Fellinger (Guest)
on 15.05.2008 04:40
(Received via mailing list)
On Thu, May 15, 2008 at 10:25 AM, John <john.d.perkins@gmail.com> wrote:
>
>  puts Text::Levenshtein.distance(a, b)
>
>
>  It would be nice to be able to short-circuit the comparison when the
>  distance crossed a max value, but that isn't possible. It would be
>  even BETTER to be able to compare long stings like with PHPs
>  similar_text, which has nice percentage output. I have to do a lot of
>  comparisons, about 40 million. Is there something already written?

Take a look at the source of the Text gem, the algorithm for
Text::Levenshtein::distance is not too hard to read or long, maybe you
can just modify that to suit your needs?

^ manveru
Posted by ara.t.howard (Guest)
on 15.05.2008 08:13
(Received via mailing list)
On May 14, 2008, at 7:25 PM, John wrote:

> It would be nice to be able to short-circuit the comparison when the
> distance crossed a max value

diff file|head -$max

??

a @ http://codeforpeople.com/
Posted by Erik Veenstra (Guest)
on 15.05.2008 13:25
(Received via mailing list)
> I have to do a lot of comparisons, about 40 million.

40 million or 400 million?... :}

I've once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

 $ ll lev[12]
 -rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1
 -rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2

 $ cat lev.rb
 require "ev/levenshtein"

 s1 = File.read("lev1")
 s2 = File.read("lev2")

 p EV.levenshtein(s1, s2)

 $ time ruby lev.rb
 12

 real    0m3.105s
 user    0m2.843s
 sys     0m0.281s
Posted by Axel Etzold (Guest)
on 15.05.2008 14:37
(Received via mailing list)
-------- Original-Nachricht --------
> Datum: Thu, 15 May 2008 20:24:42 +0900
> Von: Erik Veenstra <erikveen@gmail.com>
> An: ruby-talk@ruby-lang.org
> Betreff: Re: 40 million levenshtein distances for two long strings

> > I have to do a lot of comparisons, about 40 million.
> 
> 40 million or 400 million?... :}
> 
> I've once written a Levenshtein implementation in C, because of
> this speed issue. It compares 2 20 KB files in 3 seconds. And
> it does come with a threshold.
> 
> Interested?

Hi Erik,

Could you post your implementation here ?

Thank you,

Axel
Posted by John (Guest)
on 15.05.2008 20:30
(Received via mailing list)
...48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I'm with Axel, I think more than just a couple
people would be interested in that slice of C.
Posted by Vidar Hokstad (Guest)
on 16.05.2008 12:00
(Received via mailing list)
On May 15, 7:29 pm, John <john.d.perk...@gmail.com> wrote:
> ...48,270,225 to be exact, which (without multithreading) would take a
> bit over 300 days. I'm with Axel, I think more than just a couple
> people would be interested in that slice of C.

The wikipedia article on Levenshtein distance has a reasonable C
implementation - using Ruby inline on it is fairly trivial.

I used the Wikipedia version from Ruby for some sanity checks on
results for my MSc thesis, but unfortunately I can't find my cleaned
up code right now (at work - I can look for it later today if you
like, and unless someone else has posted a full version by then)

Vidar
Posted by Erik Veenstra (Guest)
on 16.05.2008 16:41
(Received via mailing list)
I've requested a new RubyForge project. Until it's available,
I'll paste the code on PasteBin.

Comments and suggestions are welcome. To be honest, its' my
first and only C extensions 'till now...

I've never released a C extension as a GEM. Anybody willing to
help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

 Ruby library:   http://pastebin.com/m36427195
 C extension:    http://pastebin.com/m5bdf8ceb
 extconf.rb:     http://pastebin.com/m4c4c40b5
 Unit Tests:     http://pastebin.com/m6e956e5e
Posted by Erik Veenstra (Guest)
on 16.05.2008 16:51
(Received via mailing list)
The Ruby version of Levenshtein#distance didn't pass the
tests... :}

Here's a new version.

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

 Ruby Library:   http://pastebin.com/d17d0a92b
Posted by DJ Jazzy Linefeed (Guest)
on 16.05.2008 22:40
(Received via mailing list)
Let us know when your project is approved! Good work sir.
Posted by Erik Veenstra (Guest)
on 24.05.2008 17:23
(Received via mailing list)
I've released the Levenshtein module as a gem. The most
expensive part of the algorithm is written in C. But before
entering the most expensive part of the algorithm, it tries to
optimize as much as possible (in Ruby land).

There's a TAR.GZ archive and a GEM for Linux (tested), OS X
(untested) and Cygwin (tested). I wasn't able to compile a
version that worked with the One-Click Ruby Installer. I'm not
familiar with Windows, nor with its development environment.
Anyone willing to help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

The Levenshtein distance is a metric for measuring the amount
of difference between two sequences (i.e., the so called edit
distance). The Levenshtein distance between two strings is
given by the minimum number of operations needed to transform
one string into the other, where an operation is an insertion,
deletion, or substitution of a single character.

More information about the Levenshtein distance algorithm:
http://en.wikipedia.org/wiki/Levenshtein_distance .

More information about the library:
http://www.erikveen.dds.nl/levenshtein/doc/index.html .
Posted by Dave Bass (dogsbody)
on 25.05.2008 12:30
John wrote:
> I am trying to discover similar files to reduce redundancy on a large
> project.

Edit distance isn't the only way to describe (dis)similarity, you know! 
Moreover, what you're proposing is a really brute-force approach.

Defining similarity is not a trivial exercise. Two identical documents 
should have a similarity of one, no matter how measured, but dissimilar 
documents can have very different similarity figures depending on how 
you measure them.

I suggest you look at the information retrieval literature (search 
engines etc).