I am trying to discover similar files to reduce redundancy on a large project. The 'Text' gem works well for this, but even short strings take a long time. Large strings - like 20k HTML files - take an amazing amount of time. My script looks like this: require 'rubygems' require 'text' a = file_one b = file_two puts Text::Levenshtein.distance(a, b) It would be nice to be able to short-circuit the comparison when the distance crossed a max value, but that isn't possible. It would be even BETTER to be able to compare long stings like with PHPs similar_text, which has nice percentage output. I have to do a lot of comparisons, about 40 million. Is there something already written?
on 15.05.2008 03:26
on 15.05.2008 04:40
On Thu, May 15, 2008 at 10:25 AM, John <john.d.perkins@gmail.com> wrote: > > puts Text::Levenshtein.distance(a, b) > > > It would be nice to be able to short-circuit the comparison when the > distance crossed a max value, but that isn't possible. It would be > even BETTER to be able to compare long stings like with PHPs > similar_text, which has nice percentage output. I have to do a lot of > comparisons, about 40 million. Is there something already written? Take a look at the source of the Text gem, the algorithm for Text::Levenshtein::distance is not too hard to read or long, maybe you can just modify that to suit your needs? ^ manveru
on 15.05.2008 08:13
On May 14, 2008, at 7:25 PM, John wrote: > It would be nice to be able to short-circuit the comparison when the > distance crossed a max value diff file|head -$max ?? a @ http://codeforpeople.com/
on 15.05.2008 13:25
> I have to do a lot of comparisons, about 40 million. 40 million or 400 million?... :} I've once written a Levenshtein implementation in C, because of this speed issue. It compares 2 20 KB files in 3 seconds. And it does come with a threshold. Interested? gegroet, Erik V. - http://www.erikveen.dds.nl/ ---------------------------------------------------------------- $ ll lev[12] -rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1 -rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2 $ cat lev.rb require "ev/levenshtein" s1 = File.read("lev1") s2 = File.read("lev2") p EV.levenshtein(s1, s2) $ time ruby lev.rb 12 real 0m3.105s user 0m2.843s sys 0m0.281s
on 15.05.2008 14:37
-------- Original-Nachricht -------- > Datum: Thu, 15 May 2008 20:24:42 +0900 > Von: Erik Veenstra <erikveen@gmail.com> > An: ruby-talk@ruby-lang.org > Betreff: Re: 40 million levenshtein distances for two long strings > > I have to do a lot of comparisons, about 40 million. > > 40 million or 400 million?... :} > > I've once written a Levenshtein implementation in C, because of > this speed issue. It compares 2 20 KB files in 3 seconds. And > it does come with a threshold. > > Interested? Hi Erik, Could you post your implementation here ? Thank you, Axel
on 15.05.2008 20:30
...48,270,225 to be exact, which (without multithreading) would take a bit over 300 days. I'm with Axel, I think more than just a couple people would be interested in that slice of C.
on 16.05.2008 12:00
On May 15, 7:29 pm, John <john.d.perk...@gmail.com> wrote: > ...48,270,225 to be exact, which (without multithreading) would take a > bit over 300 days. I'm with Axel, I think more than just a couple > people would be interested in that slice of C. The wikipedia article on Levenshtein distance has a reasonable C implementation - using Ruby inline on it is fairly trivial. I used the Wikipedia version from Ruby for some sanity checks on results for my MSc thesis, but unfortunately I can't find my cleaned up code right now (at work - I can look for it later today if you like, and unless someone else has posted a full version by then) Vidar
on 16.05.2008 16:41
I've requested a new RubyForge project. Until it's available, I'll paste the code on PasteBin. Comments and suggestions are welcome. To be honest, its' my first and only C extensions 'till now... I've never released a C extension as a GEM. Anybody willing to help? gegroet, Erik V. - http://www.erikveen.dds.nl/ ---------------------------------------------------------------- Ruby library: http://pastebin.com/m36427195 C extension: http://pastebin.com/m5bdf8ceb extconf.rb: http://pastebin.com/m4c4c40b5 Unit Tests: http://pastebin.com/m6e956e5e
on 16.05.2008 16:51
The Ruby version of Levenshtein#distance didn't pass the tests... :} Here's a new version. gegroet, Erik V. - http://www.erikveen.dds.nl/ ---------------------------------------------------------------- Ruby Library: http://pastebin.com/d17d0a92b
on 16.05.2008 22:40
Let us know when your project is approved! Good work sir.
on 24.05.2008 17:23
I've released the Levenshtein module as a gem. The most expensive part of the algorithm is written in C. But before entering the most expensive part of the algorithm, it tries to optimize as much as possible (in Ruby land). There's a TAR.GZ archive and a GEM for Linux (tested), OS X (untested) and Cygwin (tested). I wasn't able to compile a version that worked with the One-Click Ruby Installer. I'm not familiar with Windows, nor with its development environment. Anyone willing to help? gegroet, Erik V. - http://www.erikveen.dds.nl/ ---------------------------------------------------------------- The Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. More information about the Levenshtein distance algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance . More information about the library: http://www.erikveen.dds.nl/levenshtein/doc/index.html .
on 25.05.2008 12:30
John wrote: > I am trying to discover similar files to reduce redundancy on a large > project. Edit distance isn't the only way to describe (dis)similarity, you know! Moreover, what you're proposing is a really brute-force approach. Defining similarity is not a trivial exercise. Two identical documents should have a similarity of one, no matter how measured, but dissimilar documents can have very different similarity figures depending on how you measure them. I suggest you look at the information retrieval literature (search engines etc).