I read an article posted at Wikipedia about Levenshtein distance (aka edit distance).
The location of the document is
In the document, the sample Ruby code goes like the following:
class String
  def levenshtein(comparator)
    a, b = self.unpack('U*'), comparator.unpack('U*')
    n, m = a.length, b.length
    a, b, n, m = b, a, m, n if n > m

    current = [*0..n]
    1.upto(m) do |i|
      previous, current = current, [i] + [0] * n
      1.upto(n) do |j|
        add, delete = previous[j] + 1, current[j - 1] + 1
        change = previous[j - 1]
        change += 1 if a[j - 1] != b[i - 1]
        current[j] = [add, delete, change].min
      end
    end
    current[n]
  end
end
In the code, there's a parameter called comparator which seems to be used to decode the given parameter. But I can't understand what exactly the comparator is doing.
On Mon, Oct 16, 2006 at 07:15:52PM +0900, Minkoo wrote:
def levenshtein(comparator)
current[j] = [add, delete, change].min
Does anybody know the detail?
It's simply the string you're comparing against; unpack('U*') just turns the UTF-8 characters into unsigned integers:
class String
  def levenshtein(comparator)
    a, b = self.unpack('U*'), comparator.unpack('U*')
    b # => [102, 111, 111, 98, 97, 114]
    n, m = a.length, b.length
    a, b, n, m = b, a, m, n if n > m

    current = [*0..n]
    1.upto(m) do |i|
      previous, current = current, [i] + [0] * n
      1.upto(n) do |j|
        add, delete = previous[j] + 1, current[j - 1] + 1
        change = previous[j - 1]
        change += 1 if a[j - 1] != b[i - 1]
        current[j] = [add, delete, change].min
      end
    end
    current[n]
  end
end
def levenshtein(comparator)
[…]

In the code, there's a parameter called comparator which seems to be used to decode the given parameter. But I can't understand what exactly the comparator is doing.
I didn't examine the code, but I'd guess… you want the L. distance between two strings. One is self, and the other is the parameter named "comparator".
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by default?

In other words, suppose that I've launched irb and fired "foo".levenshtein("foobar"). In that case, is the string "foo" encoded as UTF-8? Do I always have to unpack the string like the code shown above?
It's simply the string you're comparing against; unpack('U*') just turns the UTF-8 characters into unsigned integers:

class String
  def levenshtein(comparator)
    a, b = self.unpack('U*'), comparator.unpack('U*')
    b # => [102, 111, 111, 98, 97, 114]
[…]

"foo".levenshtein("foobar") # => 3
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by default?

Ruby's Strings as of 1.8 are just a sequence of bytes. If a String happens to hold a UTF-8 sequence, some operations will work properly when you set $KCODE = "u".
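To see the byte vs. codepoint distinction those unpack calls are dealing with, here is a small sketch. It uses only String#unpack, whose directives do not depend on $KCODE:

```ruby
# The same UTF-8 string viewed two ways:
s = "héllo"
bytes      = s.unpack('C*')  # one integer per byte
codepoints = s.unpack('U*')  # one integer per UTF-8 character

# "é" occupies two bytes in UTF-8 (0xC3 0xA9) but is a single
# codepoint (233), so the two views have different lengths:
bytes.length       # => 6
codepoints.length  # => 5
```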
In other words, suppose that I've launched irb and fired "foo".levenshtein("foobar"). In that case, is the string "foo" encoded as UTF-8?

This depends on your locale; Ruby's Strings do not know about encodings in 1.8 (they will very soon in 1.9, according to matz's latest messages on ruby-core; search the ruby-talk archives for extensive discussions of the upcoming M17N).
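For the curious, a rough sketch of what encoding-aware Strings look like (this is 1.9-style behavior, not something 1.8 offers):

```ruby
# Sketch of M17N-style Strings: the encoding travels with the string,
# and character-level and byte-level views are both available.
s = "héllo"
s.encoding  # the Encoding object attached to the string (UTF-8 here)
s.length    # counts characters: 5
s.bytesize  # counts bytes: 6 ("é" takes two bytes in UTF-8)
```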
Do I always have to unpack the string like the code shown above?

In this particular case, it depends on two things:
- the encoding you're using
- whether the Levenshtein distance should be relative to characters or bytes

In general, it's up to you to keep track of the encodings and handle Strings appropriately.
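To make the second point concrete, here is a self-contained sketch: a minimal Levenshtein over plain arrays (a hypothetical helper, not the String#levenshtein from the article) applied to the same pair of strings both ways:

```ruby
# Minimal Levenshtein distance over two arrays of integers.
def lev(a, b)
  n, m = a.length, b.length
  a, b, n, m = b, a, m, n if n > m
  current = [*0..n]
  1.upto(m) do |i|
    previous, current = current, [i] + [0] * n
    1.upto(n) do |j|
      add, delete = previous[j] + 1, current[j - 1] + 1
      change = previous[j - 1]
      change += 1 if a[j - 1] != b[i - 1]
      current[j] = [add, delete, change].min
    end
  end
  current[n]
end

# Character-level and byte-level distance disagree on multibyte input:
lev("é".unpack('U*'), "e".unpack('U*'))  # => 1 (one character substituted)
lev("é".unpack('C*'), "e".unpack('C*'))  # => 2 (two bytes differ)
```

For pure-ASCII input the two views coincide, which is why byte-based code often appears to work until the first non-ASCII string shows up.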
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by default?

As of Ruby 1.8, it doesn't, but see the other responses.

In other words, suppose that I've launched irb and fired "foo".levenshtein("foobar"). In that case, is the string "foo" encoded as UTF-8?
The code makes that assumption, correct or not. If you’re one of those
ignorant yokels who believes that characters are bytes, it’s also a
safe assumption. Now, if you’re one of those people who does i18n,
l10n, and m17n before breakfast, you’ll have alarm bells going off
inside your head immediately, as such assumptions are in general very
dangerous to make. A cardinal rule in this kind of programming is that
strings are meaningless without an attached encoding.
Do I always have to unpack the string like the code shown above?
It would be better; at least that keeps your options open when you attempt to internationalize your program. You wind up supporting Unicode at the very least, and that makes the transition to internationalization a bit easier, as Unicode is a reasonable choice of character set if one wishes to do internationalization.
The version of Levenshtein in the Text project
(http://rubyforge.org/projects/text) will work on either UTF-8
codepoints or bytes depending on the value of $KCODE. It’s the best we
can do for now; in the future, of course, we will be able to decide
based on the strings’ encodings themselves.
</blatant plug>
Paul.