I read an article posted at Wikipedia about Levenshtein distance (aka edit distance).
The location of the document is
In the document, the sample Ruby code goes like the following:
class String
  def levenshtein(comparator)
    a, b = self.unpack('U*'), comparator.unpack('U*')
    n, m = a.length, b.length
    a, b, n, m = b, a, m, n if n > m

    current = [*0..n]
    1.upto(m) do |i|
      previous, current = current, [i] + [0] * n
      1.upto(n) do |j|
        add, delete = previous[j] + 1, current[j - 1] + 1
        change = previous[j - 1]
        change += 1 if a[j - 1] != b[i - 1]
        current[j] = [add, delete, change].min
      end
    end
    current[n]
  end
end
In the code, there's a parameter called comparator which seems to be used to decode the given parameter. But I can't understand what exactly the comparator is doing.
On Mon, Oct 16, 2006 at 07:15:52PM +0900, Minkoo wrote:
def levenshtein(comparator)
current[j] = [add, delete, change].min
Does anybody know the detail?
It's simply the string you're comparing against; unpack('U*') just turns the UTF-8 characters into unsigned integers:
class String
  def levenshtein(comparator)
    a, b = self.unpack('U*'), comparator.unpack('U*')
    b # => [102, 111, 111, 98, 97, 114]
    n, m = a.length, b.length
    a, b, n, m = b, a, m, n if n > m

    current = [*0..n]
    1.upto(m) do |i|
      previous, current = current, [i] + [0] * n
      1.upto(n) do |j|
        add, delete = previous[j] + 1, current[j - 1] + 1
        change = previous[j - 1]
        change += 1 if a[j - 1] != b[i - 1]
        current[j] = [add, delete, change].min
      end
    end
    current[n]
  end
end
def levenshtein(comparator)
[…]

In the code, there's a parameter called comparator which seems to be used to decode the given parameter. But I can't understand what exactly the comparator is doing.
I didn't examine the code, but I'd guess… you want the L. distance between two strings. One is self, and the other is the parameter named "comparator".
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by default?

In other words, suppose that I've launched irb and fired "foo".levenshtein("foobar"). In that case, is the string "foo" encoded as UTF-8? Do I always have to unpack the string like the code shown above?
It's simply the string you're comparing against; unpack('U*') just turns the UTF-8 characters into unsigned integers:

class String
  def levenshtein(comparator)
    a, b = self.unpack('U*'), comparator.unpack('U*')
    b # => [102, 111, 111, 98, 97, 114]
[…]

"foo".levenshtein("foobar") # => 3
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by default?

Ruby's Strings as of 1.8 are just a sequence of bytes. If a String happens to hold a UTF-8 sequence, some operations will work properly when you set $KCODE = "u".
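To see the byte vs. codepoint distinction those unpack calls are dealing with, here is a small sketch. It uses only String#unpack, whose directives do not depend on $KCODE:

```ruby
# The same UTF-8 string viewed two ways:
s = "héllo"
bytes      = s.unpack('C*')  # one integer per byte
codepoints = s.unpack('U*')  # one integer per UTF-8 character

# "é" occupies two bytes in UTF-8 (0xC3 0xA9) but is a single
# codepoint (233), so the two views have different lengths:
bytes.length       # => 6
codepoints.length  # => 5
```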
In other words, suppose that I've launched irb and fired "foo".levenshtein("foobar"). In that case, is the string "foo" encoded as UTF-8?

This depends on your locale; Ruby's Strings do not know about encodings in 1.8 (they will very soon in 1.9, according to matz's latest messages on ruby-core; search the ruby-talk archives for extensive discussions of the upcoming M17N).
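For the curious, a rough sketch of what encoding-aware Strings look like (this is 1.9-style behavior, not something 1.8 offers):

```ruby
# Sketch of M17N-style Strings: the encoding travels with the string,
# and character-level and byte-level views are both available.
s = "héllo"
s.encoding  # the Encoding object attached to the string (UTF-8 here)
s.length    # counts characters: 5
s.bytesize  # counts bytes: 6 ("é" takes two bytes in UTF-8)
```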
Do I always have to unpack the string like the code shown above?

In this particular case, it depends on two things:
- the encoding you're using
- whether the Levenshtein distance should be relative to characters or bytes

In general, it's up to you to keep track of the encodings and handle Strings appropriately.
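To make the second point concrete, here is a self-contained sketch: a minimal Levenshtein over plain arrays (a hypothetical helper, not the String#levenshtein from the article) applied to the same pair of strings both ways:

```ruby
# Minimal Levenshtein distance over two arrays of integers.
def lev(a, b)
  n, m = a.length, b.length
  a, b, n, m = b, a, m, n if n > m
  current = [*0..n]
  1.upto(m) do |i|
    previous, current = current, [i] + [0] * n
    1.upto(n) do |j|
      add, delete = previous[j] + 1, current[j - 1] + 1
      change = previous[j - 1]
      change += 1 if a[j - 1] != b[i - 1]
      current[j] = [add, delete, change].min
    end
  end
  current[n]
end

# Character-level and byte-level distance disagree on multibyte input:
lev("é".unpack('U*'), "e".unpack('U*'))  # => 1 (one character substituted)
lev("é".unpack('C*'), "e".unpack('C*'))  # => 2 (two bytes differ)
```

For pure-ASCII input the two views coincide, which is why byte-based code often appears to work until the first non-ASCII string shows up.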
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by default?

As of Ruby 1.8, it doesn't, but see the other responses.

In other words, suppose that I've launched irb and fired "foo".levenshtein("foobar"). In that case, is the string "foo" encoded as UTF-8?
The code makes that assumption, correct or not. If you’re one of those
ignorant yokels who believes that characters are bytes, it’s also a
safe assumption. Now, if you’re one of those people who does i18n,
l10n, and m17n before breakfast, you’ll have alarm bells going off
inside your head immediately, as such assumptions are in general very
dangerous to make. A cardinal rule in this kind of programming is that
strings are meaningless without an attached encoding.
Do I always have to unpack the string like the code shown above?
It would be better; at least that keeps your options open when you attempt to internationalize your program. You wind up supporting Unicode at the very least, and that makes the transition to internationalization a bit easier, as Unicode is a reasonable choice of character set if one wishes to do internationalization.
The version of Levenshtein in the Text project
(http://rubyforge.org/projects/text) will work on either UTF-8
codepoints or bytes depending on the value of $KCODE. It’s the best we
can do for now; in the future, of course, we will be able to decide
based on the strings’ encodings themselves.
</blatant plug>
Paul.