Peter Norvig wrote a simple spelling corrector in 20 lines of Python 2.5, so I thought I'd see what it looks like in Ruby. I'm not too pleased with my version, if anyone can make it more elegant, that would be great. Some of the sticking points were: 1) List comprehensions in Python made the edits1 function more elegant IMO. Hopefully someone can improve that function. 2) The boolean expression in the correct function evaluates empty sets/arrays as false in Python but not in Ruby, so I had to add the "result.empty? ? nil : result" expression to several functions. I expect there's a better way to handle this also. 3) Minor point, but apparently Python has a built in constant for the set of lower case characters "string.lowercase", so I just defined a constant. Otherwise, the translation was pretty straightforward. Here's a link to Norvig's page: http://www.norvig.com/spell-correct.html That page includes a link to a text file that I saved locally as holmes.txt: http://www.norvig.com/holmes.txt Note: I wrapped a few of the longer lines for posting. def words text text.downcase.scan(/[a-z]+/) end def train features model = Hash.new(1) features.each {|f| model[f] += 1 } return model end NWORDS = train(words(File.new('holmes.txt').read)) LETTERS = 'abcdefghijklmnopqrstuvwxyz' def edits1 word n = word.length deletion = (0...n).collect {|i| word[0...i]+word[i+1..-1] } transposition = (0...n-1).collect { |i| word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] } alteration = [] n.times {|i| LETTERS.each_byte { |l| alteration << word[0...i]+l.chr+word[i+1..-1] } } insertion = [] (n+1).times {|i| LETTERS.each_byte { |l| insertion << word[0...i]+l.chr+word[i..-1] } } result = deletion + transposition + alteration + insertion result.empty? ? nil : result end def known_edits2 word result = [] edits1(word).each {|e1| edits1(e1).each { |e2| result << e2 if NWORDS.has_key?(e2) }} result.empty? ? nil : result end def known words result = words.find_all {|w| NWORDS.has_key?(w) } result.empty? ? nil : result end def correct word (known([word]) or known(edits1(word)) or known_edits2(word) or [word]).max {|a,b| NWORDS[a] <=> NWORDS[b] } end Brian Adkins http://lojic.com/blog/
on 10.04.2007 03:06
on 10.04.2007 05:38
Brian Adkins wrote: > 3) Minor point, but apparently Python has a built in constant for the set > of lower case characters "string.lowercase", so I just defined a constant. ... > LETTERS = 'abcdefghijklmnopqrstuvwxyz' A minor comment: you can construct this string with ("a".."z").to_a.join => "abcdefghijklmnopqrstuvwxyz"
on 10.04.2007 09:52
On 4/10/07, Joel VanderWerf <vjoel@path.berkeley.edu> wrote: > Brian Adkins wrote: > > 3) Minor point, but apparently Python has a built in constant for the set > > of lower case characters "string.lowercase", so I just defined a constant. > ... > > LETTERS = 'abcdefghijklmnopqrstuvwxyz' > > A minor comment: you can construct this string with > > ("a".."z").to_a.join > => "abcdefghijklmnopqrstuvwxyz" Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a. The performance is about the same, the former seems faster for short arrays/ranges ~ 500 and from about 1k elements the later gets slightly faster. I feel that [*a..b] is the right/better/more logical construct to use if one wants an array, or do I miss something? Cheers Robert
on 10.04.2007 17:06
On Tue, 10 Apr 2007 16:52:03 +0900, Robert Dober wrote: >> => "abcdefghijklmnopqrstuvwxyz" > > Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a. Interesting. I didn't realize that would work with ranges. Here's what I read in "Programming Ruby" p. 348: "Following these parameters may be a single parameter prefixed with an asterisk. If this parameter is an array, Ruby replaces it with zero or more parameters corresponding to the elements of the array." I guess to_a is called on the range implicitly prior to evaluating *.
on 10.04.2007 18:11
On 4/10/07, Brian Adkins <lojicdotcomNOSPAM@gmail.com> wrote: > >> > more parameters corresponding to the elements of the array." > > I guess to_a is called on the range implicitly prior to evaluating *. Hmm I guessed rather not, but I am wrong: 429/5 > cat ast.rb # vim: sts=2 sw=2 tw=0 expandtab nu: a = [*1..2] robert@PC:~/log/ruby/tests/theory 18:09:04 430/6 > ruby -rprofile ast.rb % cumulative self self total time seconds seconds calls ms/call ms/call name 0.00 0.00 0.00 1 0.00 0.00 Enumerable.to_a 0.00 0.00 0.00 1 0.00 0.00 Range#each 0.00 0.01 0.00 1 0.00 10.00 #toplevel Good thinking there Brian !! Cheers Robert > > > > The performance is about the same, the former seems faster for short > > arrays/ranges ~ 500 > > and from about 1k elements the later gets slightly faster. > > I feel that [*a..b] is the right/better/more logical construct to use > > if one wants an array, or do I miss something? Well I did as shown above, but I still rather go for the shorter ;)
on 11.04.2007 10:56
Brian Adkins wrote: > "result.empty? ? nil : result" expression to several functions. I expect > holmes.txt: http://www.norvig.com/holmes.txt > return model > alteration = [] > n.times {|i| LETTERS.each_byte { > |l| alteration << word[0...i]+l.chr+word[i+1..-1] } } > insertion = [] > (n+1).times {|i| LETTERS.each_byte { > |l| insertion << word[0...i]+l.chr+word[i..-1] } } > result = deletion + transposition + alteration + insertion > result.empty? ? nil : result > end Letters = ('a'..'z').to_a class String def replace( which, what ) self[0...which] + what + self[which+1 .. -1] end end def edits1 word n = word.size return nil if n.zero? deletion = (0...n).map{|pos| word.replace(pos, "") } transposition = (0...n-1).map{|i| word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] } alteration = (0 ... n).to_a.map{|pos| Letters.map{|ch| word.replace(pos, ch) }}.flatten insertion = (0 .. n).to_a.map{|pos| Letters.map{|ch| word[0...pos] + ch + word[pos..-1] }}.flatten deletion + transposition + alteration + insertion end
on 12.04.2007 01:10
On Apr 11, 2:54 am, "William James" <w_a_x_...@yahoo.com> wrote: > class String > def replace( which, what ) > self[0...which] + what + self[which+1 .. -1] > end > end Cleaner, IMHO: class String def replace( where, what ) s2 = dup s2[ where ] = what s2 end end
on 12.04.2007 01:50
This may be a bit OT, but I am really struck as to how elegantly both of these languages (and the programmers as well) handle this seemingly difficult problem. I've been working with ruby mainly for fun over the past 8-9 months and seeing this solution (in Python and ruby) really heightens my respect for the elegance and expressiveness of both languages. </rave>
on 12.04.2007 18:05
I am so glad I find this topic here! I was on the plane yesterday, and
just like Peter Norwig I managed to write a spelling corrector. An
obvious difference is that he started from nothing, while I started
from his code... Anyway, I did it in Ruby, it works, but... it's dead
slow. If I need to compute the second array (the one with 2 mistakes),
it takes about 1 second per suggestion.
My edit1 function is like this:
----------------------------
def edits1(word)
n = word.length
return ( (0..n-1).collect {|i| word[0,i] + word[i+1, n-1]} +
#deletion
(0..n-2).collect {|i| word[0,i] + word[i+1,1] +word[i,1] +
word[i+2, n-1]} + #transposition
('a'..'z').collect {|c| (0..n-1).collect { |i| word[0,i] + c +
word[i+1, n-1]} } + #alteration
('a'..'z').collect {|c| (0..n).collect { |i| word[0,i] + c +
word[i, n]} } ).flatten #insertion
end
---------------------------------------------
I believe this is where I lose most of the time? when I call it, it
looks like that:
--------------------------------------------
edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten
--------------------------------------------
Can anyone see here a flaw in logic, or some function that should be
avoided because it's known as slow?
on 12.04.2007 20:37
At the risk of being off-topic or uninformed, I guess this > -------------------------------------------- > edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten > -------------------------------------------- is overkill. The thing is that there are WAY too many off-by-two mistakes but too few words to be worth that computation. Moreover, most of the "corrections" are nonsense: "xx" "zx" "hh" "jk" "bb" almost surely. This is where dictionary metrics must have come into being, I think. (I have not the original post, sorry, so this may be out of place). Take into account that for a 4-letter word like "riot" you are letting the user have up to 2 mistakes, that means from: "at" to (at least) "tricot" which ... seems too loose to me. I am utterly unaware of the way dictionaries work, but I bet they use hashes (in the sense of "a (key, value) list" and in the proper classical sense of "a good function for distinguishing and identifying items"), as well as ad-hoc metrics A LOT. I'd rather stick with words of the same length and at most off-by-one mistakes (or in obvious cases, for longer words, other possibilities). There must be a kind of metric-hash pair to play with the duality: number of mistakes "allowed" / number of "similar" real words / length of the word. Obviously, "termodynamics" gives rise to few doubts, but what about "lykes" "bykes" "Mykes" "Bites" "Bytes" "Likes" "Like" "Myke" "Mike" "Byke"... and this without thinking. I guess the hash/metric problem is the real game in Dictionary software. I'd be glad to help if I find the time, though. But it seems a trodden path to me (apart from the doing it for the joy of it, of course). 2c Pedro El 12/04/2007, a las 18:04, Nicolas Buet escribió: > n = word.length > end > avoided because it's known as slow? >> >> </rave> >> >> >> -- >> Posted via http://www.ruby-forum.com/. >> >> > Pedro Fortuny Ayuso pfortuny@gmail.com http://pfortuny.sdf-eu.org C/Capuchinos 14, 1-S. 47006 Valladolid. SPAIN
on 12.04.2007 21:37
> I am so glad I find this topic here! I was on the plane yesterday, and > just like Peter Norwig I managed to write a spelling corrector. An > obvious difference is that he started from nothing, while I started > from his code... Anyway, I did it in Ruby, it works, but... it's dead > slow. If I need to compute the second array (the one with 2 mistakes), > it takes about 1 second per suggestion. I avoided doing two-character edits. Instead I use Text::Metaphone and mutate that string and Text::Levenshtein for filtering by distance. It seems to work fairly well. (You will need the text gem to try this out.) The code (together with the holmes.txt as a sample for training it) is available at http://flgr.0x42.net/code/spell_correct/
on 13.04.2007 04:15
On Fri, 13 Apr 2007 01:04:39 +0900, Nicolas Buet wrote: > n = word.length > --------------------------------------------- > > I believe this is where I lose most of the time? when I call it, it > looks like that: > > -------------------------------------------- > edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten > -------------------------------------------- > > Can anyone see here a flaw in logic, or some function that should be > avoided because it's known as slow? You may want to compare to my original at the start of this thread. I don't have your code, so I can't benchmark it, but I have no noticeable lag when I correct a word. It may have something to do with you having two ranges, two collects and a flatten just to get the insertion set. Compare ('a'..'z').collect {|c| (0..n).collect { |i| word[0,i] + c + word[i, n]} } ).flatten to: (n+1).times {|i| LETTERS.each_byte { |l| insertion << word[0...i]+l.chr+word[i..-1] } }
on 15.04.2007 07:56
> > from his code... Anyway, I did it in Ruby, it works, but... it's dead > > slow. If I need to compute the second array (the one with 2 mistakes), > > it takes about 1 second per suggestion. > > Can anyone see here a flaw in logic, or some function that should be > > avoided because it's known as slow? > You may want to compare to my original at the start of this thread. I > don't have your code, so I can't benchmark it, but I have no noticeable > lag when I correct a word. It may have something to do with you having two > ranges, two collects and a flatten just to get the insertion set. Compare I haven't taken the time to put this together as code -- sorry -- but just off the top of my head, in Norvig's post, the Python version, he uses set() a lot. My Python's rusty - well, that sounds a bit obscene, but what I mean is, my skill in Python is imperfect - anyway, point is, I'm not **certain**, but I think to get the Python set() in Ruby you need a gem. Using arrays instead of sets could be the source of the bottleneck. Hang on, I just checked the docs (the docs being Google) and you don't need a gem, it's in the standard library. All you need to do is: require 'set' and then either xyz = Set.new or xyz = %w{x y z}.to_set (etc.) at which point you'll probably get the benefit of Norvig's design, and things will go faster.