String 'close-to' comparison

Hello, I have an array. It contains approximately twenty elements which
are strings. I also have one string - this string was obtained using an
OCR system. One of the strings in the array should ‘match’ the string
gotten using the OCR system - unfortunately OCRs aren’t perfect!

I want to take this string, and compare it to every string in the array,
and attempt to return the closest match.

For example:
array = ['Hello there, how are you?', 'What did you do over your break?',
         'I like my coffee brown.', 'I just bought a new car.']
string = "What did you d0 over your brcak?"

And then have my comparison function return array[1]. As you can see,
string has some ‘OCR errors’ - it’s usually 80-95% accurate, if not
dead-on.


Thanks, Kyle ‘Phenax’ Hunter
http://keletech.org/blog/

If you know all the possibilities that your OCR system could pick up
then you could always do something like this…

known_strings = ['Hello', 'Goodbye']
out = []
ocr_strings = []  # fill with the strings read by the OCR system

ocr_strings.each do |ocr|
  known_strings.each do |known|
    # Count the positions where both strings have the same character.
    matches = 0
    [known.length, ocr.length].min.times do |i|
      matches += 1 if ocr[i] == known[i]
    end
    # Float division; integer division would only ever give 0 or 1.
    if matches.to_f / known.length > 0.85
      out << known
    else
      out << "!#{known}"
    end
  end
end

…completely untested, but I think you know what I'm getting at

  if matches.to_f / known.length > 0.85
    out << known
  else
    out << "!#{known}"
  end

should be more like

  if matches.to_f / known.length > 0.85
    out << known
  end

Hi,

Here is some simple score-matching code:

array = ['Hello there, how are you?', 'What did you do over your break?',
         'I like my coffee brown.', 'I just bought a new car.']
string = "What did you d0 over your brcak?"

# Compare the sets of distinct characters in the two strings.
# Lower is closer: 0.5 when both sets are identical, 1.0 when they share nothing.
def comp(str1, str2)
  a = str1.split('').uniq
  b = str2.split('').uniq
  (a + b).uniq.length * 1.0 / (a.length + b.length)
end

puts array.sort_by { |x| comp(string, x) }.first

Regards,
Park H.

On Apr 2, 2008, at 10:32 PM, Kyle H. wrote:


http://amatch.rubyforge.org/

a @ http://codeforpeople.com/

It sounds like what you want is something like the Levenshtein
distance (http://en.wikipedia.org/wiki/Levenshtein_distance).
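
A rough sketch of how that could look in plain Ruby, assuming about twenty short strings so that speed is not a concern. The method names levenshtein and closest_match are made up for this example and do not come from any library; the array and string are the ones from the original question.

```ruby
# Hypothetical helper: Levenshtein (edit) distance via dynamic programming,
# keeping only the previous row of the table.
def levenshtein(a, b)
  # prev[j] = distance between the processed prefix of a and the first j chars of b
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,          # deletion
               curr[j - 1] + 1,      # insertion
               prev[j - 1] + cost].min # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: pick the candidate with the smallest edit distance.
def closest_match(candidates, ocr_text)
  candidates.min_by { |candidate| levenshtein(ocr_text, candidate) }
end

array = ['Hello there, how are you?', 'What did you do over your break?',
         'I like my coffee brown.', 'I just bought a new car.']
string = 'What did you d0 over your brcak?'

puts closest_match(array, string)
```

Unlike a position-by-position comparison, the edit distance also copes with characters the OCR dropped or added, since insertions and deletions each cost only one step.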

HTH,
Chris
