Forum: Ruby on Rails Fuzzy searching

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
114a64bf4e0cb600120f348b6447c258?d=identicon&s=25 jennyw (Guest)
on 2005-11-26 20:16
(Received via mailing list)
Hi, everyone,

Just wondering if someone had come up with a good way to do fuzzy
searches if you use MySQL as your database (we tried switching to
PostgreSQL, but that ended up adding even more problems). Ferret sounds
great, but reading through the discussion it looks like we need to solve
the problem of write conflicts. I just wrote a post in ruby-talk about
using KirbyBase maybe to solve this. Of course, I'm hoping someone has
come up with an even better solution! ;-)

Anyway, any advice would be appreciated.

Thanks!

Jen
Ce60c4f78a63b0695e4dafc4bd7964f7?d=identicon&s=25 vanek (Guest)
on 2005-11-27 00:50
(Received via mailing list)
Here's a simple fuzzy-search algorithm you can plug in to whatever you
decide to use.
Most of this code was stolen from php: http://us2.php.net/similar_text
(in comments section).
O(m*n), where m and n are lengths of the two target strings.
You may want to rewrite it to make it look more Rubyish.


# this is the main algorithm.
# compute longest common subsequence btn strings s1 and s2
def lcs_length(s1, s2)

   m = [s1.length,128].min
   n = [s2.length,128].min
   x = 1 + [m,n].max

   # this table will be used to compute the LCS-Length,
   # only the first 128 chars per string are considered
   lcs_tbl = []
   # initialize 2-dimensional array to 0
   (0...x).each { || lcs_tbl << Array.new(x,0) }

   (1..m).each { |i|
	  (1..n).each { |j|
			 if (s1[i-1] == s2[j-1])
			   lcs_tbl[i][j] = lcs_tbl[i-1][j-1] + 1
			 elsif (lcs_tbl[i-1][j] >= lcs_tbl[i][j-1])
			   lcs_tbl[i][j] = lcs_tbl[i-1][j]
			 else
			   lcs_tbl[i][j] = lcs_tbl[i][j-1]
			 end
	  }
   }

   return lcs_tbl[m][n]
end




def get_lcs(s1, s2, verbose=false)

   # ok, now replace all spaces and weird characters with 'normal'
ascii-7
   s1 = normalize_string(s1);
   s2 = normalize_string(s2);

   # check for pathological cases
   return 0 if [s1.length, s2.length].max < 1

   puts "comparing strings #{s1} and #{s2}" if verbose
   lcs = lcs_length(s1,s2) # compute longest common subsequence

   # average string length
   ms = (s1.length + s2.length) / 2.0

   (puts "ms: #{ms} lcs: #{lcs}";puts) if verbose
   return (lcs*100)/ms
end




   # code stolen from one of the ruby MLs.
   def normalize_string(foo)
     foo = foo.downcase.strip
     foo.gsub!(/[Ã?ÁÃ?Ã?âäàãáäåāÄ?Ä?Ç?Ç?ǡǻȁÈ?ȧẵặ]/,'a')
     foo.gsub!(/[ëêéèẽÄ?Ä?Ä?ẻÈ?È?ẹȩÄ?á¸?á¸?ềếá»?á»?á¸?á¸?á»?ḝ]/,'e')
     foo.gsub!(/[Ã?ÍÃ?ĨÏiìíîĩīĭïá»?ǐá»?įÈ?È?ḭɨḯ]/,'i')
     foo.gsub!(/[Ã?Ã?Ã?Ã?Ã?òóôõōŏȯöỏÅ?Ç?ȍȏơǫọɵøá»?á»?á»?á»?ȱȫȭṍṏá¹?á¹?ờá»?ỡá»?ợǭá»?Ç¿]/,'o')
     foo.gsub!(/[Ã?Ã?Ã?ŨÃ?ùúûũūŭüủůűÇ?È?È?ưụṳųṷṵṹṻÇ?Ç?Ç?Ç?Ç?ừứữửự]/,'u')
     foo.gsub!(/[ỳýŷỹȳẏÿỷ�ƴỵ]/,'y')
     foo.gsub!(/[Å?]/,'oe')
     foo.gsub!(/[�ǼǢæ]/,'ae')
     foo.gsub!(/[ñǹÅ?]/,'n')
     foo.gsub!(/[�ç]/,'c')
     foo.gsub!(/[Ã?]/,'b')
     foo.gsub!(/[Å?]/,'oe')
     foo.gsub!(/[ij]/,'ij')
     foo.gsub!(/[\s\'\"\\\/\?\.\=\+\&\%]/,'_')
     foo.gsub!(/_+/,'_')
     foo
   end


   # a little test
   [
	  ["a","a"],["ab","ab"],["abc","abc"],
	  ["a","b"],["ab","bb"],["abc","abb"],
	  ["a","1"],["ab","_b"],["abcd","a cd"],
	  ["abcd","abcde"],["abcd","abcdefghi"],["abcde","abcdf"],
	  ["abcd","zbcd"],["abcd","efghiabcd"],["abcde","fabcd"],
	  ["abc","acb"],["abcde","abzde"],["abcde","abde"],
	  ["abc","ac"],["abcde","ab1de"],["abcde","ab de"]
   ].each { |e|
   	puts get_lcs(e[0],e[1],true)
	puts
   }
7a28bf7b0932a9486c21f1a33eb431ba?d=identicon&s=25 alltom (Guest)
on 2005-11-27 04:23
(Received via mailing list)
On 11/26/05, Lou Vanek <vanek@acd.net> wrote:
> Here's a simple fuzzy-search algorithm you can plug in to whatever you decide to use.
> Most of this code was stolen from php: http://us2.php.net/similar_text (in comments 
section).
> O(m*n), where m and n are lengths of the two target strings.
> You may want to rewrite it to make it look more Rubyish.

I believe that Jen was looking for a solution for doing the fuzzy
search on the MySQL server itself.

Sincerely,

Tom Lieber
tom@alltom.com
http://AllTom.com/
Ce60c4f78a63b0695e4dafc4bd7964f7?d=identicon&s=25 vanek (Guest)
on 2005-11-27 14:25
(Received via mailing list)
MySQL does not have a fuzzy search capability, nor does the stored
procedure language have arrays, so fuzzy search embedded w/in
the MySQL database is not possible to do efficiently (although you
can fake arrays either with temp tables or gobs of string ops).
Which leaves you with downloading the data and performing the
match outside of the db. On the surface this appears inefficient,
because, after all, the data has to be moved to the client each
time a search is processed. But considering that the search itself
is O(m*n), data transfer is not the overriding concern (unless you
have an unbearably slow transfer speed).
In fact, many fuzzy search algorithms have a much worse running-time
order [O(n^3)], so pulling the data off the server so that a more
efficient algorithm can be used may even be faster than doing the
computation centrally on the server.
114a64bf4e0cb600120f348b6447c258?d=identicon&s=25 jennyw (Guest)
on 2005-11-28 02:25
(Received via mailing list)
I may try that algorithm Lou posted. I'm going to choose between that
and using Ferret to do this ... David Balmain said that locking in
Ferret should be able to handle this without a server process except for
really high concurrency applications. Since I'm dealing with less than
10,000 records, I don't think this will be a big issue.

Thanks!

Jen
Ccaa63b09f5168501b194266b9a1e7cc?d=identicon&s=25 Scott Willson (Guest)
on 2005-12-15 03:43
(Received via mailing list)
I ran across some odd behavior with multiple subclasses of
ActiveRecord. Sometimes find() and find_all() would find subclass
records and sometimes not, depending on whether a subclass had been
loaded or not. I can imagine why this happens, and how it might be
difficult to fix, but it was surprising and took me a while to figure
out.

Here's a concrete example. Say I have three classes that inherit from
ActiveRecord, and I have one instance of each in my 'animals' table:
Dog < Mammal < Animal < ActiveRecord::Base

Run these command in order and you get:
Mammal.count() = 1
Dog.count() = 1
Mammal.count() = 2!

Here's the SQL:
SHOW FIELDS FROM animals
SELECT COUNT(*) FROM animals WHERE ( (animals.`type` = 'Mammal' ) )
SHOW FIELDS FROM animals
SELECT COUNT(*) FROM animals WHERE ( (animals.`type` = 'Dog' ) )
SELECT COUNT(*) FROM animals WHERE ( (animals.`type` = 'Mammal' OR
animals.`type` = 'Dog' ) )

Sure looks like Mammal/Rails doesn't know about Dog until I use Dog.
There's a workaround:
Dog
Mammal.count() = 2
Dog.count() = 1
Mammal.count() = 2

Not sure how easy it would be to fix, or if it even needs fixing. If
anyone wants a simple example, download:
http://www.butlerpress.com/sti.zip

Scott
47fb897113ebddae565f2f4b9f45ca1d?d=identicon&s=25 jonathan (Guest)
on 2005-12-15 07:16
Scott Willson wrote:
> Not sure how easy it would be to fix, or if it even needs fixing. If
> anyone wants a simple example, download:
> http://www.butlerpress.com/sti.zip

What about explicitly using the 'model :x' statements?

--J
This topic is locked and can not be replied to.