Forum: Ruby Proximity searches in Ruby

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Cf25fbf53c67e27d95845e77e949b56f?d=identicon&s=25 Stuart Clarke (sclarke)
on 2008-12-05 14:28
Does Ruby have the ability to perform proximity searches on data. For
example find the word "hello" and the word "world" within 10 words of
eachother and print out some data?

Thanks a lot
391f9b787cdc12aa2c179713f5103e3a?d=identicon&s=25 Ilan Berci (iberci)
on 2008-12-06 00:36
No proximity searches with 1.8.. you would need a full fledged text
search engine such as lucerne, sphinx for that type of capability

ilan


Stuart Clarke wrote:
> Does Ruby have the ability to perform proximity searches on data. For
> example find the word "hello" and the word "world" within 10 words of
> eachother and print out some data?
>
> Thanks a lot
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-12-06 13:25
(Received via mailing list)
On 06.12.2008 00:29, Ilan Berci wrote:
> No proximity searches with 1.8.. you would need a full fledged text
> search engine such as lucerne, sphinx for that type of capability

Or, depending on requirements (especially performance) cook your own.

def proximity_search(text, distance, w1, w2)
   w1 = w1.downcase
   w2 = w2.downcase
   index1 = []
   index2 = []

   text.scan(/\w+/).each_with_index do |w,idx|
     case
       when w1 == w.downcase
         index1 << idx
       when w2 == w.downcase
         index2 << idx
     end
   end

   found = false

   index1.each do |i1|
     index2.each do |i2|
       if (i1 - i2).abs <= distance
         found = true
         yield i1, i2 if block_given?
       end
     end
   end

   found
end

text = <<EOT
Hello world, how are you today? I said "Hello"
to the other guy but he would not answer although
all the world could hear me.
EOT

proximity_search text, 3, "hello", "World" do |a,b|
   printf "Found at positions %3d and %3d\n", a, b
end

p proximity_search(text, 3, "hello", "World")

Cheers

  robert
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2008-12-06 13:44
(Received via mailing list)
On Sat, Dec 6, 2008 at 1:18 PM, Robert Klemme
<shortcutter@googlemail.com> wrote:
Hmm
maybe you can scale this down

File::read("test.data").scan  %r{Hello.{0,10}World}im

Should work fine for reasonable file sizes ( < Available Memory )

HTH
Robert

--
Il computer non è una macchina intelligente che aiuta le persone
stupide, anzi, è una macchina stupida che funziona solo nelle mani
delle persone intelligenti.
Computers are not smart to help stupid people, rather they are stupid
and will work only if taken care of by smart people.

Umberto Eco
86e33dee4a89a8879a26487051c216a8?d=identicon&s=25 Michael Fellinger (Guest)
on 2008-12-06 14:13
(Received via mailing list)
> File::read("test.data").scan  %r{Hello.{0,10}World}im

That would be 10 characters

^ manveru
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2008-12-06 14:37
(Received via mailing list)
On Sat, Dec 6, 2008 at 2:06 PM, Michael Fellinger
<m.fellinger@gmail.com> wrote:
>> File::read("test.data").scan  %r{Hello.{0,10}World}im
>
> That would be 10 characters
Huh??
cat test.data && echo "=============================" && cat scan.rb
&& ./scan.rb
Hello World hello **
* world


hello do not find this world
=============================
#!/usr/local/bin/ruby
# vim: sw=2 ts=2 ft=ruby expandtab tw=0 nu syn=on:
# file: scan.rb

p File::read("test.data").scan(  %r<Hello.{0,10}World>im )

["Hello World", "hello **\n* world"]

R.
Cf7cd97cdc8ed7d4ae92965b24f0dfad?d=identicon&s=25 Stefan Rusterholz (apeiros)
on 2008-12-06 14:50
Hi there

Thanks to Robert we have something to build up on. If you only need to
know whether w1 and w2 occur in the given distance within the text or
not (true/false) then we can improve his algorithm and exclude the
O(n^2) part:

Robert Klemme wrote:
> def proximity_search(text, distance, w1, w2)
>    w1 = w1.downcase
>    w2 = w2.downcase
>    index1 = []
>    index2 = []
>
>    text.scan(/\w+/).each_with_index do |w,idx|
>      case
>        when w1 == w.downcase
>          index1 << idx
>        when w2 == w.downcase
>          index2 << idx
>      end
>    end
>
>    found = false
>
>    index1.each do |i1|
>      index2.each do |i2|
>        if (i1 - i2).abs <= distance
>          found = true
>          yield i1, i2 if block_given?
>        end
>      end
>    end
>
>    found
> end

module StringProximitySearch
  def proximity_search(w1, w2, distance=1)
    w1   = w1.downcase
    w2   = w2.downcase
    idx1 = -(distance*2)
    idx2 = -(distance*2)

    scan(/\w+/).each_with_index do |w,idx|
      case w.downcase
        when w1
          idx1 = idx
          return true if idx1-idx2 <= distance
        when w2 then index2 << idx
          idx2 = idx
          return true if idx2-idx1 <= distance
      end
    end

    false
  end
end

class String
  include StringProximitySearch
end

your_test_string.proximity_search(w1, w2, distance)

NB: This code is untested.

Regards
Stefan
Cf7cd97cdc8ed7d4ae92965b24f0dfad?d=identicon&s=25 Stefan Rusterholz (apeiros)
on 2008-12-06 15:01
Robert Dober wrote:
> File::read("test.data").scan  %r{Hello.{0,10}World}im

Another nice idea. And another mainly untested improvement:
module StringProximitySearch
  def proximity_search(w1, w2, distance=0)
    self =~
%r{#{Regexp.escape(w1)}\W+(?:\w+\W+){0,#{distance}}#{Regexp.escape(w2)}}im
  end
end

Regards
Stefan
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-12-07 12:15
(Received via mailing list)
On 06.12.2008 14:30, Robert Dober wrote:
> On Sat, Dec 6, 2008 at 2:06 PM, Michael Fellinger <m.fellinger@gmail.com> wrote:
>>> File::read("test.data").scan  %r{Hello.{0,10}World}im
>> That would be 10 characters
> Huh??

I guess Michael's point was that your version is not a scaled down
version of mine because it defines "proximity" differently: your version
uses character count while my version uses word count.

Another difference is that in your case order matters.  Here's an
attempt to fix both:

# untested
s.scan %r{
   \b
   (?:
       hello\W+(?:\w+\W+){0,10}world
   |   world\W+(?:\w+\W+){0,10}hello
   )
   \b
}mix

Cheers

  robert
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2008-12-07 12:58
(Received via mailing list)
On Sun, Dec 7, 2008 at 12:08 PM, Robert Klemme
<shortcutter@googlemail.com> wrote:
>
> I guess Michael's point was that your version is not a scaled down version
> of mine because it defines "proximity" differently: your version uses
> character count while my version uses word count.
Use words, I see, if one uses words I can understand that ;)

Actually I was just wondering if we should not keep track of the
positions we found, what good would a search do if we cannot go there
:(.
>
> Another difference is that in your case order matters.  Here's an attempt to
Does it not? Hmm right in a search engine it indeed does not, agreed...
> }mix
r = []
s.scan( your stuff ) do |m| r << [$`.size, m] end
r

R
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-12-07 13:30
(Received via mailing list)
I played a bit around and came up with a more involved version which
works with arbitrary numbers of words but still fits into a few lines of
code.  Another advantage is that it identifies character positions with
matches in the text.  Note that this allows a maximum distance between
the first and the last word of (distance * (words - 1)).  It could be
modified pretty easily (with an additional method in Array) to keep
*all* words within distance.

Have fun

  robert


#!/bin/env ruby

# search words in arbitrary order where
# pairs of words have a max distance between
# them
ProximitySearchData = Struct.new :word, :wpos, :spos

def proximity_search(text, distance, *words)
   sdata = words.map {|w| ProximitySearchData.new w.downcase}

   wpos = 0

   text.scan %r{\w+}i do |match|
     match = match.downcase
     pos = $`.length

     sdata.each do |sd|
       if sd.word == match
  sd.spos = pos
  sd.wpos = wpos
  break :change
       end
     end == :change and
     sdata.all? {|sd| sd.wpos} and
     sdata.
       sort_by {|sd| sd.wpos}.
       each_cons(2).
       all? {|sd1,sd2| sd2.wpos - sd1.wpos <= distance} and
     yield *sdata.map {|sd| sd.spos}

     wpos += 1
   end
end

text = <<EOT
Hello world, how are you today? I said "Hello"
to the other guy but he would not answer although
all the world could hear me.
EOT

proximity_search text, 3, "hello", "World" do |a,b|
   printf "Found at char positions %3d and %3d\n", a, b
end
Fe5660da5930df2d7b34b6066c2d16bb?d=identicon&s=25 Jens Wille (jwille)
on 2008-12-08 15:29
(Received via mailing list)
hi!

Robert Klemme [2008-12-07 13:23]:
> I played a bit around and came up with a more involved version
> which works with arbitrary numbers of words but still fits into a
> few lines of code.
here's my take on the task: Poor Man's Search ;-)

  <http://github.com/blackwinter/pms>

it's a different approach with support for boolean operators as well
as - back to topic! - proximity operators. pms builds an index for
the input documents and stores the token positions along with the
document numbers. still pretty rough, but working...

> text = <<EOT
> Hello world, how are you today? I said "Hello"
> to the other guy but he would not answer although
> all the world could hear me.
> EOT
>
> proximity_search text, 3, "hello", "World" do |a,b|
>   printf "Found at char positions %3d and %3d\n", a, b
> end

  require 'pms/ext'

  search = text.search('hello').near('world', 3)

  p search.results
  #=> [0]

  p search.results_with_positions
  #=> {0=>[0, 8]}

  p search.matches
  #=> ["Hello world, how are you today? I said \"Hello\"\n"]

cheers
jens
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-12-08 15:57
(Received via mailing list)
2008/12/8 Jens Wille <jens.wille@uni-koeln.de>:
> here's my take on the task: Poor Man's Search ;-)

I like that name. :-)

>> all the world could hear me.
>> EOT
>>
>> proximity_search text, 3, "hello", "World" do |a,b|
>>   printf "Found at char positions %3d and %3d\n", a, b
>> end
>
>  require 'pms/ext'
>
>  search = text.search('hello').near('world', 3)

Nice!  I like the idea to create the search criteria by calling methods.

But I have a remark from an API point of view: I'd probably separate
the search specification from the index building.  If I read your code
properly then you create the search criterion from the input text.  If
you use that approach the index will be recreated for every other
search.  It's probably cleaner and also more efficient if query
creation, index creation and search are separated.

Kind regards

robert
Fe5660da5930df2d7b34b6066c2d16bb?d=identicon&s=25 Jens Wille (jwille)
on 2008-12-08 16:12
(Received via mailing list)
hi robert!

Robert Klemme [2008-12-08 15:50]:
> 2008/12/8 Jens Wille <jens.wille@uni-koeln.de>:
>> here's my take on the task: Poor Man's Search ;-)
>
> I like that name. :-)
thx ;-)

> But I have a remark from an API point of view: I'd probably
> separate the search specification from the index building.  If I
> read your code properly then you create the search criterion from
> the input text.  If you use that approach the index will be
> recreated for every other search.  It's probably cleaner and also
> more efficient if query creation, index creation and search are
> separated.
well, they are (index creation and search, at least). that's why i
considered writing this little lib in the first place. i didn't like
the fact that the other examples had to do all the hard work over
and over again. so i just build an index (once) that you can operate
on as many times as you want. it's not clear from the example though
because i showed the "nice" version. the more ugly one (but surely
more efficient one, you're completely right there) goes as follows:

  require 'pms'

  # build the index - once
  pms = PMS.new(text)

  # perform searches - many
  pms.search('hello').near('world', 3)
  pms.search('you').or('guy')
  pms.search(/wo.*d/).not { |q| q.search('you').or('guy') }

i'm not sure in which way to drive this little effort further, if at
all, but it was definitely fun to write...

cheers
jens
This topic is locked and can not be replied to.