Proximity searches in Ruby

sclarke · December 5, 2008, 2:28pm

Does Ruby have the ability to perform proximity searches on data? For
example, find the word “hello” and the word “world” within 10 words of
each other and print out some data?

Thanks a lot

sclarke · December 6, 2008, 12:36am

No proximity searches with 1.8… you would need a full-fledged text
search engine such as Lucene or Sphinx for that type of capability.

ilan

Stuart C. wrote:

Does Ruby have the ability to perform proximity searches on data? For
example, find the word “hello” and the word “world” within 10 words of
each other and print out some data?

Thanks a lot

sclarke · December 6, 2008, 1:25pm

On 06.12.2008 00:29, Ilan B. wrote:

No proximity searches with 1.8… you would need a full-fledged text
search engine such as Lucene or Sphinx for that type of capability.

Or, depending on requirements (especially performance) cook your own.

def proximity_search(text, distance, w1, w2)
  w1 = w1.downcase
  w2 = w2.downcase
  index1 = []
  index2 = []

  # record the word positions of every occurrence of w1 and w2
  text.scan(/\w+/).each_with_index do |w,idx|
    case
    when w1 == w.downcase
      index1 << idx
    when w2 == w.downcase
      index2 << idx
    end
  end

  found = false

  # compare every w1 position with every w2 position
  index1.each do |i1|
    index2.each do |i2|
      if (i1 - i2).abs <= distance
        found = true
        yield i1, i2 if block_given?
      end
    end
  end

  found
end

text = <<EOT
Hello world, how are you today? I said "Hello"
to the other guy but he would not answer although
all the world could hear me.
EOT

proximity_search text, 3, "hello", "World" do |a,b|
  printf "Found at positions %3d and %3d\n", a, b
end

p proximity_search(text, 3, "hello", "World")

Cheers

robert

sclarke · December 6, 2008, 1:44pm

On Sat, Dec 6, 2008 at 1:18 PM, Robert K. wrote:

Hmm, maybe you can scale this down:

File::read("test.data").scan %r{Hello.{0,10}World}im

Should work fine for reasonable file sizes (< available memory).

HTH
Robert

–
Il computer non è una macchina intelligente che aiuta le persone
stupide, anzi, è una macchina stupida che funziona solo nelle mani
delle persone intelligenti.
Computers are not smart to help stupid people, rather they are stupid
and will work only if taken care of by smart people.

Umberto Eco

sclarke · December 6, 2008, 2:13pm

File::read("test.data").scan %r{Hello.{0,10}World}im

That would be 10 characters.

^ manveru
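
To make the character-versus-word distinction concrete, here is a small sketch. The sample sentence and the variable names are made up for illustration; the word-based pattern uses the `\W+(?:\w+\W+){0,n}` shape that also appears in this thread.

```ruby
# Character-based gap: at most 10 characters between the two words.
char_gap = /hello.{0,10}world/im

# Word-based gap: at most 10 intervening words between the two words.
word_gap = /hello\W+(?:\w+\W+){0,10}world/im

# Illustrative sentence: 7 words (30 characters) separate "hello" and "world".
text = "hello to each and every one in the world"

p text =~ char_gap  # => nil (more than 10 characters apart)
p text =~ word_gap  # => 0   (only 7 words apart)
```

Which of the two counts as "proximity" depends on the requirement; the original question asked for a distance in words.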

sclarke · December 6, 2008, 2:37pm

On Sat, Dec 6, 2008 at 2:06 PM, Michael F. wrote:

File::read("test.data").scan %r{Hello.{0,10}World}im

That would be 10 characters

Huh??

cat test.data && echo "=============================" && cat scan.rb && ./scan.rb
Hello World hello **

world

hello do not find this world

#!/usr/local/bin/ruby
# vim: sw=2 ts=2 ft=ruby expandtab tw=0 nu syn=on:
# file: scan.rb

p File::read("test.data").scan( %r<Hello.{0,10}World>im )

["Hello World", "hello *\n world"]

R.

sclarke · December 6, 2008, 3:01pm

Robert D. wrote:

File::read("test.data").scan %r{Hello.{0,10}World}im

Another nice idea. And another mainly untested improvement:

module StringProximitySearch
  def proximity_search(w1, w2, distance=0)
    self =~
      %r{#{Regexp.escape(w1)}\W+(?:\w+\W+){0,#{distance}}#{Regexp.escape(w2)}}im
  end
end

Regards
Stefan

sclarke · December 6, 2008, 2:50pm

Hi there

Thanks to Robert we have something to build on. If you only need to
know whether w1 and w2 occur within the given distance in the text or
not (true/false), then we can improve his algorithm and exclude the
O(n^2) part:

module StringProximitySearch
  def proximity_search(w1, w2, distance=1)
    w1 = w1.downcase
    w2 = w2.downcase
    # Sentinels far enough below zero that a word not yet seen never counts
    # as near (the + 1 keeps this true for distance = 0 as well).
    idx1 = -(distance * 2 + 1)
    idx2 = -(distance * 2 + 1)

    # single pass: only the most recent position of each word is kept
    scan(/\w+/).each_with_index do |w,idx|
      case w.downcase
      when w1
        idx1 = idx
        return true if idx1-idx2 <= distance
      when w2
        idx2 = idx
        return true if idx2-idx1 <= distance
      end
    end

    false
  end
end

class String
  include StringProximitySearch
end

your_test_string.proximity_search(w1, w2, distance)

NB: This code is untested.

Regards
Stefan

sclarke · December 7, 2008, 12:58pm

On Sun, Dec 7, 2008 at 12:08 PM, Robert K. wrote:

I guess Michael’s point was that your version is not a scaled down version
of mine because it defines “proximity” differently: your version uses
character count while my version uses word count.

Use words, I see. If one uses words, I can understand that.

Actually I was just wondering whether we should keep track of the
positions we found. What good would a search do if we cannot go there
:(.

Another difference is that in your case order matters. Here’s an attempt to

Does it not? Hmm, right, in a search engine it indeed does not, agreed…

}mix

# your_regexp stands for the %r{...}mix pattern quoted above
r = []
s.scan( your_regexp ) do |m| r << [$`.size, m] end
r

R

sclarke · December 7, 2008, 12:15pm

On 06.12.2008 14:30, Robert D. wrote:

On Sat, Dec 6, 2008 at 2:06 PM, Michael F. wrote:

File::read("test.data").scan %r{Hello.{0,10}World}im

That would be 10 characters

Huh??

I guess Michael’s point was that your version is not a scaled down
version of mine because it defines “proximity” differently: your version
uses character count while my version uses word count.

Another difference is that in your case order matters. Here’s an
attempt to fix both:

# untested

s.scan %r{
  \b
  (?:
    hello\W+(?:\w+\W+){0,10}world
  | world\W+(?:\w+\W+){0,10}hello
  )
  \b
}mix

Cheers

robert
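
Combining the order-independent pattern above with the position tracking suggested earlier in the thread, a minimal sketch could look like this. The method name `proximity_matches` is hypothetical, and "distance" here means the maximum number of intervening words, which is the regex-based definition rather than the index-difference definition of the first solution.

```ruby
# Hypothetical helper: returns [character_offset, matched_text] pairs for
# every place where w1 and w2 occur, in either order, with at most
# `distance` words between them.
def proximity_matches(text, w1, w2, distance)
  a = Regexp.escape(w1)
  b = Regexp.escape(w2)
  gap = "\\W+(?:\\w+\\W+){0,#{distance}}"
  re = /\b(?:#{a}#{gap}#{b}|#{b}#{gap}#{a})\b/im

  results = []
  # The pattern has no capture groups, so scan sets the last match to the
  # whole match for each iteration of the block.
  text.scan(re) do
    m = Regexp.last_match
    results << [m.begin(0), m[0]]
  end
  results
end

p proximity_matches("say hello to the world", "world", "hello", 2)
# => [[4, "hello to the world"]]
```

Note that `scan` returns non-overlapping matches, so a word consumed by one match is not reused for the next one.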

sclarke · December 8, 2008, 3:29pm

hi!

Robert K. [2008-12-07 13:23]:

I played a bit around and came up with a more involved version
which works with arbitrary numbers of words but still fits into a
few lines of code.

here’s my take on the task: Poor Man’s Search

http://github.com/blackwinter/pms

it’s a different approach with support for boolean operators as well
as - back to topic! - proximity operators. pms builds an index for
the input documents and stores the token positions along with the
document numbers. still pretty rough, but working…

text = <<EOT
Hello world, how are you today? I said "Hello"
to the other guy but he would not answer although
all the world could hear me.
EOT

proximity_search text, 3, "hello", "World" do |a,b|
  printf "Found at char positions %3d and %3d\n", a, b
end

require 'pms/ext'

search = text.search('hello').near('world', 3)

p search.results
#=> [0]

p search.results_with_positions
#=> {0=>[0, 8]}

p search.matches
#=> ["Hello world, how are you today? I said \"Hello\"\n"]

cheers
jens

sclarke · December 7, 2008, 1:30pm

I played a bit around and came up with a more involved version which
works with arbitrary numbers of words but still fits into a few lines of
code. Another advantage is that it identifies character positions with
matches in the text. Note that this allows a maximum distance between
the first and the last word of (distance * (words - 1)). It could be
modified pretty easily (with an additional method in Array) to keep
all words within distance.

Have fun

robert

#!/bin/env ruby

# search words in arbitrary order where
# pairs of words have a max distance between
# them

ProximitySearchData = Struct.new :word, :wpos, :spos

def proximity_search(text, distance, *words)
  sdata = words.map {|w| ProximitySearchData.new w.downcase}

  wpos = 0

  text.scan %r{\w+}i do |match|
    match = match.downcase
    pos = $`.length

    # update the entry for the matching word; each returns :change via break
    sdata.each do |sd|
      if sd.word == match
        sd.spos = pos
        sd.wpos = wpos
        break :change
      end
    end == :change and
      sdata.all? {|sd| sd.wpos} and
      sdata.
        sort_by {|sd| sd.wpos}.
        each_cons(2).
        all? {|sd1,sd2| sd2.wpos - sd1.wpos <= distance} and
      yield(*sdata.map {|sd| sd.spos})

    wpos += 1
  end
end

text = <<EOT
Hello world, how are you today? I said "Hello"
to the other guy but he would not answer although
all the world could hear me.
EOT

proximity_search text, 3, "hello", "World" do |a,b|
  printf "Found at char positions %3d and %3d\n", a, b
end
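
The modification mentioned in the post, keeping all words within the distance rather than just neighbouring pairs, might be sketched as follows. This is not Robert's code: the method name `proximity_search_span` and the Hash-based bookkeeping are assumptions for illustration, and it returns the hits instead of yielding them.

```ruby
# Hypothetical variant: the span between the first and the last of the
# searched words must not exceed `distance` word positions.
def proximity_search_span(text, distance, *words)
  wanted = words.map { |w| w.downcase }
  last_seen = {}   # word => [word_position, character_offset]
  hits = []
  wpos = 0

  text.scan(/\w+/) do |match|
    word = match.downcase
    if wanted.include?(word)
      last_seen[word] = [wpos, Regexp.last_match.begin(0)]
      if last_seen.size == wanted.uniq.size
        positions = wanted.map { |w| last_seen[w][0] }
        if positions.max - positions.min <= distance
          hits << wanted.map { |w| last_seen[w][1] }
        end
      end
    end
    wpos += 1
  end

  hits
end

p proximity_search_span("alpha beta gamma delta", 2, "alpha", "gamma", "delta")
# => []
p proximity_search_span("alpha beta gamma delta", 3, "alpha", "gamma", "delta")
# => [[0, 11, 17]]
```

With a distance of 2 the pairwise check would accept this input (gaps of 2 and 1), while the span check rejects it because the first and last words are 3 positions apart.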

sclarke · December 8, 2008, 4:12pm

hi robert!

Robert K. [2008-12-08 15:50]:

2008/12/8 Jens W.:

here’s my take on the task: Poor Man’s Search

I like that name.

thx

But I have a remark from an API point of view: I’d probably
separate the search specification from the index building. If I
read your code properly then you create the search criterion from
the input text. If you use that approach the index will be
recreated for every other search. It’s probably cleaner and also
more efficient if query creation, index creation and search are
separated.

well, they are (index creation and search, at least). that’s why i
considered writing this little lib in the first place. i didn’t like
the fact that the other examples had to do all the hard work over
and over again. so i just build an index (once) that you can operate
on as many times as you want. it’s not clear from the example though
because i showed the “nice” version. the more ugly one (but surely
more efficient one, you’re completely right there) goes as follows:

require 'pms'

# build the index - once

pms = PMS.new(text)

# perform searches - many

pms.search('hello').near('world', 3)
pms.search('you').or('guy')
pms.search(/wo.*d/).not { |q| q.search('you').or('guy') }

i’m not sure in which way to drive this little effort further, if at
all, but it was definitely fun to write…

cheers
jens
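
The "build the index once, search many times" idea can be illustrated without pms. The sketch below is not the pms API: the class name `TinyPositionIndex` and its methods are hypothetical, and it only covers a two-word proximity check on a single text.

```ruby
# Hypothetical minimal positional index: maps each lowercased word to the
# sorted list of word positions at which it occurs.
class TinyPositionIndex
  def initialize(text)
    @positions = Hash.new { |h, k| h[k] = [] }
    text.scan(/\w+/).each_with_index do |word, idx|
      @positions[word.downcase] << idx
    end
  end

  def positions(word)
    @positions.fetch(word.downcase, [])
  end

  # true if w1 and w2 occur within `distance` word positions of each other;
  # walks both sorted position lists once instead of comparing all pairs
  def near?(w1, w2, distance)
    a = positions(w1)
    b = positions(w2)
    i = j = 0
    while i < a.size && j < b.size
      return true if (a[i] - b[j]).abs <= distance
      if a[i] < b[j]
        i += 1
      else
        j += 1
      end
    end
    false
  end
end

text = <<EOT
Hello world, how are you today? I said "Hello"
to the other guy but he would not answer although
all the world could hear me.
EOT

index = TinyPositionIndex.new(text)    # built once
p index.near?("hello", "world", 3)     # => true
p index.near?("guy", "world", 3)       # => false
```

Each query only touches the position lists of the two words, so repeated searches do not rescan the text.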

sclarke · December 8, 2008, 3:57pm

2008/12/8 Jens W.:

here’s my take on the task: Poor Man’s Search

I like that name.

all the world could hear me.
EOT

proximity_search text, 3, "hello", "World" do |a,b|
  printf "Found at char positions %3d and %3d\n", a, b
end

require 'pms/ext'

search = text.search('hello').near('world', 3)

Nice! I like the idea of creating the search criteria by calling methods.

But I have a remark from an API point of view: I’d probably separate
the search specification from the index building. If I read your code
properly then you create the search criterion from the input text. If
you use that approach the index will be recreated for every other
search. It’s probably cleaner and also more efficient if query
creation, index creation and search are separated.

Kind regards

robert