Does Ruby have the ability to perform proximity searches on data. For
example find the word “hello” and the word “world” within 10 words of
eachother and print out some data?
Thanks a lot
Does Ruby have the ability to perform proximity searches on data. For
example find the word “hello” and the word “world” within 10 words of
eachother and print out some data?
Thanks a lot
No proximity searches with 1.8… you would need a full fledged text
search engine such as lucerne, sphinx for that type of capability
ilan
Stuart C. wrote:
Does Ruby have the ability to perform proximity searches on data. For
example find the word “hello” and the word “world” within 10 words of
eachother and print out some data?Thanks a lot
On 06.12.2008 00:29, Ilan B. wrote:
No proximity searches with 1.8… you would need a full fledged text
search engine such as lucerne, sphinx for that type of capability
Or, depending on requirements (especially performance) cook your own.
def proximity_search(text, distance, w1, w2)
w1 = w1.downcase
w2 = w2.downcase
index1 = []
index2 = []
text.scan(/\w+/).each_with_index do |w,idx|
case
when w1 == w.downcase
index1 << idx
when w2 == w.downcase
index2 << idx
end
end
found = false
index1.each do |i1|
index2.each do |i2|
if (i1 - i2).abs <= distance
found = true
yield i1, i2 if block_given?
end
end
end
found
end
text = <<EOT
Hello world, how are you today? I said “Hello”
to the other guy but he would not answer although
all the world could hear me.
EOT
proximity_search text, 3, “hello”, “World” do |a,b|
printf “Found at positions %3d and %3d\n”, a, b
end
p proximity_search(text, 3, “hello”, “World”)
Cheers
robert
On Sat, Dec 6, 2008 at 1:18 PM, Robert K.
[email protected] wrote:
Hmm
maybe you can scale this down
File::read(“test.data”).scan %r{Hello.{0,10}World}im
Should work fine for reasonable file sizes ( < Available Memory )
HTH
Robert
–
Il computer non è una macchina intelligente che aiuta le persone
stupide, anzi, è una macchina stupida che funziona solo nelle mani
delle persone intelligenti.
Computers are not smart to help stupid people, rather they are stupid
and will work only if taken care of by smart people.
Umberto Eco
File::read(“test.data”).scan %r{Hello.{0,10}World}im
That would be 10 characters
^ manveru
On Sat, Dec 6, 2008 at 2:06 PM, Michael F.
[email protected] wrote:
File::read(“test.data”).scan %r{Hello.{0,10}World}im
That would be 10 characters
Huh??
cat test.data && echo “=============================” && cat scan.rb
&& ./scan.rb
Hello World hello **
#!/usr/local/bin/ruby
p File::read(“test.data”).scan( %r<Hello.{0,10}World>im )
[“Hello World”, “hello *\n world”]
R.
Robert D. wrote:
File::read(“test.data”).scan %r{Hello.{0,10}World}im
Another nice idea. And another mainly untested improvement:
module StringProximitySearch
def proximity_search(w1, w2, distance=0)
self =~
%r{#{Regexp.escape(w1)}\W+(?:\w+\W+){0,#{distance}}#{Regexp.escape(w2)}}im
end
end
Regards
Stefan
Hi there
Thanks to Robert we have something to build up on. If you only need to
know whether w1 and w2 occur in the given distance within the text or
not (true/false) then we can improve his algorithm and exclude the
O(n^2) part:
Robert K. wrote:
def proximity_search(text, distance, w1, w2)
w1 = w1.downcase
w2 = w2.downcase
index1 = []
index2 = []text.scan(/\w+/).each_with_index do |w,idx|
case
when w1 == w.downcase
index1 << idx
when w2 == w.downcase
index2 << idx
end
endfound = false
index1.each do |i1|
index2.each do |i2|
if (i1 - i2).abs <= distance
found = true
yield i1, i2 if block_given?
end
end
endfound
end
module StringProximitySearch
def proximity_search(w1, w2, distance=1)
w1 = w1.downcase
w2 = w2.downcase
idx1 = -(distance2)
idx2 = -(distance2)
scan(/\w+/).each_with_index do |w,idx|
case w.downcase
when w1
idx1 = idx
return true if idx1-idx2 <= distance
when w2 then index2 << idx
idx2 = idx
return true if idx2-idx1 <= distance
end
end
false
end
end
class String
include StringProximitySearch
end
your_test_string.proximity_search(w1, w2, distance)
NB: This code is untested.
Regards
Stefan
On Sun, Dec 7, 2008 at 12:08 PM, Robert K.
[email protected] wrote:
I guess Michael’s point was that your version is not a scaled down version
of mine because it defines “proximity” differently: your version uses
character count while my version uses word count.
Use words, I see, if one uses words I can understand that
Actually I was just wondering if we should not keep track of the
positions we found, what good would a search do if we cannot go there
:(.
Another difference is that in your case order matters. Here’s an attempt to
Does it not? Hmm right in a search engine it indeed does not, agreed…
}mix
r = []
s.scan( your stuff ) do |m| r << [$`.size, m] end
r
R
On 06.12.2008 14:30, Robert D. wrote:
On Sat, Dec 6, 2008 at 2:06 PM, Michael F. [email protected] wrote:
File::read(“test.data”).scan %r{Hello.{0,10}World}im
That would be 10 characters
Huh??
I guess Michael’s point was that your version is not a scaled down
version of mine because it defines “proximity” differently: your version
uses character count while my version uses word count.
Another difference is that in your case order matters. Here’s an
attempt to fix both:
s.scan %r{
\b
(?:
hello\W+(?:\w+\W+){0,10}world
| world\W+(?:\w+\W+){0,10}hello
)
\b
}mix
Cheers
robert
hi!
Robert K. [2008-12-07 13:23]:
I played a bit around and came up with a more involved version
which works with arbitrary numbers of words but still fits into a
few lines of code.
here’s my take on the task: Poor Man’s Search
http://github.com/blackwinter/pms
it’s a different approach with support for boolean operators as well
as - back to topic! - proximity operators. pms builds an index for
the input documents and stores the token positions along with the
document numbers. still pretty rough, but working…
text = <<EOT
Hello world, how are you today? I said “Hello”
to the other guy but he would not answer although
all the world could hear me.
EOTproximity_search text, 3, “hello”, “World” do |a,b|
printf “Found at char positions %3d and %3d\n”, a, b
end
require ‘pms/ext’
search = text.search(‘hello’).near(‘world’, 3)
p search.results
#=> [0]
p search.results_with_positions
#=> {0=>[0, 8]}
p search.matches
#=> [“Hello world, how are you today? I said "Hello"\n”]
cheers
jens
I played a bit around and came up with a more involved version which
works with arbitrary numbers of words but still fits into a few lines of
code. Another advantage is that it identifies character positions with
matches in the text. Note that this allows a maximum distance between
the first and the last word of (distance * (words - 1)). It could be
modified pretty easily (with an additional method in Array) to keep
all words within distance.
Have fun
robert
#!/bin/env ruby
ProximitySearchData = Struct.new :word, :wpos, :spos
def proximity_search(text, distance, *words)
sdata = words.map {|w| ProximitySearchData.new w.downcase}
wpos = 0
text.scan %r{\w+}i do |match|
match = match.downcase
pos = $`.length
sdata.each do |sd|
if sd.word == match
sd.spos = pos
sd.wpos = wpos
break :change
end
end == :change and
sdata.all? {|sd| sd.wpos} and
sdata.
sort_by {|sd| sd.wpos}.
each_cons(2).
all? {|sd1,sd2| sd2.wpos - sd1.wpos <= distance} and
yield *sdata.map {|sd| sd.spos}
wpos += 1
end
end
text = <<EOT
Hello world, how are you today? I said “Hello”
to the other guy but he would not answer although
all the world could hear me.
EOT
proximity_search text, 3, “hello”, “World” do |a,b|
printf “Found at char positions %3d and %3d\n”, a, b
end
hi robert!
Robert K. [2008-12-08 15:50]:
2008/12/8 Jens W. [email protected]:
here’s my take on the task: Poor Man’s Search
I like that name.
thx
But I have a remark from an API point of view: I’d probably
separate the search specification from the index building. If I
read your code properly then you create the search criterion from
the input text. If you use that approach the index will be
recreated for every other search. It’s probably cleaner and also
more efficient if query creation, index creation and search are
separated.
well, they are (index creation and search, at least). that’s why i
considered writing this little lib in the first place. i didn’t like
the fact that the other examples had to do all the hard work over
and over again. so i just build an index (once) that you can operate
on as many times as you want. it’s not clear from the example though
because i showed the “nice” version. the more ugly one (but surely
more efficient one, you’re completely right there) goes as follows:
require ‘pms’
pms = PMS.new(text)
pms.search(‘hello’).near(‘world’, 3)
pms.search(‘you’).or(‘guy’)
pms.search(/wo.*d/).not { |q| q.search(‘you’).or(‘guy’) }
i’m not sure in which way to drive this little effort further, if at
all, but it was definitely fun to write…
cheers
jens
2008/12/8 Jens W. [email protected]:
here’s my take on the task: Poor Man’s Search
I like that name.
all the world could hear me.
EOTproximity_search text, 3, “hello”, “World” do |a,b|
printf “Found at char positions %3d and %3d\n”, a, b
endrequire ‘pms/ext’
search = text.search(‘hello’).near(‘world’, 3)
Nice! I like the idea to create the search criteria by calling methods.
But I have a remark from an API point of view: I’d probably separate
the search specification from the index building. If I read your code
properly then you create the search criterion from the input text. If
you use that approach the index will be recreated for every other
search. It’s probably cleaner and also more efficient if query
creation, index creation and search are separated.
Kind regards
robert
This forum is not affiliated to the Ruby language, Ruby on Rails framework, nor any Ruby applications discussed here.
Sponsor our Newsletter | Privacy Policy | Terms of Service | Remote Ruby Jobs