Tough Ruby Homework

I’m trying to take a long piece of text, find a word, and get that word
and the 3 words on either side of it and put that new “string” into
another variable.

Example:

I have a sentence like “Robert likes green beans, girls with moustaches,
and teddy bears. John thinks Robert is strange”. I am searching for
the word “Robert”, so I want to return the following:

[“Robert likes green”, “bears. John thinks Robert is strange.”]
(doesn’t have to be in an array, but you get the idea)

I obviously use index to get the places where “Robert” can be found, but
any suggestion on how to do the rest?

Bonus points: if you can do the same thing for multiple words…back to
the example, but search for “green AND teddy”…you’d get:

[“Robert likes green beans, girls”, “with moustaches, and teddy bears.
John thinks”] as a result.

I’m posting this because I couldn’t seem to find an easy way to do it…

…if it’s homework, why are you simply asking us?

it would be better to write it and ask “how can this be improved?”
than not and ask “how can this be done?”

but that’s just my opinion…
hex

On Wed, Sep 14, 2011 at 9:05 AM, Rory P. wrote:

[“Robert likes green”, “bears. John thinks Robert is strange.”]

I’m posting this because I couldn’t seem to find an easy way to do it…



Check out String#split
(http://rdoc.info/stdlib/core/1.9.2/String#split-instance_method). That
should help you get it into an array, which should be a lot easier to
work with.
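
As a rough sketch of that first step (the variable names here are only for illustration), splitting on whitespace gives an array of tokens, and the positions of the search word can then be collected by index. Note that punctuation stays attached to the tokens, so "beans," would not equal "beans".

```ruby
sentence = "Robert likes green beans, girls with moustaches, and teddy bears. John thinks Robert is strange"

# With no argument, split breaks the string on runs of whitespace.
words = sentence.split

# Collect every index where the token equals the search word exactly.
hits = words.each_index.select { |i| words[i] == "Robert" }

p words.first(4)  # => ["Robert", "likes", "green", "beans,"]
p hits            # => [0, 12]
```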

.serialhex … wrote in post #1021921:

…if it’s homework, why are you simply asking us?

it would be better to write it and ask “how can this be improved?”
than not and ask “how can this be done?”

but that’s just my opinion…
hex

Not a school homework. It’s a work homework. Thanks

On Sep 14, 2011, at 9:27 AM, Rory P. wrote:

.serialhex … wrote in post #1021921:

…if it’s homework, why are you simply asking us?

it would be better to write it and ask “how can this be improved?”
than not and ask “how can this be done?”

but that’s just my opinion…
hex

Not a school homework. It’s a work homework. Thanks

Still. Better to try your hand at something than to just copy what
someone else says.

–Mark

Check out String#split
(http://rdoc.info/stdlib/core/1.9.2/String#split-instance_method). That
should help you get it into an array, which should be a lot easier to
work with.

Thanks Josh

Mark H. Nichols wrote in post #1021929:

On Sep 14, 2011, at 9:27 AM, Rory P. wrote:

.serialhex … wrote in post #1021921:

…if it’s homework, why are you simply asking us?

it would be better to write it and ask “how can this be improved?”
than not and ask “how can this be done?”

but that’s just my opinion…
hex

Not a school homework. It’s a work homework. Thanks

Still. Better to try your hand at something than to just copy what
someone else says.

–Mark

Dude, if you’re not going to help, why respond to this post? Simply
ignore and move on rather than be an @hole

I’m trying to take a long piece of text, find a word, and get that word
and the 3 words on either side of it and put that new “string” into
another variable.

I don’t know what work homework means.
But, I learn something from these things and maybe someone else will,
too.
So, here is a step towards the first part.
If it is wrong, you can fix it.

str = "If Robert likes green beans, girls with mustaches, and teddy bears, John thinks Robert is strange."

f, g = str.split(/\s+/), "Robert"
# Find every index where the word occurs, then take up to 3 words on each side.
windows = (0...f.size).select { |x| f[x] == g }
                      .map { |y| f[(y - [3, y].min)..(y + 3)] }
p windows

#> [["If", "Robert", "likes", "green", "beans,"],
#>  ["bears,", "John", "thinks", "Robert", "is", "strange."]]

Harry

On Wed, Sep 14, 2011 at 11:27:08PM +0900, Rory P. wrote:

Not a school homework. It’s a work homework. Thanks

I think that makes no difference. What have you done first to attempt
to solve this?

On Wed, Sep 14, 2011 at 10:27 AM, Rory P. wrote:

.serialhex … wrote in post #1021921:

…if it’s homework, why are you simply asking us?

Not a school homework. It’s a work homework. Thanks

ahh i see, my bad, sorry about that :)
hex

Darryl Pierce wrote in post #1021946:

On Wed, Sep 14, 2011 at 11:27:08PM +0900, Rory P. wrote:

Not a school homework. It’s a work homework. Thanks

I think that makes no difference. What have you done first to attempt
to solve this?

I know… I know… My post sounded like I didn’t even try because I went
straight to the question without even explaining what I did. Well
here’s what I did.

#!/usr/bin/env ruby
string = "The quick brown fox jumped over the lazy dog"

def get_subsection(word, sentence)
  sentence.scan(Regexp.new(word))
end

puts get_subsection("quick", string)
puts get_subsection("lazy", string)
puts get_subsection("fox", string)
puts get_subsection("dog", string)

I just need the REGEX portion to select the three words on both sides
and right now I’m struggling with it. I tried looking at “lookaround”
but it got me even more confused.

It might help to note that an array is an Enumerable and that Enumerable
gives you each_cons, which yields overlapping windows (each_slice would
give non-overlapping chunks). So if you really mean a word and the 3
words on either side of it, that’s 7 words, and if you take windows of 7
elements, you can examine each one to see if its middle item is your
word.
m.
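
A minimal sketch of the window idea, using each_cons so that the windows overlap (each_slice would cut the list into non-overlapping chunks and miss most positions). The helper name context_windows and the nil padding are assumptions for illustration, not something posted in the thread; the padding lets words near the start or end of the text still land in the middle of a window.

```ruby
# Hypothetical helper: returns the target word with up to `radius` words on each side.
def context_windows(text, target, radius = 3)
  words = text.split
  # Pad both ends with nil so edge words can still sit in the middle slot.
  padded = [nil] * radius + words + [nil] * radius
  padded.each_cons(2 * radius + 1)
        .select { |window| window[radius] == target }  # middle item is the word
        .map { |window| window.compact.join(" ") }     # drop the nil padding
end

sentence = "Robert likes green beans, girls with moustaches, and teddy bears. John thinks Robert is strange"
p context_windows(sentence, "Robert")
# => ["Robert likes green beans,", "bears. John thinks Robert is strange"]
```

The comparison is an exact token match, so a word followed by punctuation ("bears.") would need to be normalised first.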

Thanks to all who tried to help. Here’s the final answer.

#!/usr/bin/env ruby
string = "The quick brown fox jumped over the lazy dog"

def get_subsection(word, sentence)
  # Match the search word with up to three words on either side of it.
  sentence.scan(/(?:\W{0,1}\w+\W){0,3}#{Regexp.escape(word)}(?:\W{1}\w+){0,3}/)
end

puts get_subsection("quick", string)
puts get_subsection("lazy", string)
puts get_subsection("fox", string)
puts get_subsection("dog", string)

The regex in the middle of the syntax is where I struggled, but with a
little bit of help from the guru I was able to solve the problem. I’m a
Ruby newbie ;)

Thanks Harry, I’ll try that.

On 15/09/2011 19:56, Pascua 9804 wrote:

puts get_subsection("lazy", string)
puts get_subsection("fox", string)
puts get_subsection("dog", string)

The regex in the middle of the syntax is where I struggled, but with a
little bit of help from the guru I was able to solve the problem.

I think that is a maintenance nightmare!

As Jamie Zawinski said - Some people, when confronted with a problem,
think “I know, I’ll use regular expressions.” Now they have two
problems.

For large source texts it will be horribly slow, and memory hungry, and
for large search lists it will slow down even more. Huge, slow and hard
to maintain = not good.

What the OP wanted was a sequence of 7 words, where the 4th is the word
sought, and the string can be missing words “before” or “after” the
source string.

So you need two parallel lists of strings. The first is a list of
tokens from the source, where each token is separated from the next by
white-space. The second is a list of words, created from the tokens by
removing punctuation.

Slide through the source, a token at a time, and if the fourth word of
the word list is one of the ones you want, use the token list to
reconstruct the fragment of the source (without newlines) and emit the
result.

In order to handle the start-up and close-down properly, I would
consider preloading the token list with null strings, and arrange the
“get next token” function to return three null strings after end of
file, before signalling the end.
However there are other methods.

This is one pass, so you don’t need the source all in memory. It will be
O(source size) in time and O(number of words sought) in space. Fast,
compact, and easy to alter the rule or length of the lists.
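
A rough sketch of that one-pass approach, under some assumptions of mine: the method name each_context is hypothetical, a single window of raw tokens is kept and the punctuation-free word is derived on demand (rather than keeping two parallel lists), and nil stands in for the "null strings" used for start-up and close-down.

```ruby
require "set"

# Hypothetical helper: `io` is anything that responds to each_line
# (a File, a StringIO, or a plain String).
def each_context(io, wanted, radius = 3)
  return to_enum(:each_context, io, wanted, radius) unless block_given?

  wanted = wanted.to_set
  tokens = Array.new(2 * radius + 1) # sliding window, preloaded with nil

  # Slide the window by one token and test the middle slot.
  step = lambda do |token|
    tokens.shift
    tokens.push(token)
    middle = tokens[radius]
    # Compare the middle token with its punctuation stripped.
    yield tokens.compact.join(" ") if middle && wanted.include?(middle.gsub(/\W/, ""))
  end

  io.each_line { |line| line.split.each(&step) }
  radius.times { step.call(nil) } # flush so the last words reach the middle
end

sentence = "Robert likes green beans, girls with moustaches, and teddy bears. John thinks Robert is strange"
each_context(sentence, ["green", "teddy"]) { |fragment| p fragment }
```

Only the current window is held in memory, so the source can be read line by line from a file instead of a string.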

Regards

Ian

On Thu, Sep 15, 2011 at 10:32 PM, Ian H. wrote:

As Jamie Zawinski said - Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.” Now they have two problems.

Nah.

For large source texts it will be horribly slow, and memory hungry, and for
large search lists it will slow down even more. Huge, slow and hard to
maintain = not good.

That entirely depends on the problem to solve and the approach with
regexp chosen.

What the OP wanted was a sequence of 7 words, where the 4th is the word
sought, and the string can be missing words “before” or “after” the source
string.

So you need two parallel lists of strings. The first is a list of tokens
from the source, where each token is separated from the next by white-space.
The second is a list of words, created from the tokens by removing punctuation.

I’d work with a single list of words and non-words interleaved. That
should make generation of the combined matching sequence easier.
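
One way that single interleaved list might look, as a sketch only (the helper name contexts_from_interleaved is made up for illustration). Because words and separators alternate, three words to either side is six list positions to either side, and joining the slice restores the original punctuation and spacing.

```ruby
# Hypothetical helper: one list of alternating word and non-word runs.
def contexts_from_interleaved(text, target, radius = 3)
  parts = text.scan(/\w+|\W+/) # e.g. ["Robert", " ", "likes", " ", ...]
  parts.each_index.select { |i| parts[i] == target }.map do |i|
    # Neighbouring words are two positions apart, so step 2 * radius each way.
    first = [i - 2 * radius, 0].max
    last  = [i + 2 * radius, parts.size - 1].min
    parts[first..last].join.strip
  end
end

sentence = "Robert likes green beans, girls with moustaches, and teddy bears. John thinks Robert is strange"
p contexts_from_interleaved(sentence, "teddy")
# => ["with moustaches, and teddy bears. John thinks"]
```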

This is one pass, so you don’t need the source all in memory. It will be
O(source size) in time and O(number of words sought) in space.
Fast, compact, and easy to alter the rule or length of the lists.

I find this simpler:

def word_scan(s, *words)
  return to_enum(:word_scan, s, *words) unless block_given?
  return if words.empty?

  s.scan(/\b#{Regexp.union(words)}\b/) do |wd|
    pre = $`
    post = $'
    # Up to three words before the match and up to three after it.
    yield pre[/(?:\w+\W+){0,3}\z/] + wd + post[/\A(?:\W+\w+){0,3}/]
  end
end

s = "Robert likes green beans, girls with moustaches, and teddy bears. John thinks Robert is strange"

puts 1
word_scan(s, "Robert") { |m| p m }

puts 2
word_scan(s, "green", "teddy") { |m| p m }
p word_scan(s, "green", "teddy").to_a

You can use it with or without a block, following the idiom of returning
an Enumerator when no block is given. However, for really large inputs
your approach is likely better.

Kind regards

robert

Bonus points: if you can do the same thing for multiple words…back to
the example, but search for “green AND teddy”…you’d get:

[“Robert likes green beans, girls”, “with moustaches, and teddy bears.
John thinks”] as a result.

Is this what you want with multiple words?

astring = "Robert likes green beans, girls with mustaches, and teddy bears. John thinks Robert is strange."

def my_get(str, num, substrings)
  f, g, n = str.delete(",.").split(/\s+/), substrings, num
  s = f.size
  # Index of each matching word -> clamped index range -> slice of words.
  (0...s).select { |x| g.include?(f[x]) }
         .map { |y| ([0, y - n].max..[y + n, s - 1].min) }
         .map { |z| f[z] }
end

p my_get(astring, 3, ["Robert", "and", "teddy", "bears", "strange"])

Output

#> [["Robert", "likes", "green", "beans"],
#>  ["girls", "with", "mustaches", "and", "teddy", "bears", "John"],
#>  ["with", "mustaches", "and", "teddy", "bears", "John", "thinks"],
#>  ["mustaches", "and", "teddy", "bears", "John", "thinks", "Robert"],
#>  ["bears", "John", "thinks", "Robert", "is", "strange"],
#>  ["thinks", "Robert", "is", "strange"]]

Harry

Actually, I guess this makes a little more sense and is a little faster.

def my_get(str, num, substrings)
  f, g, n = str.delete(",.").split(/\s+/), substrings, num
  s = f.size
  # Slice the word list directly instead of building ranges first.
  (0...s).select { |x| g.include?(f[x]) }
         .map { |y| f[([0, y - n].max..[y + n, s - 1].min)] }
end

astring = "Robert likes green beans, girls with mustaches, and teddy bears. John thinks Robert is strange."

p my_get(astring, 3, ["Robert", "and", "teddy", "bears", "strange"])

Harry