Confused regarding text example

Hey,

So as I am reviewing Ruby I came across the old find strings with xyz
pattern.

The code is as follows:

class WordIndex
  def initialize
    @index = {}
  end

  def add_to_index(obj, *phrases)
    phrases.each do |phrase|
      phrase.scan(/\w[\w']+/) do |word| # extract each word
        word.downcase!
        @index[word] = [] if @index[word].nil?
        @index[word].push(obj)
      end
    end
  end

  def lookup(word)
    @index[word.downcase]
  end
end

So here is my understanding:

phrases contains the phrases/patterns to look for. |phrase| is replaced
by each of the phrases to check for.

After that, I become a bit confused. I know they are looking for the
phrase in each of the words (/\w[\w']+/) but how does that work?

How is each string broken down into “words”?

Why is that done anyway (why not just find the pattern and move on?).

Also, what is obj exactly (other than an object)? How is it being
formed so it can be pushed into the stack?

Finally, what does index[word] represent? I am guessing a hash…


phrases contains the phrases/patterns to look for. |phrase| is replaced
by each of the phrases to check for.

The *phrases in the method signature says that the method takes a variable
number of arguments; inside the method you use them as an array of values.

phrases.each do |phrase|
  code
end

This piece of code defines a block (also known as a closure) that takes one
argument. The each method on an array calls this block once for every item
in the array, passing the item as the argument to the block.
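
To make the block idea concrete, here is a minimal sketch. The method name shout_all and the variable seen are made up for this illustration and are not part of the original class. It shows each calling the block once per element, and the block using a local variable from the surrounding method (which is what makes it a closure):

```ruby
# Hypothetical helper (not in the original post): collects what the block
# sees, to show that `each` runs the block once per array element.
def shout_all(*phrases)
  seen = []                   # local variable captured by the block
  phrases.each do |phrase|    # the block is called once for every phrase
    seen << phrase.upcase
  end
  seen
end

p shout_all("first phrase", "second one")
# prints ["FIRST PHRASE", "SECOND ONE"]
```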

After that, I become a bit confused. I know they are looking for the
phrase in each of the words (/\w[\w']+/) but how does that work?

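/\w[\w']+/ is shorthand for writing Regexp.new("\\w[\\w']+") (no slashes,
and the backslashes doubled inside a double-quoted string).
Called without a block, the scan method on a string returns an array of all
the matches of that regexp; called with a block, as it is here, it passes
each match to the block instead:
http://ruby-doc.org/core/classes/String.html#M000827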
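
A small sketch of both points, assuming the pattern uses a plain apostrophe. The constant names WORD and SAME and the helper words_in are invented for this example. Note that when the pattern goes through Regexp.new, the slashes are left out and the backslashes are doubled inside a double-quoted string:

```ruby
WORD = /\w[\w']+/                 # the literal form used in the class
SAME = Regexp.new("\\w[\\w']+")   # the same pattern built with Regexp.new

# Hypothetical helper: scan without a block returns the matches as an array.
def words_in(phrase)
  phrase.scan(WORD)
end

p WORD == SAME                    # prints true
p words_in("Don't panic")         # prints ["Don't", "panic"]

# With a block, scan passes each match to the block instead.
"Don't panic".scan(WORD) { |word| puts word.downcase }
```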

How is each string broken down into “words”?

Why is that done anyway (why not just find the pattern and move on?).

The code appears to be adding obj to a hash, keyed by the downcased version
of each word in each phrase.

Also, what is obj exactly (other than an object)? How is it being
formed so it can be pushed into the stack?

obj is just passed in; it could be anything. It could be the filename that
these words were pulled from… or whatever you would want to retrieve based
on the words in the phrases associated with it.

Finally, what does index[word] represent? I am guessing a hash…
See above, you are right!
@index = {}
is shorthand for @index = Hash.new
the same way @foo = [] is shorthand for @foo = Array.new
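
A quick sketch of that equivalence, plus the "create the array on first use, then push" step from the scan block. The helper name file_under is made up for this example; it only mirrors the two lines in add_to_index:

```ruby
# {} and Hash.new build the same empty hash; [] and Array.new the same array.
p(Hash.new == {})     # prints true
p(Array.new == [])    # prints true

# Hypothetical helper mirroring the two lines inside the scan block:
# create the array the first time a word is seen, then push obj onto it.
def file_under(index, word, obj)
  index[word] = [] if index[word].nil?
  index[word].push(obj)
end

index = {}
file_under(index, "cat", "one.txt")
file_under(index, "cat", "three.txt")
p index["cat"]        # prints ["one.txt", "three.txt"]
p index["dog"]        # prints nil, nothing was filed under "dog"
```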

Hope this helps. If I have made anything unclear please feel free to email
me.

Alright, I think I got it.

So basically, WordIndex’s obj could be a file, an array of strings, etc.

Phrases is basically a series of phrases (my mind just went blank on the
correct terminology).

Scan breaks down the series into an array containing the phrases.

The do |word| end loop is activated with each item in the array. The
string is modified and is turned into all lower case. Next, a check is
made to see if index[word] is empty. If so, then it’s turned into an
empty Hash.

Finally, the object is pushed into the Hash that had the wanted string.
Of course, if, say a file, had the same string three times…it would be
listed three times.

And the thing being “returned” is the @index[word] so to speak, correct?
That is, the way you can access the Hash.

On Fri, Sep 19, 2008 at 9:25 PM, Dave L. wrote:


So here is my understanding:

phrases contains the phrases/patterns to look for. |phrase| is replaced
by each of the phrases to check for.

It seems to me that the code is a way to classify objects by each word in
a set of phrases that are somehow related to the object. For example, the
object could be an HTML page and each phrase a part of the text on the
page, or a file and its contents, etc.

In the method, *phrases means that phrases will be an array containing all
the parameters passed to the method after obj. For example:

irb(main):001:0> def a(obj, *phrases)
irb(main):002:1> puts phrases.class
irb(main):003:1> p phrases
irb(main):004:1> end
=> nil
irb(main):005:0> a("an obj", "the first phrase", "the second one", "3rd one")
Array
["the first phrase", "the second one", "3rd one"]
=> nil

Then, the each method will call the block (do … end) passing each of
the phrases as the block parameter (phrase).

After that, I become a bit confused. I know they are looking for the
phrase in each of the words (/\w[\w']+/) but how does that work?

Scanning iterates through the phrase searching for the pattern and passes
each result to the block. The result is the section of the string that
matches. And what sections match this pattern? A word character followed by
one or more word characters or apostrophes. For example:

irb(main):019:0> re = /\w[\w']+/
=> /\w[\w']+/
irb(main):020:0> "this is a normal sentence".scan(re) {|x| p x}
"this"
"is"
"normal"
"sentence"

As you can see, the “a” has been skipped, since the regexp is asking
for at least two consecutive word letters, or a word letter and a '.
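
Here is a short sketch of the same behaviour with an apostrophe in the input. The constant and method names are made up for this example, and the second pattern (with * instead of +) is only a hypothetical variant to show what would change if one-letter words were wanted:

```ruby
TWO_OR_MORE = /\w[\w']+/   # the pattern from the class: at least two characters
ONE_OR_MORE = /\w[\w']*/   # hypothetical variant that also keeps one-letter words

# Hypothetical helper: the words the index would actually store.
def indexed_words(sentence)
  sentence.scan(TWO_OR_MORE)
end

p indexed_words("I don't have a cat")
# prints ["don't", "have", "cat"]

p "I don't have a cat".scan(ONE_OR_MORE)
# prints ["I", "don't", "have", "a", "cat"]
```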

How is each string broken down into “words”?

Scanning the sentence with that regexp does the splitting.

Why is that done anyway (why not just find the pattern and move on?).

Because what the code is doing is retrieving every word of two or more
characters so that the object can be indexed through that word. For each
word, it creates an array where it stores the object. This means that at
the end you have a hash with each word in every sentence as a key, and
whose value is an array containing all the objs related to that word.
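
To see the shape of that hash, here is a rough standalone sketch. The method build_index is made up for this example (the real class keeps the hash private in @index); it applies the same steps as add_to_index but returns the hash so it can be inspected:

```ruby
# Hypothetical standalone version of add_to_index that returns the hash.
# entries maps each object to the array of phrases associated with it.
def build_index(entries)
  index = {}
  entries.each do |obj, phrases|
    phrases.each do |phrase|
      phrase.scan(/\w[\w']+/) do |word|
        key = word.downcase
        index[key] = [] if index[key].nil?
        index[key].push(obj)
      end
    end
  end
  index
end

index = build_index("one.txt"   => ["I have a dog", "I have a cat"],
                    "three.txt" => ["I love my cat"])
p index.keys      # prints ["have", "dog", "cat", "love", "my"]
p index["cat"]    # prints ["one.txt", "three.txt"]
p index["have"]   # prints ["one.txt", "one.txt"]
```

Note that "have" lists one.txt twice: the object is pushed once per occurrence of the word, so repeated words give repeated entries.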

Also, what is obj exactly (other than an object)? How is it being
formed so it can be pushed into the stack?

Well, let’s imagine an example of how this method could be used:
I want to read all txt files in a folder and index the name of the file
based on each word in the file.

index = WordIndex.new
Dir["*.txt"].each do |file|
  lines = File.open(file) {|f| f.readlines.map {|l| l.chomp}}
  index.add_to_index(file, *lines)
end

Now we can locate all files that contain a word:

%w{cat dog}.each {|word| puts "Files that contain '#{word}': #{index.lookup(word).join(',')}"}

For example I have three files:

one.txt:
I have a dog
I have a cat

two.txt:
I have a dog, and nothing else.
I do have a car too.

three.txt:
I love my cat
I don’t love anything else

$ ruby wordindex.rb

Files that contain 'cat': one.txt,three.txt
Files that contain 'dog': one.txt,two.txt

So, you have indexed the contents of the files per word.

Finally, what does index[word] represent? I am guessing a hash…

@index is a hash. @index[word] is an array.

And btw, you can initialize a hash like this and remove a line:

@index = Hash.new {|h,k| h[k] = []}

and remove this line:

@index[word] = [] if @index[word].nil?
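
Putting that suggestion together, the class might look roughly like this. It is a sketch of the idea, not a drop-in replacement: the sample file name and phrases are made up, and the behaviour of lookup changes for words that were never indexed.

```ruby
class WordIndex
  def initialize
    # The block runs whenever a missing key is read: it stores and returns
    # a fresh empty array, so no nil check is needed before pushing.
    @index = Hash.new { |hash, key| hash[key] = [] }
  end

  def add_to_index(obj, *phrases)
    phrases.each do |phrase|
      phrase.scan(/\w[\w']+/) do |word|
        @index[word.downcase].push(obj)
      end
    end
  end

  def lookup(word)
    @index[word.downcase]
  end
end

index = WordIndex.new
index.add_to_index("one.txt", "I have a dog", "I have a cat")
p index.lookup("Dog")    # prints ["one.txt"]
p index.lookup("bird")   # prints [] where the original version returned nil
```

One thing to be aware of: with the default block, looking up an unknown word also stores an empty array under that key, so lookups are no longer read-only.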

Hope this helps,

Jesus.