Counting words

jamal86 · April 28, 2006, 7:47pm

I’ve researched this but am still having trouble getting it right …
Can someone give me code that counts the number of words in a string via
RegExp and MatchData objects? I think I’d like a word to be defined as
contiguous characters surrounded by white space (or the start/end of the
string), though am open to other interpretations.

Jamal

jamal86 · April 28, 2006, 7:50pm

Here is a naive implementation:

class String
  def words
    scan(/\b\S+\b/)
  end
end

'this is a sentence with some words'.words
=> ["this", "is", "a", "sentence", "with", "some", "words"]
'this is a sentence with some words'.words.size
=> 7

marcel

jamal86 · April 28, 2006, 7:50pm

I’m a bit of a nuby, and this is my first post to the list, but I
think the following one-liner will do the job:

number_of_words = string.split(/\s/).length

I haven’t tested it because I’m at work without access to a Ruby
interpreter :(.

jamal86 · April 28, 2006, 7:53pm

On 4/28/06, Bira wrote:

number_of_words = string.split(/\s/).length

Eh, sorry. I meant to write:

number_of_words = string.split(/\s+/).length

The “+” is needed to cover words with more than one whitespace
character between them.
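
One caveat with the regexp form of split, shown here as a minimal sketch: when the string starts with whitespace, split(/\s+/) returns an empty first element, so the count comes out one too high. The variable names and the sample string are made up for illustration and are not from the thread.

```ruby
# Hypothetical input with leading and trailing whitespace.
padded = "  these are  some words \n"

# Splitting on a regexp keeps an empty first field when the string
# starts with a separator; trailing empty fields are dropped.
with_regexp = padded.split(/\s+/)
# with_regexp => ["", "these", "are", "some", "words"]

# Stripping the string first avoids the empty field.
stripped = padded.strip.split(/\s+/)
# stripped => ["these", "are", "some", "words"]

puts with_regexp.length   # 5
puts stripped.length      # 4
```

Calling strip before split is one way around this; other approaches would work too.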

jamal86 · April 28, 2006, 8:46pm

s.scan(/\w+/).size
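
The scan patterns suggested in this thread stop agreeing once punctuation is involved. The sketch below compares /\S+/ (the whitespace-delimited definition from the original question), /\b\S+\b/ and /\w+/ on a made-up sample string; the variable names are only for illustration.

```ruby
# Made-up sample: an apostrophe, a free-standing dash, and a final period.
text = "don't stop - now."

# Whitespace-delimited runs, the definition in the original question.
by_space = text.scan(/\S+/)
# by_space => ["don't", "stop", "-", "now."]

# Runs of non-whitespace that start and end at a word boundary.
by_boundary = text.scan(/\b\S+\b/)
# by_boundary => ["don't", "stop", "now"]

# Runs of letters, digits and underscores only.
by_word_chars = text.scan(/\w+/)
# by_word_chars => ["don", "t", "stop", "now"]

puts by_space.size       # 4
puts by_boundary.size    # 3
puts by_word_chars.size  # 4
```

Note that /\S+/ and /\w+/ happen to give the same count here while matching different things, so which pattern is right depends on what should count as a word.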

jamal86 · April 28, 2006, 11:04pm

One way is like this:

irb(main):020:0> a="This is a test."
=> "This is a test."
irb(main):021:0> a.scan(/\b\S.*?\b/).size
=> 4
irb(main):022:0>

The Regexp in line 21 rewritten in a more readable form is:

a.scan(/
  \b    (?# a word boundary )
  \S    (?# a character that is not whitespace )
  .*?   (?# zero or more further characters, matched non-greedily )
  \b    (?# a word boundary )
/x).size

Btw, the multi-line Regexp above only works because of the x at the end,
which makes it an extended regexp: whitespace inside the pattern is
ignored and comments are allowed.

Regards,
JJ

Help everyone. If you can’t do that, then at least be nice.

jamal86 · May 3, 2006, 7:08pm

“Marcel Molina Jr.” writes:

class String
  def words
    scan(/\b\S+\b/)
  end
end

And quite a bit more efficient, memory-wise:

class String
  def count_words
    n = 0
    scan(/\b\S+\b/) { n += 1 }
    n
  end
end

Making String#count take regexps would be nice (same for #delete).

jamal86 · April 28, 2006, 11:10pm

Just plain string.split.length will work as well, and should handle line
breaks too:

irb(main):001:0> "these are some words".split.length
=> 4
irb(main):002:0> "these are \n some\nwords".split.length
=> 4
irb(main):003:0> "these are \n some\nwords".split
=> ["these", "are", "some", "words"]
irb(main):004:0>

Hope that helps.

-Justin