Suggestions for a parsing strategy?

Hi all,

I have input strings that can look like this:

Common, Commerc(e, ial)

I need to parse these into the three words that this represents:

Common, Commerce, Commercial.

I’m a little new to ruby, and hence wondering what direction would be
best to go in? (.scan, regexes … something else?) For me, the
complication I’m not sure how to deal with is the two “levels” of the
comma as a separator.

Thanks,
Robb

On Friday 18 July 2008 21:59:56 Robb wrote:

I’m a little new to ruby, and hence wondering what direction would be
best to go in? (.scan, regexes … something else?) For me, the
complication I’m not sure how to deal with is the two “levels” of the
comma as a separator.

One way would be to find the exceptions first. Replace anything that
matches the

Commerc(e, ial)

pattern with the two words, as the literal string “Commerce, Commercial”.
Then you can just do a simple split on the commas, and maybe strip
whitespace.
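
A rough sketch of that expand-then-split idea, in case it helps. The method name expand_words and the exact regex are my own assumptions, not something from the original post, and it only handles a parenthetical attached directly to the end of a word:

```ruby
# Hypothetical helper: expand "Commerc(e, ial)" in place, then split on commas.
def expand_words(input)
  # Step 1: replace each root(suffix, suffix) group with the expanded words.
  expanded = input.gsub(/(\w+)\(([^)]*)\)/) do
    root = Regexp.last_match(1)
    suffixes = Regexp.last_match(2).split(",").map { |suffix| suffix.strip }
    suffixes.map { |suffix| root + suffix }.join(", ")
  end
  # Step 2: only top-level commas remain, so a plain split is enough.
  expanded.split(",").map { |word| word.strip }
end

p expand_words("Common, Commerc(e, ial)")
# prints ["Common", "Commerce", "Commercial"]
```

Doing the replacement first removes the second "level" of commas, which is why the final split can stay simple.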

On Fri, Jul 18, 2008 at 10:59 PM, Robb wrote:

Hi all,

I have input strings that can look like this:

Common, Commerc(e, ial)

I need to parse these into the three words that this represents:

Common, Commerce, Commercial.

This code does a lot of what you describe, provided the parenthetical
only appears at the end of a word.

====

s = "Common, Commerc(e, ial), Computer, Con(ic, ehead, temporary)"

def parse_word_list(s)
  # Captures: the root, the whole "(...)" part (junk), and the text inside it.
  s.scan(/(\w+)(\((.*?)\))?/).map { |root, junk, suffixes|
    [root, suffixes && suffixes.split(", ")]
  }
end

list = parse_word_list(s)

# see what's produced

p list

# use it to generate all words

list.each do |root, suffix_list|
  if suffix_list
    suffix_list.each do |suffix|
      puts "#{root}#{suffix}"
    end
  else
    puts root
  end
end

====

Hope that helps,

Eric


On Friday 18 July 2008 22:38:43 Eric I. wrote:

s.scan(/(\w+)(\((.*?)\))?/).map { |root, junk, suffixes|

This pattern looks really useful… Looking at the docs for scan, it looks
like it can take a block.
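
For what it's worth, here is a small sketch of the block form. The sample string and the words array are just illustrative assumptions; with a block, scan yields each match's capture groups instead of collecting them into an array of arrays:

```ruby
s = "Common, Commerc(e, ial), Computer"
words = []

# With a block, scan passes the capture groups of each match to the block.
s.scan(/(\w+)(\((.*?)\))?/) do |root, _junk, suffixes|
  if suffixes
    suffixes.split(", ").each { |suffix| words << root + suffix }
  else
    words << root
  end
end

p words
# prints ["Common", "Commerce", "Commercial", "Computer"]
```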

Which just leaves one question: Why isn’t this an Enumerator in Ruby 1.9?
I don’t think the original meaning (of producing an array) is made much
more difficult by the form

s.scan(/…/).to_a

And I suspect that it would most often be useful for things like #map, if
not used in block form outright. Making it an Enumerator would be somewhat
more efficient than building a whole array first – and more responsive, if
it’s a large string.

On Jul 19, 11:38 am, “Eric I.” wrote:

s.scan(/(\w+)(\((.*?)\))?/).map { |root, junk, suffixes|

s.scan(/(\w+)(?:\((.*?)\))?/) can avoid the “junk”: the (?: ... ) group
is non-capturing, so only the root and the suffixes are captured.
^^ Your pattern is great & helpful.
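
A minimal sketch of how the method might look with the non-capturing group, assuming the same ", " separator inside the parentheses as in the earlier post:

```ruby
def parse_word_list(s)
  # (?: ... ) groups without capturing, so each match yields only
  # [root, suffixes] and the block needs no throwaway "junk" parameter.
  s.scan(/(\w+)(?:\((.*?)\))?/).map do |root, suffixes|
    [root, suffixes && suffixes.split(", ")]
  end
end

p parse_word_list("Common, Commerc(e, ial)")
# prints [["Common", nil], ["Commerc", ["e", "ial"]]]
```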

David M. wrote:

Why isn’t [the return value of scan] an Enumerator in Ruby 1.9?

Or 1.8.7 for that matter. Yes, I’ve been asking myself this very same
question since the release of 1.9.

And I suspect that it would most often be useful for things like #map, if
not used in block form outright. Making it an Enumerator would be somewhat
more efficient than building a whole array first – and more responsive, if
it’s a large string.

Also it’d allow you to use the matchdata object inside map if you need
to. The way it is now you’d have to do:

  string.enum_for(:scan, /re/).map do
    md = Regexp.last_match
    do_something_with md
  end

instead of just
string.scan.map do…end
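
To make the enum_for idiom concrete, here is a small sketch. The sample string and the choice to record each match's start offset are assumptions for illustration only:

```ruby
s = "Common, Commerc(e, ial)"

# enum_for(:scan, re) wraps the block form of scan in an Enumerator, so
# Regexp.last_match inside map refers to the match currently being yielded.
positions = s.enum_for(:scan, /\w+/).map do
  md = Regexp.last_match
  [md[0], md.begin(0)]
end

p positions
# prints [["Common", 0], ["Commerc", 8], ["e", 16], ["ial", 19]]
```

The start offsets come from the MatchData object, which a plain scan without a block does not give you.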

Making it an Enumerator would be somewhat more
efficient than building a whole array first – and more responsive, if it’s a
large string.

There is also the StringScanner class that can be used to return one
match at a time.
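
A rough sketch of the StringScanner approach applied to the original problem. The two patterns and the loop structure are my own assumptions about how one might use it here, not a canonical recipe:

```ruby
require "strscan"

scanner = StringScanner.new("Common, Commerc(e, ial), Computer")
words = []

until scanner.eos?
  if scanner.scan(/(\w+)\((.*?)\)/)
    # A root followed directly by a parenthesised suffix list.
    root = scanner[1]
    scanner[2].split(", ").each { |suffix| words << root + suffix }
  elsif scanner.scan(/\w+/)
    # A plain word with no suffix list.
    words << scanner.matched
  else
    # Anything else (commas, spaces) is a separator: skip one character.
    scanner.getch
  end
end

p words
# prints ["Common", "Commerce", "Commercial", "Computer"]
```

Each call to scan only tries to match at the current position and advances past what it consumed, so the string is processed one match at a time.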