Suggestions for a parsing strategy?

Hi all,

I have input strings that can look like this:

Common, Commerc(e, ial)

I need to parse these into the three words that this represents:

Common, Commerce, Commercial.

I’m a little new to ruby, and hence wondering what direction would be
best to go in? (.scan, regexes … something else?) For me, the
complication I’m not sure how to deal with is the two “levels” of the
comma as a separator.


On Friday 18 July 2008 21:59:56 Robb wrote:

I’m a little new to ruby, and hence wondering what direction would be
best to go in? (.scan, regexes … something else?) For me, the
complication I’m not sure how to deal with is the two “levels” of the
comma as a separator.

One way would be to find the exceptions first. Replace anything that
matches the

Commerc(e, ial)

pattern with the two words, as the literal string “Commerce,
Commercial”. Then you can just do a simple split on the commas, and
maybe strip whitespace.
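
A minimal sketch of that replace-then-split idea, assuming the parenthetical always directly follows a word and is never nested. The method name expand_words is made up for illustration and is not from the thread:

```ruby
# Hypothetical helper; the post above only describes the approach in prose.
def expand_words(input)
  # Step 1: rewrite "Commerc(e, ial)" as "Commerce, Commercial".
  flattened = input.gsub(/(\w+)\(([^)]*)\)/) do
    root = $1
    suffixes = $2
    suffixes.split(",").map { |suffix| root + suffix.strip }.join(", ")
  end
  # Step 2: with the parentheses gone, every comma is a top-level separator.
  flattened.split(",").map(&:strip)
end

p expand_words("Common, Commerc(e, ial)")
# => ["Common", "Commerce", "Commercial"]
```

Expanding the exceptions first means the second step never has to know about the two “levels” of comma.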

On Fri, Jul 18, 2008 at 10:59 PM, Robb wrote:

Hi all,

I have input strings that can look like this:

Common, Commerc(e, ial)

I need to parse these into the three words that this represents:

Common, Commerce, Commercial.

This code does a lot of what you describe, providing the parenthetical
only appears at the end.


s = "Common, Commerc(e, ial), Computer, Con(ic, ehead, temporary)"

def parse_word_list(s)
  # Each match yields [root, parenthetical with parens, text inside parens].
  s.scan(/(\w+)(\((.*?)\))?/).map { |root, junk, suffixes|
    [root, suffixes && suffixes.split(", ")]
  }
end

list = parse_word_list(s)

# see what's produced
p list

# use it to generate all words
list.each do |root, suffix_list|
  if suffix_list
    suffix_list.each do |suffix|
      puts "#{root}#{suffix}"
    end
  else
    puts root
  end
end

Hope that helps,



On Friday 18 July 2008 22:38:43 Eric I. wrote:

s.scan(/(\w+)(\((.*?)\))?/).map { |root, junk, suffixes|

This pattern looks really useful… Looking at the docs for scan, it
looks like it can take a block.

Which just leaves one question: Why isn’t this an Enumerator in Ruby
1.9? I don’t think the original meaning (of producing an array) is made
much more difficult by the form


And I suspect that it would most often be useful for things like #map,
if not used in block form outright. Making it an Enumerator would be
somewhat more efficient than building a whole array first – and more
responsive, if it’s a large string.
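
For reference, the block form of scan mentioned above might look like the sketch below. The sample string and the words accumulator are only for illustration, and the regex uses a non-capturing group around the parentheses so the block receives just two values:

```ruby
s = "Common, Commerc(e, ial)"
words = []

# With a block, scan yields each match's captures as it finds them
# instead of returning an array of all matches.
s.scan(/(\w+)(?:\((.*?)\))?/) do |root, suffixes|
  if suffixes
    suffixes.split(", ").each { |suffix| words << root + suffix }
  else
    words << root
  end
end

p words
# => ["Common", "Commerce", "Commercial"]
```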

On Jul 19, 11:38 am, “Eric I.” wrote:

s.scan(/(\w+)(\((.*?)\))?/).map { |root, junk, suffixes|

s.scan(/(\w+)(?:\((.*?)\))?/) can avoid the “junk”.
^^ Your pattern is great and helpful.
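
As a rough illustration of the non-capturing variant, the sketch below returns the finished words directly. The method name words_from is hypothetical, and flat_map needs a Ruby newer than the versions discussed in this thread (1.9.2 or later):

```ruby
# Hypothetical helper built on the (?:...) version of the pattern.
def words_from(s)
  # Only two groups are captured now: the root and the text inside the parens.
  s.scan(/(\w+)(?:\((.*?)\))?/).flat_map do |root, suffixes|
    suffixes ? suffixes.split(/,\s*/).map { |suffix| root + suffix } : [root]
  end
end

p words_from("Common, Commerc(e, ial), Computer")
# => ["Common", "Commerce", "Commercial", "Computer"]
```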

David M. wrote:

Why isn’t [the return value of scan] an Enumerator in Ruby 1.9?

Or 1.8.7 for that matter. Yes, I’ve been asking myself this very same
question since the release of 1.9.

And I suspect that it would most often be useful for things like #map, if
not used in block form outright. Making it an Enumerator would be somewhat
more efficient than building a whole array first – and more responsive, if
it’s a large string.

Also it’d allow you to use the MatchData object inside map if you need
to. The way it is now you’d have to do:

string.enum_for(:scan, /re/).map do
  md = Regexp.last_match
  do_something_with md
end

instead of just do…end

Making it an Enumerator would be somewhat more
efficient than building a whole array first – and more responsive, if it’s a
large string.

There is also the StringScanner class that can be used to return one
match at a time.
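
A small sketch of how StringScanner could pull matches one at a time for the word-list problem from this thread. The pattern is the non-capturing variant posted earlier, and the sample string and variable names are only illustrative:

```ruby
require "strscan"

scanner = StringScanner.new("Common, Commerc(e, ial), Computer")
words = []

# scan_until advances past the next match and returns nil once none is left,
# so only one match is processed per loop iteration.
while scanner.scan_until(/(\w+)(?:\((.*?)\))?/)
  root = scanner[1]      # first capture group of the latest match
  suffixes = scanner[2]  # nil when the word has no parenthetical
  if suffixes
    suffixes.split(", ").each { |suffix| words << root + suffix }
  else
    words << root
  end
end

p words
# => ["Common", "Commerce", "Commercial", "Computer"]
```

Because the loop asks for the next match explicitly, it can stop early without scanning the rest of a large string.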