Forum: Ruby Parsing a string using multiple regexs

Web M. (Guest)
on 2006-06-01 03:05
To whomever can help,

I need to parse a string into parts using multiple regexs, and I'm new
to Ruby, so I have NO idea how to do it. Basically, I'm trying to parse
tags that users might add on a webpage.

Example: In my tags input box, user enters the following:
        "my first tag", second tag, third_tag fourth_tag

The order of precedence is quotes, then commas then spaces, so my
results should be a ruby array which contains the values:

 my first tag
 second tag
 third_tag
 fourth_tag

which means the array will have 4 elements.

I desperately need help. Anyone ?

Thanks,

Jason
Jeff P. (Guest)
on 2006-06-01 07:14
Hi Jason,
I'm a newbie too, but thought I might be able to help on this one.
Unfortunately, I think we need a little bit more information about what
to expect in the string.  Is it always specifically of the form:
something in quotes followed by something between commas followed by two
space separated other things?

Or might the order be different, or some part of it is missing, etc.

Since each part has a different type of separation, it almost seems as
if you will need to look for each part separately.

maybe start out with:
a = []                 # array to collect the tags
tags =~ /"(.*?)"/      # first tag is anything inside quotes
a[0] = $1              # parentheses form a group, which is stored in the $1 variable

tags =~ /,\s*([^,]+?)\s*,/    # second tag is anything between commas
a[1] = $1

tags =~ /.*,\s*(.*)/   # get everything after the last comma (would this work?)
a[2], a[3] = $1.split(' ')     # split whatever is left at the space and put in tags 3 and 4

Being a newbie myself, I can almost guarantee three things about this
reply:
1) three or four other people probably beat me to it while I was
figuring it out
2) there's bound to be at least four errors in my answer - but at least
I gave it a shot
3) there's probably a much more eloquent solution (or four of them since
it's ruby)

best of luck learning this bizarre but fascinating language,
jp


Ezra Z. (Guest)
on 2006-06-01 07:49
(Received via mailing list)

   def parse_tags(input)
     tags = []
     # pull out the quoted tags
     input.gsub!(/\"(.*?)\"\s*/ ) { tags << $1; "" }
     # replace all commas with a space
     input.gsub!(/,/, " ")
     # get whatever's left
     tags.concat input.split(/\s/)
     # strip whitespace from the names
     tags.map! { |t| t.strip }
     # delete any blank tag names
     tags = tags.delete_if { |t| t.empty? }
     return tags
   end
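
One thing to watch with the method above: gsub! changes the string the caller passed in. A minimal sketch of a variant that works on a copy instead, assuming that side effect is unwanted. The name parse_tags_copy is hypothetical and not part of the original post.

```ruby
# Hypothetical variant of parse_tags above; the name parse_tags_copy is ours.
def parse_tags_copy(input)
  rest = input.dup   # work on a copy so the caller's string is left alone
  tags = []
  # pull out the quoted tags, as in the original
  rest.gsub!(/"(.*?)"\s*/) { tags << $1; "" }
  # turn commas into spaces, then split on runs of whitespace
  tags.concat(rest.tr(",", " ").split)
  # strip whitespace and drop blank names
  tags.map(&:strip).reject(&:empty?)
end

original = '"my first tag", second tag, third_tag fourth_tag'
p parse_tags_copy(original)
```

For the example input this gives five tags, because "second tag" is split at its space just as in the original method: my first tag, second, tag, third_tag, fourth_tag.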


-Ezra
Jeff P. (Guest)
on 2006-06-01 08:12
Thanks Ezra,
I was sure I would learn something by attempting an answer.  I learned
several somethings:

tags << next answer
(cleaner than using subscripts)

input.gsub!(/\"(.*?)\"\s*/ ) { tags << $1; "" }
(remove the stuff you find as you find it - brilliant!)

/\"(.*?)\"\s*/
(escaping the quotes is harmless but not required inside a regex literal; use ? to make .* less greedy)

# replace all commas with a space
(if you can't get there from here, go someplace else first!)

tags.concat input.split(/\s/)
(another cool way to shove some more answers into the answer array)

# strip whitespace from the names
tags.map! { |t| t.strip }
(instead of making the regexp more complicated, do it simply and then
clean up the results)

# delete any blank tag names
tags = tags.delete_if { |t| t.empty? }
(yet another iterator I never heard of before)


 def parse_tags(input)
 ...
end
(modularize everything as you go!)
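
The greedy versus non-greedy point in the breakdown above is easiest to see side by side. A small sketch, using a made-up input string with two quoted parts:

```ruby
# Made-up input with two quoted sections, for illustration only.
line = '"one" middle "two"'

greedy = line[/"(.*)"/, 1]    # .* runs on to the LAST closing quote
lazy   = line[/"(.*?)"/, 1]   # .*? stops at the FIRST closing quote

puts greedy  # one" middle "two
puts lazy    # one
```

With the greedy form, two quoted tags on one line would be swallowed as a single tag, which is why the ? matters here.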


P.S. Ezra, thanks for recommending RimuHosting, they've been great!

best,
jp


Web M. (Guest)
on 2006-06-02 00:47
Jeff, Ezra,

Thanks tons for the help. You guys both helped a lot. My only question
so far, for Ezra, is this: My second tag is "second tag", not in quotes,
but with a space between the two words. The tag IS meant to be two words
long. So I'm thinking that if you replace all the commas with spaces,
won't that split my second tag into two tags, when it was meant to be
just one tag?

Jeff,

The tags can appear in any order, and they can either be quoted,
separated by commas, or separated by spaces. I've decided that when
all three appear, I would like quotes to take precedence, followed by
commas, followed by spaces. So, for example, the following string:

"hello, i love you", was written by, the doors

i think would have the expected behaviour of creating the following tags

tag1 = "hello, i love you"
tag2 = was written by
tag3 = the doors

whereas the line

"hello, i love you" was written by the doors

would be

tag1 = "hello, i love you"
tag2 = was
tag3 = written
tag4 = by
tag5 = the
tag6 = doors

This is just the way I thought I should handle the user's input. If
anyone has better suggestions, I'm open to them. The problem being
solved is how to take user input typed in a textbox and split it into
tags, while handling things like quotes and commas.

thanks again

- jason

Logan C. (Guest)
on 2006-06-02 01:13
(Received via mailing list)
I must say this has to be the best breakdown of someone else's ruby
code I've ever seen. Maybe you should do a ruby-quiz summary sometime.
Matthew S. (Guest)
on 2006-06-02 01:54
(Received via mailing list)
On Jun 1, 2006, at 21:48, web mail wrote:

> Jeff, Ezra,
>
> Thanks tons for the help. You guys both helped a lot. My only question
> so far, for Ezra, is this: My second tag is "second tag", not in
> quotes,
> but with a space between the two words. The tag IS meant to be two
> words
> long. So I'm thinking that if you replace all the commas with spaces,
> wont that split my second tag into two tags, when it was meant to be
> just one tag ?

You're right, that's what would happen.  The problem is that in the
case where a non-quoted multi-word tag (second tag) occurs in the
same input as two space-delimited single-word tags (third_tag
fourth_tag), you're out of luck.  This isn't a problem that's
solvable without some sort of semantic knowledge, potentially quite a
bit of it.  For example, how would you distinguish between ("first
tag", second tag, dog pile) and ("first tag", second tag, dog pile)
where (dog pile)[1] is meant as one and two tags, respectively?

> The tags can appear in any order, and they can either be quoted,
> separated by commas, or separated by spaces. I've decided that in the
> case of all three, i would like quotes to have precedence, followed by
> commas, followed by spaces. so, for an example, the following string:

I think that you're using 'precedence' in a way that's not quite
formally correct.  Commas don't have precedence over spaces in the
same way that * has precedence over +.  From your examples, a more
accurate description might be: if commas occur outside of a quoted
tag, assume that tags are comma-delimited; otherwise, assume that
tags are space-delimited.
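
A minimal sketch of that rule, assuming quoted tags are pulled out first and never contain a nested quote. The method name parse_tags_by_rule is hypothetical.

```ruby
# Hypothetical helper; implements "commas outside quotes => comma-delimited,
# otherwise space-delimited".
def parse_tags_by_rule(input)
  rest = input.dup
  tags = []
  # collect quoted tags and leave a space where each one was
  rest.gsub!(/"([^"]*)"/) { tags << $1; " " }
  # any comma still present must have been outside the quotes
  pieces = rest.include?(",") ? rest.split(",") : rest.split
  tags.concat(pieces)
  tags.map(&:strip).reject(&:empty?)
end

p parse_tags_by_rule('"hello, i love you", was written by, the doors')
p parse_tags_by_rule('"hello, i love you" was written by the doors')
```

The first call gives three tags (hello, i love you / was written by / the doors) and the second gives six, matching the two examples quoted earlier. Note that under this rule the original example yields "third_tag fourth_tag" as a single tag, which is the ambiguity discussed next.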

Even that, though, leaves you with a bit of a problem making the dog-
pile decision, if it's the only input.  Is it meant as a single tag
in a comma-delimited list (convention being to leave off the final
delimiter in lists)?  Or is it meant as two tags in a space-delimited
list?  It's still ambiguous as to what the user's intention might
have been without delving into some sort of semantics.

Another poster (can't remember whom) interpreted your specification
as "after the last comma, assume things are space-delimited", which
might also be an option.
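
For comparison, a rough sketch of that "after the last comma" reading, under the same assumptions about quotes as before. The method name parse_tags_last_comma is hypothetical.

```ruby
# Hypothetical helper; everything before the last comma is comma-delimited,
# everything after it is space-delimited.
def parse_tags_last_comma(input)
  rest = input.dup
  tags = []
  rest.gsub!(/"([^"]*)"/) { tags << $1; " " }
  # rpartition returns ["", "", rest] when there is no comma at all
  head, _sep, tail = rest.rpartition(",")
  tags.concat(head.split(","))
  tags.concat(tail.split)
  tags.map(&:strip).reject(&:empty?)
end

p parse_tags_last_comma('"my first tag", second tag, third_tag fourth_tag')
```

This produces the four tags from the original question (my first tag, second tag, third_tag, fourth_tag), but it would also split a final multi-word tag such as "the doors" into two, so it trades one ambiguity for another.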

For reasons of simplicity (for the user - you're free to make your
job as hard as you please!), though, I'd suggest that it would be
best to stick to a single type of delimiter.  I prefer spaces,
myself, since tags tend to be single words, and people are used to
that sort of input (not only from the canonical examples of flickr or
del.icio.us, but because it mimics search engines, for instance).

Matthew S.

[1] "Dog pile" in this sense: http://en.wikipedia.org/wiki/Pile-on
which I thought of because, being the oldest and largest of all my
cousins and siblings, it was always my misfortune to be on the bottom.
Web M. (Guest)
on 2006-06-02 03:59
Matthew,

Excellent comments. You described exactly the effect I was going for,
and also perfectly described the kind of conundrums that can take place
when doing something like this. Since I was going for quotes, then
commas, then spaces only if no commas existed, I have modified the code
Ezra so kindly provided to look like this:

   def parse_tags(input)
     tags = []
     # pull out the quoted tags
     input.gsub!(/\"(.*?)\"/ ) { tags << $1; "" }
     # pull out comma separated tags
     if input.include? ","
       # find all tags that end with comma - ex: tag1,tag2,tag3 ==> tag1,tag2,
       input.gsub!(/(.+?),/) { tags << $1; "" }
       # find all tags that begin with a comma - ex: tag1, tag2, tag3 ==> , tag2 ,  tag3
       input.gsub!(/,(.+?)/) { tags << $1; "" }
       tags << input
     else
       tags.concat input.split(/\s/)
     end

     # get whatever's left
     #    tags.concat input.split(/,/)
     # strip whitespace from the names
     tags.map! { |t| t.strip }
     # delete any blank tag names
     tags = tags.delete_if { |t| t.empty? }
     return tags
   end

#below is to test the function
puts parse_tags('"jay" hello mary, goodbye stranger, its been long, "hello again"')


You'll notice the if/else/end section, which looks for commas and then
splits the tags by either commas or spaces accordingly.

Thanks

- Jason

Matthew S. wrote:
> I think that you're using 'precedence' in a way that's not quite
> formally correct.  Commas don't have precedence over spaces in the
> same way that * has precendence over +.  From your examples, a more
> accurate description might be: if commas occur outside of a quoted
> tag, assume that tags are comma-delimited; otherwise, assume that
> tags are space-delimited.

EXACTLY what i meant.

>
> Even that, though, leaves you with a bit of a problem making the dog-
> pile decision, if it's the only input.  Is it meant as a single tag
> in a comma-delimited list (convention being to leave off the final
> delimiter in lists)?  Or is it meant as two tags in a space-delimited
> list?  It's still ambiguous as to what the user's intention might
> have been without delving into some sort of semantics.
>
> Another poster (can't remember whom) interpreted your specification
> as "after the last comma, assume things are space-delimited", which
> might also be an option.
>
> For reasons of simplicity (for the user - you're free to make your
> job as hard as you please!), though, I'd suggest that it would be
> best to stick to a single type of delimiter.  I prefer spaces,
> myself, since tags tend to be single words, and people are used to
> that sort of input (not only from the canonical examples of flickr or
> del.icio.us, but because it mimics search engines, for instance).
>
> Matthew S.
>
> [1] "Dog pile" in this sense: http://en.wikipedia.org/wiki/Pile-on
> which I thought of because, being the oldest and largest of all my
> cousins and siblings, it was always my misfortune to be on the bottom.
Jeff P. (Guest)
on 2006-06-02 04:55
Thanks Logan,
I would like to encourage other newbies out there to make an attempt to
answer questions like this one.  You really do learn a lot more from the
"correct" solutions when you've already taken a stab at it yourself.
Also, I have to say that this forum is remarkably free of trolls and
assholes.  Pretty safe to throw some pre-pubescent ruby code out there
and have it tenderly corrected by the experts.  Many other places I've
been this would not be the case.

thanks,
jp


Robert K. (Guest)
on 2006-06-22 17:16
(Received via mailing list)
2006/6/2, web mail:
>      tags = []
>      tags << input
>     return tags
>   end
>
> #below is to test the function
> puts parse_tags ('"jay" hello mary, goodbye stranger, its been long,
> "hello again"')
>
>
> You'll notice the if else end section which looks for commas, and then
> decides to split tags by either commas, or spaces, conditionally

This doesn't work as expected:

?> parse_tags '"my first tag", second tag, third_tag fourth_tag'
=> ["my first tag", ", second tag", "third_tag fourth_tag"]

Like Matthew said, you have a problem with regard to your separators
that cannot be solved without either additional rules / logic (that
you do not apply in your method) or a change of the rule set.
Personally I'd simply drop the rule that space can serve as separator
*or* treat space and comma equally:

>> s='"my first tag", second tag, third_tag fourth_tag'
=> "\"my first tag\", second tag, third_tag fourth_tag"

>> s.scan(%r{"[^"]*"|[^\s,]+(?:\s[^\s,]*)*})
=> ["\"my first tag\"", "second tag", "third_tag fourth_tag"]

>> s.scan(%r{"[^"]*"|[^\s,]+})
=> ["\"my first tag\"", "second", "tag", "third_tag", "fourth_tag"]
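
The scan results above keep the surrounding quote characters. A small sketch of one way to drop them, assuming space and comma are treated equally as in the second scan. The method name scan_tags is hypothetical.

```ruby
# Hypothetical helper; the first group captures the inside of a quoted tag,
# the second captures a bare word. scan yields [quoted, bare] pairs in which
# the group that did not take part is nil.
def scan_tags(input)
  input.scan(/"([^"]*)"|([^\s,]+)/)
       .map { |quoted, bare| quoted || bare }
       .reject(&:empty?)
end

p scan_tags('"my first tag", second tag, third_tag fourth_tag')
```

For the example string this returns my first tag, second, tag, third_tag and fourth_tag, without the quotes.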

Kind regards

robert