Forum: Ruby Splitting strings on spaces, unless inside quotes

Richard Livsey (Guest)
on 2006-01-07 01:10
(Received via mailing list)
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

So far I'm drawing a blank on the 'Ruby way' to do this and the only
solutions I can think of are turning out to be fairly ugly.

Any advice would be great. Thanks in advance.
Eero Saynatkari (Guest)
on 2006-01-07 02:01
(Received via mailing list)
On 2006.01.07 09:08, Richard Livsey wrote:
> solutions I can think of are turning out to be fairly ugly.
>
> Any advice would be great. Thanks in advance.

Naively, you can try something like this:

   s = 'foo bar "baz quux" roo'
   s.scan(/(?:"")|(?:"(.*[^\\])")|(\w+)/).flatten.compact

Elaborate as necessary (add support for single quotes or something).

> R.Livsey


E
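
A minimal sketch of the "add support for single quotes" elaboration suggested above. The helper name split_quoted is made up for illustration, and the pattern assumes quotes are balanced and never escaped. It uses [^"]* rather than a greedy .* so that two quoted runs on one line are not merged into a single match.

```ruby
# Hypothetical helper, not from the thread: splits on whitespace but keeps
# double-quoted and single-quoted runs together (no escape handling).
def split_quoted(str)
  # Alternatives, in order: "double quoted", 'single quoted', bare word.
  # scan returns one [dq, sq, bare] triple per match, with nil for the
  # groups that did not take part, hence flatten.compact.
  str.scan(/"([^"]*)"|'([^']*)'|(\S+)/).flatten.compact
end

p split_quoted(%q{foo bar "baz quux" 'single quoted' roo})
# => ["foo", "bar", "baz quux", "single quoted", "roo"]
```

A bare word such as can't is still matched whole by \S+, because the quoted alternatives are only tried at the start of a token.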
Tim Heaney (Guest)
on 2006-01-07 02:13
(Received via mailing list)
Richard Livsey <richard@livsey.org> writes:

> I want to split a string into words, but group quoted words together
> such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

How about the csv module? Despite the name, you don't have to use
commas.

  require 'csv'
  CSV::parse_line('some words "some quoted text" some more words', ' ')

I hope this helps,

Tim
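
The positional separator argument shown above is the form used by the csv library of that era. In current versions of the library the separator is passed as the col_sep: keyword instead. A short sketch, with csv_words as a made-up wrapper name:

```ruby
require 'csv'

# Hypothetical wrapper, not from the thread: parses one line using a single
# space as the column separator.
def csv_words(line)
  CSV.parse_line(line, col_sep: ' ')
end

p csv_words('some words "some quoted text" some more words')
# => ["some", "words", "some quoted text", "some", "more", "words"]
```

One caveat with this approach: as far as I know, a run of two or more spaces is read as empty fields, which show up as nil in the result.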
James Gray (bbazzarrakk)
on 2006-01-07 02:55
(Received via mailing list)
On Jan 6, 2006, at 6:08 PM, Richard Livsey wrote:

> only solutions I can think of are turning out to be fairly ugly.
>
> Any advice would be great. Thanks in advance.

I agree that CSV is the way to go, but here's a direct attempt:

 >> example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
 >> example.scan(/\s+|\w+|"[^"]*"/).
?>         reject { |token| token =~ /^\s+$/ }.
?>         map { |token| token.sub(/^"/, "").sub(/"$/, "") }
=> ["some", "words", "some quoted text", "some", "more", "words"]

Hope that gives you some fresh ideas.

James Edward Gray II
Matthew Moss (Guest)
on 2006-01-07 02:55
(Received via mailing list)
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'

sa = s.split(/"/).collect { |x| x.strip }
(0...sa.size).to_a.zip(sa).collect { |i,x| (i&1).zero? ? x.split : x }.flatten
Michael 'entropie' Trommer (Guest)
on 2006-01-07 02:55
(Received via mailing list)
* James Edward Gray II (james@grayproductions.net) wrote:
> >> example = %Q{some words "some quoted text" some more words}
> => "some words \"some quoted text\" some more words"
> >> example.scan(/\s+|\w+|"[^"]*"/).
> ?>         reject { |token| token =~ /^\s+$/ }.
> ?>         map { |token| token.sub(/^"/, "").sub(/"$/, "") }
> => ["some", "words", "some quoted text", "some", "more", "words"]

impressive


So long
--
Michael 'entropie' Trommer;  http://ackro.org

ruby -e "0.upto((a='njduspAhnbjm/dpn').size-1){|x| a[x]-=1}; p 'mailto:'+a"
Matthew Moss (Guest)
on 2006-01-07 03:01
(Received via mailing list)
> (0...sa.size).to_a.zip(sa).collect { |i,x| (i&1).zero? ? x.split : x }.flatten

Just realized that Range responds to zip, so the to_a is unnecessary.

This looks slightly cleaner to me:

(1..sa.size).zip(sa).collect { |i,x| (i&1).zero? ? x : x.split }.flatten
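
The same split-on-quote idea can be sketched without the index/zip step by using each_with_index and flat_map. This assumes balanced, unescaped double quotes; split_on_quotes is a made-up name. Unlike the version above, it does not strip whitespace inside the quoted chunk.

```ruby
# Hypothetical helper, not from the thread. After splitting on the quote
# character, even-numbered chunks lie outside quotes and odd-numbered chunks
# lie inside them.
def split_on_quotes(str)
  str.split('"').each_with_index.flat_map do |chunk, i|
    # Outside quotes: split on whitespace. Inside quotes: keep as one word.
    i.even? ? chunk.split : [chunk]
  end
end

p split_on_quotes('some words "some quoted text" some more words')
# => ["some", "words", "some quoted text", "some", "more", "words"]
```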
Xavier Noria (Guest)
on 2006-01-07 03:19
(Received via mailing list)
On Jan 7, 2006, at 1:08, Richard Livsey wrote:

> I want to split a string into words, but group quoted words
> together such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

Curiously, someone asked exactly that on freenode#perl tonight.

If the input is that simple and is assumed to be well-formed this is
enough:

irb(main):005:0> %q{some words "some quoted text" some "" more words}.scan(/"[^"]*"|\S+/)
=> ["some", "words", "\"some quoted text\"", "some", "\"\"", "more", "words"]

Since nothing was said about this, it does not handle escaped quotes,
and it assumes quotes are always balanced, so a field cannot be
%q{"foo}, for example.

-- fxn
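
One way to make the well-formedness assumption explicit is to check the quote count before scanning. This is only a sketch under the same assumptions as the post above (no escaped quotes); scan_fields is a made-up name, and raising ArgumentError is an arbitrary choice.

```ruby
# Hypothetical helper, not from the thread: refuses input with an odd number
# of double quotes instead of silently returning a field such as "\"foo".
def scan_fields(str)
  raise ArgumentError, 'unbalanced double quotes' if str.count('"').odd?
  # Quoted run first, then any run of non-whitespace; quotes are kept.
  str.scan(/"[^"]*"|\S+/)
end

p scan_fields(%q{some "" more})
# => ["some", "\"\"", "more"]
```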
unknown (Guest)
on 2006-01-07 03:34
(Received via mailing list)
Hi --

On Sat, 7 Jan 2006, James Edward Gray II wrote:

>>
>> So far I'm drawing a blank on the 'Ruby way' to do this and the only
>> solutions I can think of are turning out to be fairly ugly.
>>
>> Any advice would be great. Thanks in advance.
>
> I agree that CSV is the way to go, but here's a direct attempt:

Me too (end of disclaimer :-)


> >> example = %Q{some words "some quoted text" some more words}
> => "some words \"some quoted text\" some more words"
> >> example.scan(/\s+|\w+|"[^"]*"/).
> ?>         reject { |token| token =~ /^\s+$/ }.
> ?>         map { |token| token.sub(/^"/, "").sub(/"$/, "") }
> => ["some", "words", "some quoted text", "some", "more", "words"]

I think you could do less work:

   example.scan(/"[^"]+"|\S+/).map { |word| word.delete('"') }

(Or am I overlooking some reason you'd want to capture sequences of
spaces?)

I changed the \w+ to \S+ (and moved it after the | to avoid having it
sponge up too much) in case the words included non-\w characters.

I guess with zero-width positive lookbehind/ahead one could do it
without the map operation.


David

--
David A. Black
dblack@wobblini.net

"Ruby for Rails", from Manning Publications, coming April 2006!
http://www.manning.com/books/black
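
The point in the post above about putting \S+ after the | can be seen by running both orderings. Alternation tries the left branch first, so a leading \S+ swallows the opening quote together with the first word. The constant names are made up for this sketch.

```ruby
# Hypothetical constants, not from the thread.
QUOTED_FIRST = /"[^"]+"|\S+/   # quoted run is tried before a bare word
BARE_FIRST   = /\S+|"[^"]+"/   # \S+ wins at every token, quotes included

example = %q{some words "some quoted text" some more words}

p example.scan(QUOTED_FIRST)
# => ["some", "words", "\"some quoted text\"", "some", "more", "words"]
p example.scan(BARE_FIRST)
# => ["some", "words", "\"some", "quoted", "text\"", "some", "more", "words"]
```

Note that the delete('"') step in the post removes every double quote in a token, including one in the middle of a bare word.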
James Gray (bbazzarrakk)
on 2006-01-07 03:46
(Received via mailing list)
On Jan 6, 2006, at 8:33 PM, dblack@wobblini.net wrote:

>
> (Or am I overlooking some reason you'd want to capture sequences of
> spaces?)
>
> I changed the \w+ to \S+ (and moved it after the | to avoid having it
> sponge up too much) in case the words included non-\w characters.

You're right, that's better all around.

> I guess with zero-width positive lookbehind/ahead one could do it
> without the map operation.

You can drop the map(), if you're willing to replace it with two
other calls:

 >> example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
 >>  example.scan(/"([^"]+)"|(\S+)/).flatten.compact
=> ["some", "words", "some quoted text", "some", "more", "words"]

James Edward Gray II
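
To see why flatten.compact is needed: with two capture groups, scan returns one [quoted, bare] pair per match, and the group that did not take part is nil. A sketch that picks the non-nil element directly, with words_from_groups as a made-up name:

```ruby
# Each match yields a pair; exactly one side is non-nil.
p 'a "b c"'.scan(/"([^"]+)"|(\S+)/)
# => [[nil, "a"], ["b c", nil]]

# Hypothetical helper, not from the thread: same result as flatten.compact,
# but in a single pass over the pairs.
def words_from_groups(str)
  str.scan(/"([^"]+)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

p words_from_groups(%q{some words "some quoted text" some more words})
# => ["some", "words", "some quoted text", "some", "more", "words"]
```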
Florian Groß (Guest)
on 2006-01-07 05:01
(Received via mailing list)
Richard Livsey wrote:

> I want to split a string into words, but group quoted words together
> such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

Try this:
unknown (Guest)
on 2006-01-07 05:01
(Received via mailing list)
On Sat, 7 Jan 2006, Tim Heaney wrote:

>
> How about the csv module? Despite the name, you don't have to use
> commas.
>
>  require 'csv'
>  CSV::parse_line('some words "some quoted text" some more words', ' ')
>
> I hope this helps,

brilliant!

-a
William James (Guest)
on 2006-01-09 10:13
(Received via mailing list)
Richard Livsey wrote:
> I want to split a string into words, but group quoted words together
> such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'
p s.split( / *"(.*?)" *| / )
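
The split above works because String#split includes the text of any capture group in its result, so the quoted text survives even though it is part of the delimiter. One edge case is a string that starts with a quote, which produces a leading empty string. A sketch that filters it out, with split_keeping_quoted as a made-up name:

```ruby
# Hypothetical helper, not from the thread. The delimiter is either a quoted
# run (whose contents are captured and therefore kept) or a single space.
def split_keeping_quoted(str)
  # reject(&:empty?) drops the "" that appears when the string starts with a
  # quote; it also drops a deliberately empty "" field, which may not be
  # what is wanted.
  str.split(/ *"(.*?)" *| /).reject(&:empty?)
end

p split_keeping_quoted('some words "some quoted text" some more words')
# => ["some", "words", "some quoted text", "some", "more", "words"]
p split_keeping_quoted('"a b" c')
# => ["a b", "c"]
```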
Geoff Jacobsen (Guest)
on 2006-01-09 11:31
(Received via mailing list)
On Mon, 2006-01-09 at 18:13 +0900, William James wrote:
> s = 'some words "some quoted text" some more words'
> p s.split( / *"(.*?)" *| / )
>

Which along with the CSV solution can't handle complex cases:

 s='one two"  "\'with quotes\' "three "'

 s.split( / *"(.*?)" *| / )
 => ["one", "two", "  ", "'with", "quotes'", "three "]

 require 'csv'
 CSV::parse_line(s)
 => []

but Shellwords can:

 require 'shellwords'
 Shellwords.shellwords(s)
 => ["one", "two  with quotes", "three "]
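
For reference, the same Shellwords call is also reachable under two other names in the standard library: Shellwords.split, and String#shellsplit once shellwords has been required. A short sketch on the same input, written with %q{} so the apostrophes need no escaping:

```ruby
require 'shellwords'

# Same input as in the post above.
s = %q{one two"  "'with quotes' "three "}

p Shellwords.split(s)
# => ["one", "two  with quotes", "three "]
p s.shellsplit
# => ["one", "two  with quotes", "three "]
```

Adjacent quoted and unquoted pieces with no whitespace between them are joined into one word, following shell rules, which is why "two", the two spaces and 'with quotes' end up in a single element.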
Robert Klemme (Guest)
on 2006-01-09 12:53
(Received via mailing list)
Geoff Jacobsen wrote:
>>
>
>  require 'csv'
>  CSV::parse_line(s)
>  => []
>
> but Shellwords can:
>
>  require 'shellwords'
>  Shellwords.shellwords(s)
>  => ["one", "two  with quotes", "three "]

Another option is to use scan instead of split:

>> 'some words "some quoted text" some more words'.scan %r{"(?:(?:[^"]|\\.)*)"|\S+}
=> ["some", "words", "\"some quoted text\"", "some", "more", "words"]

With some additional effort even the quotes can be removed (using grouping,
for example).

>> r=[];'some words "some quoted text" some more words'.scan(%r{"((?:[^"]|\\.)*)"|(\S+)}) {|m| r << m.detect {|x|x}};r
=> ["some", "words", "some quoted text", "some", "more", "words"]

Kind regards

    robert
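
A sketch of how the \\. branch could be made to handle backslash-escaped quotes. As far as I can tell, in the pattern above [^"] also matches a backslash, so the \\. alternative is never reached and a quoted run ends at the first \" it meets. Excluding the backslash from the character class avoids that. The names below are made up, and unescaping with gsub is an assumption about the desired output.

```ruby
# Hypothetical pattern, not from the thread: inside quotes, accept either a
# character that is neither a quote nor a backslash, or a backslash followed
# by any character.
QUOTED_OR_BARE = /"((?:[^"\\]|\\.)*)"|(\S+)/

# Hypothetical helper: returns words, with backslash escapes removed from
# quoted runs.
def split_with_escapes(str)
  str.scan(QUOTED_OR_BARE).map do |quoted, bare|
    quoted ? quoted.gsub(/\\(.)/, '\1') : bare
  end
end

p split_with_escapes(%q{say "a \"nested\" quote" done})
# => ["say", "a \"nested\" quote", "done"]
```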
William James (Guest)
on 2006-01-09 20:23
(Received via mailing list)
Geoff Jacobsen wrote:
> >
>
>  require 'csv'
>  CSV::parse_line(s)
>  => []
>
> but Shellwords can:
>
>  require 'shellwords'
>  Shellwords.shellwords(s)
>  => ["one", "two  with quotes", "three "]

This is not a "more complex case"; it is an invalid case.
The original poster simply wanted to avoid splitting on spaces
within double quotes, not within single quotes.

The shellwords "solution" is a solution to a different problem, not
to this one.  It can't even handle a simple case:

require 'shellwords'
s = "why can't you think?"
Shellwords.shellwords(s)

ArgumentError: Unmatched single quote: 't you think?
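
If shell-style splitting is wanted where possible but stray apostrophes must not abort the whole operation, one option is to rescue the ArgumentError and fall back to a double-quote-only scan. This is a sketch of that design choice, not something proposed in the thread; shell_or_plain_split is a made-up name.

```ruby
require 'shellwords'

# Hypothetical helper: try shell rules first, and fall back to splitting on
# whitespace outside double quotes when Shellwords rejects the input.
def shell_or_plain_split(str)
  Shellwords.split(str)
rescue ArgumentError
  str.scan(/"([^"]*)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

p shell_or_plain_split("why can't you think?")
# => ["why", "can't", "you", "think?"]
```

The two code paths follow different quoting rules, so results are only consistent for input that both would accept.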
Geoff Jacobsen (Guest)
on 2006-01-10 02:50
(Received via mailing list)
On Tue, 2006-01-10 at 04:23 +0900, William James wrote:
> > > > ["some", "words", "some quoted text", "some", "more", "words"]
> > >
> > > s = 'some words "some quoted text" some more words'
> > > p s.split( / *"(.*?)" *| / )
> >
> > Which along with the CSV solution can't handle complex cases:
> >
> >  s='one two"  "\'with quotes\' "three "'
> >
> >  s.split( / *"(.*?)" *| / )
> >  => ["one", "two", "  ", "'with", "quotes'", "three "]
...
> The shellwords "solution" is a solution to a different problem, not
> to this one.  It can't even handle a simple case:
>
> require 'shellwords'
> s = "why can't you think?"
> Shellwords.shellwords(s)
>
> ArgumentError: Unmatched single quote: 't you think?
>

I agree my example doesn't match the originator's request, but *I think*
there is enough ambiguity in the post to postulate that they may want
more real-world cases such as:

s = %q{symbol "William said: \"why can't you think?\"" 123 "<xml>foo</xml>"}
Shellwords.shellwords(s)

=> ["symbol", "William said: \"why can't you think?\"", "123", "<xml>foo</xml>"]

So Shellwords may indeed be a solution to this problem but the problem
is not stated precisely enough to know.