Regular expressions help

vivek · July 12, 2008, 8:25pm

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " "

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world, hey, yo]

I tried to somehow do a collect, but I am not sure if there is a way to
retain a variable between two invocations and then concatenate the
pieces and return them as one string.
Of course, if there is a smart way to do it in one shot using a regex,
then I can do a scan on the string.

vivek · July 12, 2008, 9:21pm

‘hi hello “hello world” hey yo’

should return
[hi, hello, hello world,hey,yo]

'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

Sorry I couldn’t find a more verbose way. Maybe there is one!

vivek · July 12, 2008, 11:32pm

phlip wrote:

‘hi hello “hello world” hey yo’

should return
[hi, hello, hello world,hey,yo]

'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

But this returns “hello world” as two entries, not one as required.

vivek · July 12, 2008, 11:40pm

should return
[hi, hello, hello world,hey,yo]

But this returns “hello world” as two entries, not one as required.

The “should return” clause is not well-formed anyway…

vivek · July 12, 2008, 11:39pm

On Sun, Jul 13, 2008 at 3:20 AM, Vivek wrote:

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " "

‘hi hello “hello world” hey yo’

should return
[hi, hello, hello world,hey,yo]

require 'shellwords'
include Shellwords

str = 'hi hello "hello world" hey yo'

p shellwords(str)

Harry
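
For later readers, here is a minimal sketch of the same shellwords approach, calling the module-level Shellwords.split instead of mixing the module in with include. The sample string is the one from the question; the variable names are only illustrative.

```ruby
require 'shellwords'

str = 'hi hello "hello world" hey yo'

# Shellwords.split tokenizes the way a POSIX shell would:
# whitespace separates words, and quotes group a phrase into one word.
words = Shellwords.split(str)
p words  # => ["hi", "hello", "hello world", "hey", "yo"]

# String#shellsplit is a shortcut for the same call.
p str.shellsplit == words  # => true
```

One thing to keep in mind: shell quoting rules apply to the whole string, so an unmatched quote outside a quoted phrase (for example a bare apostrophe) raises ArgumentError instead of being treated as an ordinary character.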

vivek · July 12, 2008, 11:53pm

Hi –

On Sun, 13 Jul 2008, phlip wrote:

should return
[hi, hello, hello world,hey,yo]

But this returns “hello world” as two entries, not one as required.

The “should return” clause is not well-formed anyway…

On the (usually misappropriated, but hopefully not here) Occam’s Razor
principle[1], I would refrain from positing that there’s actually
supposed to be a comma between the second “hello” and “world”, or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he’s now got just about every permutation
to choose from (including shellwords, thanks to Harry, and that of
course is the best; or at least, if Occam is right, then Harry is
right).

David

[1] http://pespmc1.vub.ac.be/occamraz.html (yes, it’s still “Link To
Something Other Than Wikipedia!” Week [barely])

vivek · July 13, 2008, 8:21am

Hi David and others,

On the (usually misappropriated, but hopefully not here) Occam’s Razor
principle[1], I would refrain from positing that there’s actually
supposed to be a comma between the second “hello” and “world”, or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he’s now got just about every permutation
to choose from (including shellwords, thanks to Harry, and that of
course is the best; or at least, if Occam is right, then Harry is
right).

Thanks for the replies. Indeed, I don’t want the quotes to be part
of the string.
The one suggested above works for me:

irb(main):028:0> s
=> "hi there \"hello world\" namaste \"yo man\" \"gutten morgen\" ola \"what's up\" world"
irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]

I presume that should capture pretty much any kind of combination,
and I don’t have the case where there are nested " so that looks good
(unless someone can think of a case that breaks it).
Thanks so much… I had hit a dead end trying to do this!!

Vivek K.
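
A small sketch of what that one-liner does step by step. It also shows a hypothetical variant that swaps \w+ for \S+, so that unquoted words containing punctuation stay in one piece, which is closer to the "non-whitespace" wording of the original question. The variant and the sample string are assumptions added for illustration, not something proposed in the thread.

```ruby
s = %q{hi "hello world" what's up}

# With two capture groups, scan yields one [quoted, bare] pair per match;
# the group that did not take part in the match is nil.
pairs = s.scan(/"([^"]+)"|(\w+)/)
p pairs
# => [[nil, "hi"], ["hello world", nil], [nil, "what"], [nil, "s"], [nil, "up"]]

# flatten merges the pairs into one array, compact drops the nils.
bare_words = pairs.flatten.compact
p bare_words
# => ["hi", "hello world", "what", "s", "up"]

# Hypothetical variant: \S+ keeps any run of non-whitespace together.
nonspace_words = s.scan(/"([^"]*)"|(\S+)/).flatten.compact
p nonspace_words
# => ["hi", "hello world", "what's", "up"]
```

The quoted alternative has to come first in the pattern; otherwise \S+ would swallow the opening quotation mark along with the first word of the phrase.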

vivek · July 13, 2008, 12:11pm

Hi –

On Sun, 13 Jul 2008, Vivek wrote:

irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]

I presume that should capture pretty much any kind of combination…
and I don’t have the case where there are nested " so that looks good.
(unless someone can think of a case that breaks )
thanks so much…I had hit a dead end trying to do this!!

Don’t forget the shellwords library though – a very convenient way to
do this.

David

vivek · July 13, 2008, 12:53pm

On Sun 13 Jul 2008 11:06:23, David A. Black wrote:

I presume that should capture pretty much any kind of combination…
and I don’t have the case where there are nested " so that looks
good. (unless someone can think of a case that breaks )
thanks so much…I had hit a dead end trying to do this!!

Don’t forget the shellwords library though – a very convenient way
to do this.

David

Is there a link for these listed on the web?

vivek · July 13, 2008, 3:57pm

Axel wrote:

str = 'hi hello "hello world" hey yo'
str.gsub!( / " [^"]* " /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

 str = 'hi hello  "hello world" hey yo'
 p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

vivek · July 13, 2008, 4:03pm

str = 'hi hello  "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact
=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

Also, non-capturing groups help us remove the .flatten.compact nonsense:

 p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

I’m not sure why one version captured the "" marks and the other did
not…
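
For later readers, the behaviour follows from how String#scan reports matches: with no capture groups it returns each whole match, and with capture groups it returns only what the groups captured. A minimal sketch, using the sample string from the question and the non-greedy .*? (the third pattern, with the parentheses moved inside the quotation marks, is a variation added here for comparison):

```ruby
str = 'hi hello "hello world" hey yo'

# No capture groups: scan returns each whole match, quotation marks included.
whole = str.scan(/(?:".*?")|(?:\w+)/)
p whole  # => ["hi", "hello", "\"hello world\"", "hey", "yo"]

# Groups that wrap the quotation marks capture them as well.
outer = str.scan(/(".*?")|(\w+)/).flatten.compact
p outer  # => ["hi", "hello", "\"hello world\"", "hey", "yo"]

# Groups placed inside the quotation marks capture only the text between them.
inner = str.scan(/"(.*?)"|(\w+)/).flatten.compact
p inner  # => ["hi", "hello", "hello world", "hey", "yo"]
```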

vivek · July 13, 2008, 1:35pm

From: "humax"

On Sun 13 Jul 2008 11:06:23, David A. Black wrote:

Don’t forget the shellwords library though – a very convenient way
to do this.
Is there a link for these listed on the web?

require 'shellwords'

… should work in 1.8 and 1.9 ruby

vivek · July 13, 2008, 4:11pm

Hi –

On Sun, 13 Jul 2008, phlip wrote:

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

I’m not sure why one version captured the "" marks and the other did not…

They both did (See my previous post.)

David

vivek · July 13, 2008, 4:28pm

David A. Black wrote:

=> ["hi", "hello", "hello world", "hey", "yo"]

That’s not quite the result, though:

I suspect I copied the wrong line from my transcript!

But…

The "'s are returned as part of the string ‘“hello world”’. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

 str = 'hi hello "hello world" "hey yo"'
 p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\" \"hey yo\""] # bad

 p str.scan(/(?:".*?")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

(-:

str.scan(/"([^"]+)"|(\w+)/).flatten.compact

The non-greedy matcher .*? looks cuter.
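
To make the comparison concrete, here is a short sketch that runs the greedy, non-greedy, and negated-class patterns side by side on a string with two quoted phrases in a row. The capture groups sit inside the quotation marks so the results come back without them; that placement and the variable names are choices made for this illustration.

```ruby
str = 'hi hello "hello world" "hey yo"'

# Greedy .* runs on to the LAST quotation mark in the string.
greedy = str.scan(/"(.*)"|(\w+)/).flatten.compact
p greedy   # => ["hi", "hello", "hello world\" \"hey yo"]

# Non-greedy .*? stops at the first closing quotation mark.
lazy = str.scan(/"(.*?)"|(\w+)/).flatten.compact
p lazy     # => ["hi", "hello", "hello world", "hey yo"]

# A negated character class gives the same result on this input.
negated = str.scan(/"([^"]*)"|(\w+)/).flatten.compact
p negated  # => ["hi", "hello", "hello world", "hey yo"]
```

The two good versions are not identical in general: by default . does not match a newline, while [^"] does, so they can differ when a quoted phrase spans several lines.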

Of course this assumes no embedded/escaped/nested "'s, etc.

Using regexps as real language parsers makes certain baby deities cry…

vivek · July 13, 2008, 4:38pm

Hi –

On Sun, 13 Jul 2008, phlip wrote:

The "'s are returned as part of the string ‘“hello world”’. Also, you
=> ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

I don’t think the OP wanted the literal quotation marks as part of the
results, though. In other words you’d want the third string to be:

hello world

rather than

"hello world"

David

vivek · July 13, 2008, 4:52pm

From: "phlip"

p str.scan(/(?:".*")|(?:\w+)/)
=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

Probably want:

str.scan(/(?:"[^"]*")|(?:\w+)/)

…else the greediness will extend over multiple quoted
strings…

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vs.

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^

I’m not sure why one version captured the "" marks and the
other did not…

Strange… They both did, on my system…(?)

BTW, in ruby 1.9, we have lookbehind, so we can avoid picking
up the quotes, with:

str.scan(/(?:(?<=")[^"]*(?="))|(?:\w+)/)

Regards,

Bill
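
A quick sketch of the lookbehind version, assuming Ruby 1.9 or later. One caveat, which is an observation added here and not something raised in the thread: because the quotation marks are never consumed, the gap between two adjacent quoted phrases also sits between two quotation marks, so it gets matched as well. The second input string is made up to show that case.

```ruby
pattern = /(?:(?<=")[^"]*(?="))|(?:\w+)/

# A single quoted phrase comes back without its quotation marks.
one = 'hi hello "hello world" hey yo'.scan(pattern)
p one  # => ["hi", "hello", "hello world", "hey", "yo"]

# Two quoted phrases in a row: the space between them is also
# preceded and followed by a quotation mark, so it matches too.
two = 'a "b c" "d e"'.scan(pattern)
p two  # => ["a", "b c", " ", "d e"]
```

If adjacent quoted phrases can occur, a pattern that consumes the quotation marks and captures the text between them avoids the stray match.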

vivek · July 13, 2008, 4:09pm

Hi –

On Sun, 13 Jul 2008, phlip wrote:

=> ["hi", "hello", "hello world", "hey", "yo"]
That’s not quite the result, though:

str = 'hi hello "hello world" hey yo'
=> "hi hello \"hello world\" hey yo"

str.scan(/(".*")|(\w+)/).flatten.compact
=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

The "'s are returned as part of the string ‘“hello world”’. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

str = 'one "two" "three" four'
=> "one \"two\" \"three\" four"

str.scan(/(".*")|(\w+)/).flatten.compact
=> ["one", "\"two\" \"three\"", "four"] # only three strings

Try this:

str.scan(/"([^"]+)"|(\w+)/).flatten.compact

Of course this assumes no embedded/escaped/nested "'s, etc.

David