Regex for splitting string


#1

Hi

We have a search website where the user can type in individual words
separated by spaces and/or phrases enclosed in single or double quotes.
We are looking for a way to obtain a list of words and phrases from the
search string.
Can someone help?

Thanks,
Yash


#2

Yash wrote:

Hi

We have a search website where the user can type in individual words
separated by spaces and/or phrases enclosed in single or double quotes.
We are looking for a way to obtain a list of words and phrases from the
search string.
Can someone help?

string.scan(/\w+/)
to give an array of words, or
string.split(/\W+/)
to split on non-words. You can string.gsub(/["']/, '') if you want to
get rid of quotes.


#3

If the input string is:
Java Ruby 'Ruby on rails' "software development" "technology"

The list of words should be:
Java
Ruby
Ruby on rails
software development
technology

With your approach, the result will be:
Java
Ruby
Ruby
on
rails
software
development
technology
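
This is easy to confirm in irb. A minimal check (the variable names are only illustrative): both suggestions treat the quote characters as ordinary separators, so the phrases get broken up.

```ruby
query = %q{Java Ruby 'Ruby on rails' "software development" "technology"}

# scan collects every run of word characters, ignoring the quotes.
scanned = query.scan(/\w+/)
# => ["Java", "Ruby", "Ruby", "on", "rails", "software", "development", "technology"]

# split on runs of non-word characters gives the same flat list here.
pieces = query.split(/\W+/)
# => ["Java", "Ruby", "Ruby", "on", "rails", "software", "development", "technology"]
```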

Alex Y. wrote:

Yash wrote:

Hi

We have a search website where the user can type in individual words
separated by spaces and/or phrases enclosed in single or double quotes.
We are looking for a way to obtain a list of words and phrases from the
search string.
Can someone help?

string.scan(/\w+/)
to give an array of words, or
string.split(/\W+/)
to split on non-words. You can string.gsub(/["']/, '') if you want to
get rid of quotes.


#4

I’m not enough of a regex guru to do it in one, so I’d probably do it in
two:

/"([^"]+)"|('[^']+)'/ to grab the quoted phrases … replace occurrences with ''
in the original string …

Then use the \w|\W from below to get individual tokens … at some point
cleaning the original string of anything you consider garbage.
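
A rough sketch of that two-pass idea, in case it helps. The constant and method names are made up for illustration, and the single-quote group here opens after the quote character so only the phrase text is captured. Note that it returns the loose words first and the phrases after them, so the original order of the query is not preserved in general.

```ruby
# Matches a double-quoted or single-quoted phrase; each capture group
# holds only the text between the quotes.
QUOTED = /"([^"]+)"|'([^']+)'/

def extract_terms(query)
  # Pass 1: collect the text inside each pair of quotes.
  phrases = query.scan(QUOTED).map { |double, single| double || single }
  # Pass 2: blank out the quoted parts, then take the remaining words.
  words = query.gsub(QUOTED, ' ').scan(/\w+/)
  words + phrases
end

extract_terms(%q{Java Ruby 'Ruby on rails' "software development" "technology"})
# => ["Java", "Ruby", "Ruby on rails", "software development", "technology"]
```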


#5

Yash wrote:

We have a search website where the user can type in individual words
separated by spaces and/or phrases enclosed in single or double quotes.
We are looking for a way to obtain a list of words and phrases from the
search string.

If the input string is:
Java Ruby 'Ruby on rails' "software development" "technology"

The list of words should be:
Java
Ruby
Ruby on rails
software development
technology

example = 'some text and \'some inside\' test "double quotes"'

Using the CSV module:

require 'csv'
CSV::parse_line(example, ' ')
=> ["some", "text", "and", "'some", "inside'", "test", "double quotes"]

Fairly elegant, but doesn’t handle single quotes like you want

or

example.split( / *"’["’] *| / )
=> ["some", "text", "and", "some inside", "test", "double quotes"]

Which seems to be more like you want.
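
For comparison, the same result can also be reached in one pass with String#scan instead of split. This is only a sketch under my own assumptions (the `tokenize` name is hypothetical, and unbalanced quotes are simply treated as part of a word), not a tested replacement for the expression above.

```ruby
# One alternation per token type: "double quoted", 'single quoted',
# or any run of non-space characters.
def tokenize(query)
  query.scan(/"([^"]*)"|'([^']*)'|(\S+)/).map { |groups| groups.compact.first }
end

example = %q{some text and 'some inside' test "double quotes"}
tokenize(example)
# => ["some", "text", "and", "some inside", "test", "double quotes"]
```

Because scan walks the string left to right, the tokens come back in the order the user typed them.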

Hope that helps.


#6

Oops, I seem to be capturing the opening single quote … the group should
open after it:

/"([^"]+)"|'([^']+)'/
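
The difference shows up in what the second capture group returns. A small check, with illustrative variable names:

```ruby
with_quote    = /"([^"]+)"|('[^']+)'/   # group opens before the single quote
without_quote = /"([^"]+)"|'([^']+)'/   # group opens after the single quote

sample = "'Ruby on rails'"

captured_with = sample.scan(with_quote)
# => [[nil, "'Ruby on rails"]]

captured_without = sample.scan(without_quote)
# => [[nil, "Ruby on rails"]]
```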


#7

On 4/5/06, James L. removed_email_address@domain.invalid wrote:

If you don’t mind having some array elements that are all whitespace
you can drop the “find_all” part.

Of course, the second I hit “send” I realized that I pasted the wrong
regex.

a = s.split(/(".*?")|('.*?')|((?=[^"'])\w+(?=[^"']))/).find_all {|x|
x.match(/\w+/)}
=> ["Java", "Ruby", "'Ruby on rails'", "\"software development\"",
"\"technology\""]
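
The phrases still carry their quote characters at this point. If those are not wanted, one possible follow-up step is to strip a leading and trailing quote from each element. A sketch, assuming the non-greedy form of the pattern (`.*?` between the quotes); `tokens` and `terms` are just illustrative names:

```ruby
s = %q{Java Ruby 'Ruby on rails' "software development" "technology"}

tokens = s.split(/(".*?")|('.*?')|((?=[^"'])\w+(?=[^"']))/)
          .find_all { |x| x.match(/\w+/) }

# Remove one leading and one trailing quote character from each token.
terms = tokens.map { |x| x.gsub(/\A["']|["']\z/, '') }
# => ["Java", "Ruby", "Ruby on rails", "software development", "technology"]
```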

– James


#8

On 4/5/06, Yash removed_email_address@domain.invalid wrote:

If the input string is:
Java Ruby 'Ruby on rails' "software development" "technology"

The list of words should be:
Java
Ruby
Ruby on rails
software development
technology

irb(main):086:0> s = 'Java Ruby \'Ruby on rails\' "software development" "technology"'
=> "Java Ruby 'Ruby on rails' \"software development\" \"technology\""

irb(main):087:0> a =
s.split(/(".*")|('.*')|((?=[^"'])\w+(?=[^"']))/).find_all {|s|
s.match(/\w+/)}
=> ["Java", "Ruby", "'Ruby on rails'", "\"software development\"
\"technology\""]

If you don’t mind having some array elements that are all whitespace
you can drop the “find_all” part.
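
Note that the last element of the result above holds both double-quoted phrases as a single string. That is what a greedy `.*` between quote characters tends to produce when a line contains more than one quoted phrase: it runs to the last closing quote. A small illustration (the variable names are only for the example):

```ruby
text = '"software development" "technology"'

# Greedy: .* extends as far as possible, up to the final quote.
greedy = text.scan(/".*"/)
# => ["\"software development\" \"technology\""]

# Non-greedy: .*? stops at the first closing quote it can.
lazy = text.scan(/".*?"/)
# => ["\"software development\"", "\"technology\""]
```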

– James