Ruby Regular expression

Hi,
I am reading a file and want to extract all quotations enclosed in “” or ‘’.
I was using this regular expression:

/\w+('|")(\w+)('|")/

For example, given

Mr. Ayush said “we need to change ourselves to change the world”

the result would be

we need to change ourselves to change the world

But there is a loophole: this pattern also matches a mismatched pair, such as an opening " closed by a ’.
Can anyone help, so that I extract only text inside a matching “” or ‘’ pair?

How about this?
http://www.rubular.com/r/CRRsiTHMkG
/("[\w ']+"|'[\w ]+')/

You can remove the quotes at either end post-match if that’s a
requirement as well.
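
As a rough illustration of stripping the quotes after the match, here is a minimal sketch. It assumes the file uses straight quotes, and the constant QUOTED, the helper quoted_passages and the sample string are names made up for this example. Note that [\w ] only allows word characters and spaces, so a quotation containing punctuation would be skipped.

```ruby
# Alternation pattern from the post above, written with straight quotes:
# either "..." (may contain apostrophes) or '...'.
QUOTED = /("[\w ']+"|'[\w ]+')/

# Hypothetical helper: return each quoted passage without its delimiters.
def quoted_passages(text)
  # scan yields one-element arrays (one capture group), so flatten them,
  # then drop the first and last character of every match.
  text.scan(QUOTED).flatten.map { |match| match[1..-2] }
end

sample = %q(Mr. Ayush said "we need to change ourselves to change the world")
p quoted_passages(sample)
```

Because each alternative spells out its own opening and closing character, a mismatched pair such as "abc' is not matched.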

Doing this is tricky: the robustness of a regexp approach depends on what you can assume about the input. For example, in a programming language an escaped quote (\") would be valid but is not supported by the pattern below, and in English text apostrophes could be taken as single quotes.

A regexp solution that is broken in those scenarios but works for the easy cases is:

("|')((?:(?!\1).)*)\1

The regexp says: if you match either " or ', then continue matching as long as you do not find the matched quote, and until you find the closing quote (needed because you could reach end of file with an unbalanced quote).

The second group has the string without quotes.

Whether this is going to work well for your input is something you have to evaluate.
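
A small sketch of how that pattern could be used with String#scan. The constant name QUOTE_RE and the sample text are invented for illustration. Since . does not match a newline by default, a quotation that spans several lines would need the /m flag.

```ruby
# Pattern from the post above: an opening quote, then any characters that
# are not that same quote, then the matching closing quote.
QUOTE_RE = /("|')((?:(?!\1).)*)\1/

text = %q(He said "it's fine" and then 'ok "sure"' but not "this')

# scan returns [quote_char, contents] pairs; keep only the contents.
contents = text.scan(QUOTE_RE).map(&:last)
p contents
```

The apostrophe inside the double-quoted passage and the double quotes inside the single-quoted passage are kept as content, and the trailing unbalanced "this' is not matched.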

On Wed, Dec 11, 2013 at 10:58 AM, Xavier N. [email protected] wrote:

The regexp says: if you match either " or ', then continue matching as long
as you do not find the matched quote, and until you find the closing quote
(needed because you could reach end of file with an unbalanced quote).

The second group has the string without quotes.

Interesting solution! I also tried

("|')([^\1]*)\1

which looked fine initially

irb(main):025:0> "foo 'bar' \"baz\" buz".scan(/("|')([^\1]*)\1/).map(&:last)
=> ["bar", "baz"]

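but broke later (inside a character class, \1 is not treated as a backreference, so [^\1]* is greedy and runs up to the last matching quote):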

irb(main):030:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')([^\1]*)\1/)
=> [["'", "bar' \"baz\" buz \"bongo"]]

where your solution still works:

irb(main):031:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')((?:(?!\1).)*)\1/)
=> [["'", "bar"], ["\"", "baz"], ["\"", "bongo's kongo"]]

However, we can also use non greediness to achieve the same:

irb(main):032:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')(.*?)\1/)
=> [["'", "bar"], ["\"", "baz"], ["\"", "bongo's kongo"]]
irb(main):033:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')(.*?)\1/).map(&:last)
=> ["bar", "baz", "bongo's kongo"]
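
To make the role of the non-greedy quantifier concrete, here is a short comparison sketch. The sample text and variable names are made up for this example. With a greedy .* the match runs to the last quote of the same kind, while .*? stops at the first one.

```ruby
text = %q(say "one" then "two")

# Greedy: .* grabs as much as possible, so the closing \1 is the LAST
# double quote in the string.
greedy = text.scan(/("|')(.*)\1/).map(&:last)

# Non-greedy: .*? grabs as little as possible, so the closing \1 is the
# NEXT double quote.
lazy = text.scan(/("|')(.*?)\1/).map(&:last)

p greedy
p lazy
```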

Adding some escaping capabilities we get ("|')((?:\\.|(?!\1).)*)\1

irb(main):038:0> "foo 'bar' \"baz\" buz \"bongo's kongo\" gingo said \"foo \\\" bar\" yes".scan(/("|')((?:\\.|(?!\1).)*)\1/).map(&:last)
=> ["bar", "baz", "bongo's kongo", "foo \\\" bar"]

:wink:
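
The captured text still contains the backslashes. If they should be dropped afterwards, one possible follow-up step is sketched below. It assumes that a backslash always escapes exactly the next character, and ESCAPED_QUOTE_RE, raw and unescaped are names invented for this example.

```ruby
# Escape-aware pattern from above: inside the quotes, a backslash followed
# by any character is consumed as a unit, so \" does not end the match.
ESCAPED_QUOTE_RE = /("|')((?:\\.|(?!\1).)*)\1/

# %q keeps the backslash, so the text contains: "foo \" bar"
text = %q(gingo said "foo \" bar" yes)

raw = text.scan(ESCAPED_QUOTE_RE).map(&:last)

# Assumed unescaping rule: replace each backslash-plus-character pair
# with the character alone.
unescaped = raw.map { |s| s.gsub(/\\(.)/) { $1 } }

p raw
p unescaped
```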

Kind regards

robert
