String#split and groups in the field separator RE

Is this expected behaviour? I haven’t seen anything related to this
mentioned in the API docs…

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

mortee

mortee wrote:

Is this expected behaviour? I haven’t seen anything related to this
mentioned in the API docs…

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

It was unexpected behavior for me too when I ran into it using Python’s
regex split() function a few months ago. Since it works the same way in
both languages, I would guess it might be a common regex convention.

I guess I should mention that the rule I jotted down in the margin of my
book is: if the split() pattern has parenthesized subgroups, the
result array will include the match for each subgroup, but not the whole
match.

Applying that rule to your examples:

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?

The subgroup (:) matches a single colon, so those matches are included
in the results.

irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???

The subgroup (:) matches one colon and those results are included. The
subgroup ((:)+) matches two, three, and four colons as it traverses the
string and those results are included. Because groups are numbered by
their leftmost parentheses, the outer grouping comes first in the list.

irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

The subgroup (:+) matches two, three, and four colons as it traverses
the string, and those matches are included in the results.
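
For anyone who wants to check the rule, here is a minimal sketch that reruns the examples from this thread side by side. The variable names are mine, not from the original posts. One detail it makes visible: when a group is repeated, as in (:)+, only the last repetition of that group ends up in the result, which is why a single colon appears per separator.

```ruby
s = 'a::b:::c::::d'

# No groups: only the fields between separators are returned.
plain = s.split(/:+/)

# Repeated group: the group holds only its last repetition,
# so each run of colons contributes a single ":".
repeated = s.split(/(:)+/)

# Group around the whole repetition: the full run of colons is captured.
whole = s.split(/(:+)/)

# Nested groups: outer group first (numbered by opening parenthesis),
# then the inner group.
nested = s.split(/((:)+)/)

p plain     # ["a", "b", "c", "d"]
p repeated  # ["a", ":", "b", ":", "c", ":", "d"]
p whole     # ["a", "::", "b", ":::", "c", "::::", "d"]
p nested    # ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"]
```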

And here is an example of my own showing that the whole match is not
included in the results; only the parenthesized subgroups are
included:

str = 'a_::_b_:::_c_::::_d'
pattern = /_(:+)_/

results = str.split(pattern)
p results

--output:--
["a", "::", "b", ":::", "c", "::::", "d"]

Is this expected behaviour? I haven’t seen anything related to this
mentioned in the API docs…

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing them into the split
array, use a non-capturing group, i.e. /(?=,)/ rather than /(,)/

It is curious that it’s not in the api doc… I must have learnt it from
somewhere…

Dan.

mortee wrote:

Is this expected behaviour? I haven’t seen anything related to this
mentioned in the API docs…

$ ri String#split


if pattern is a +Regexp+, str is divided where the pattern
matches. Whenever the pattern matches a zero-length string, str
is split into individual characters.

pickaxe2, p. 619 adds a line to the end of that description:


If pattern includes groups, these groups will be included in the
returned values.

7stud -- wrote:

The subgroup (:) matches a single colon, so those matches are included
in the results,
[…]

Thanks, that clarifies it, and the results make sense based on the rule.
However, I find it quite confusing to have parts of what I intend to be
part of the “separator” among the list of results. To say the least.

mortee

Daniel S. wrote:

Is this expected behaviour? I haven’t seen anything related to this
mentioned in the API docs…

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing them into the split
array, use a non-capturing group, i.e. /(?=,)/ rather than /(,)/

/(?=,)/ is a lookahead assertion, not a non-capturing group. I’m sure
you really meant /(?:,)/.
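
To make the difference concrete, here is a small sketch comparing the three forms. The comma-separated sample string and the variable names are my own assumptions for illustration; the last line applies the non-capturing form to the string from the original question.

```ruby
csv = 'a,b,c'

# Capturing group: the separators show up in the result.
capturing = csv.split(/(,)/)

# Non-capturing group (?:...): groups for structure only, nothing extra returned.
non_capturing = csv.split(/(?:,)/)

# Lookahead (?=...): matches the empty string *before* each comma,
# so the commas are not consumed and stay attached to the next field.
lookahead = csv.split(/(?=,)/)

# The repeated-colon case from the original question, without captures.
colons = 'a::b:::c::::d'.split(/(?::)+/)

p capturing      # ["a", ",", "b", ",", "c"]
p non_capturing  # ["a", "b", "c"]
p lookahead      # ["a", ",b", ",c"]
p colons         # ["a", "b", "c", "d"]
```

So the lookahead version does split the string, but not in the way a plain separator would.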