How to split(//) with respect to digraphs?

elgato · August 3, 2006, 1:04pm

And once more question:

In Czech, c followed by h is treated (for sorting etc.) as a single
character/grapheme, ch. I need to split a string into single characters
while respecting this absurd convention.

In Perl I can write

split /(?<=(?![Cc][Hh]).)/, $string

and it works fine.

Unfortunately, Ruby does not implement/support this “zero-width positive
look-behind assertion”, so the question is how can one efficiently split
the string in Ruby?

Thanks,

P.

elgato · August 3, 2006, 1:05pm

Pavel S. wrote:

and it works fine.

Unfortunately, Ruby does not implement/support this “zero-width
positive look-behind assertion”, so the question is how can one
efficiently split the string in Ruby?

Thanks,

P.

Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

-Justin

elgato · August 3, 2006, 1:08pm

On 02/08/06, Justin C. wrote:

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

Or use scan:

str.scan(/(?:ch)|./i)

You might still have a problem with other characters, though,
depending on the encoding and normalisation.

Paul.
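
To make the encoding and normalisation caveat concrete, here is a minimal sketch, assuming a Ruby recent enough to have String#unicode_normalize (2.2 or later) and UTF-8 input. The helper name czech_letters is made up for illustration and is not from this thread.

```ruby
# Hypothetical helper: split a string into Czech "letters",
# treating ch/Ch/cH/CH as a single unit.
def czech_letters(str)
  # NFC composes "e" + U+0301 (combining acute) into the single code
  # point "é", so "." sees one character instead of two.
  # The /m flag lets "." match newlines as well.
  str.unicode_normalize(:nfc).scan(/ch|./im)
end

p czech_letters("Chata")
p czech_letters("chle\u0301b")  # decomposed input: e followed by U+0301
```

Without the normalisation step, a decomposed accent would come back as a separate element after its base letter.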

elgato · August 3, 2006, 1:10pm

Pavel S. writes:

And once more question:

In Czech, c followed by h is treated (for sorting etc.) as a single
character/grapheme, ch. I need to split a string into single characters
while respecting this absurd convention.

In Perl I can write

split /(?<=(?![Cc][Hh]).)/, $string

string.scan(/ch|./i)

elgato · August 3, 2006, 1:11pm

Paul B. wrote:

Or use scan:

str.scan(/(?:ch)|./i)

Yes, scan occurred to me in the meantime too. Why the (?:)?
str.scan(/ch|./i) does exactly the same, doesn’t it?

Thank you,

P.

elgato · August 3, 2006, 1:12pm

On 02/08/06, Pavel S. wrote:

Yes, scan occurred to me in the meantime too. Why the (?:)?
str.scan(/ch|./i) does exactly the same, doesn’t it?

Yeah, there’s no need for the (?: … ). I started off thinking it was
more complicated than it was, and forgot to take that out. I really
need a regexp refactoring tool.

Paul.
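
For anyone wondering when the grouping syntax does matter with scan: a small sketch, not taken from the thread, comparing the three spellings. With no group or a non-capturing group, scan returns the matched strings; with a capturing group it returns one array of captures per match instead.

```ruby
word = "czech"

plain         = word.scan(/ch|./i)      # no group at all
non_capturing = word.scan(/(?:ch)|./i)  # (?: ... ) groups without capturing
capturing     = word.scan(/(ch)|./i)    # ( ... ) changes what scan returns

p plain          # => ["c", "z", "e", "ch"]
p non_capturing  # => ["c", "z", "e", "ch"]
p capturing      # => [[nil], [nil], [nil], ["ch"]]
```

So dropping (?: ) is harmless here, but swapping it for plain parentheses would not be.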

elgato · August 3, 2006, 1:12pm

Justin C. wrote:

Pavel S. wrote:

And once more question:

one more

Unfortunately, Ruby does not implement/support this “zero-width
positive look-behind assertion”, so the question is how can one
efficiently split the string in Ruby?

Stupid question. One should not insist on word-for-word translation
when rewriting some code from Perl to Ruby.

The solution can be e.g. scan(/[cC][hH]|./)

irb(main):001:0> "cHeck czeCh".scan(/[cC][hH]|./)
=> ["cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

The scan version is slightly better, as it never returns the empty
string. Thanks anyway, of course.
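
A short side-by-side of the two approaches, reusing the sample string from the irb sessions above. This is only a sketch of the difference in question: split emits a leading empty string when the input starts with the digraph, while scan does not.

```ruby
text = "cHeck czeCh"

# split: the capture group keeps the digraph in the result, but a match
# at the very start of the string leaves an empty first field.
by_split = text.split(/([Cc][Hh])|/)

# scan: collect the digraph or any single character; no empty fields.
by_scan = text.scan(/[Cc][Hh]|./)

p by_split
p by_scan
p by_split.reject(&:empty?) == by_scan
```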

But where is this feature of split documented?
http://www.rubycentral.com/ref/ref_c_string.html#split does not mention
that split returns not only the delimited substrings, but also the
groups captured by each match of the regexp.

Regards,

P.

elgato · August 3, 2006, 1:13pm

Pavel S. wrote:

But where is this feature of split documented?
http://www.rubycentral.com/ref/ref_c_string.html#split does not
mention that split returns not only the delimited substrings, but also
the groups captured by each match of the regexp.

Regards,

P.

As far as I can see, it’s not in the documentation. I found it by
accident. But, yes, the scan method is better.

-Justin

elgato · August 3, 2006, 2:23pm

On Aug 2, 2006, at 12:21, Justin C. wrote:

As far as I can see, it’s not in the documentation. I found it by
accident. But, yes, the scan method is better.

Oh, my gosh. If only you’d posted this little tidbit two days ago, I’d
have saved a couple hours of code-wrangling.

For sorting purposes, I needed to turn something like
[email protected]
into
[email protected]

I started with str.split(/[.]|@/), but then I’d lose where the @ went.
I tried turning it into
["one-and", ".", "two", "@", "three", ".", "net"]
so I could .reverse that, but without positive look-behind, I couldn’t
find any way to detect the break after the dot except with \w, which
would also trigger after the hyphen.

After hours of work, I ended up with something that was not only long
and confusing, involving .collect and an inner search loop and other
stuff, but when I brought it back up to check it for this email
message, I discovered that it didn’t even actually work correctly.

And all along, all I needed to do was change
str.split(/[.]|@/).reverse.join
into
str.split(/([.]|@)/).reverse.join

Dang. And thanks!
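
A runnable sketch of that one-character fix. The address in the post is obscured in this archive, so the sample string below is an assumption, assembled from the token list quoted above.

```ruby
# Assumed sample input, rebuilt from the tokens listed in the post.
address = "one-and.two@three.net"

# Without a capture group the delimiters are discarded.
without_group = address.split(/[.]|@/)

# With a capture group, split keeps each delimiter as its own element,
# so reversing and joining preserves where the "." and "@" belong.
with_group = address.split(/([.]|@)/)
sort_key   = with_group.reverse.join

p without_group
p with_group
p sort_key
```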

elgato · August 3, 2006, 2:23pm

On Aug 2, 2006, at 3:05 PM, Pavel S. wrote:

But where can one find this feature of the split in the
documentation? http://www.rubycentral.com/ref/ref_c_string.html#split
does not mention split returns not only delimited substrings, but
also successful groups from the match of the regexp.

In Dave T.’s Pickaxe book. Under String#split he writes:

“If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern includes groups, these groups will be
included in the returned values.”

Then he gives the following example:

"a@1bb@2ccc".split(/@(\d)/) => ["a", "1", "bb", "2", "ccc"]

Regards, Morton
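
To see the rule from the Pickaxe quote in action, here is the same example next to a variant of my own without the capturing group. Only the first line comes from the book; the second is added for contrast.

```ruby
str = "a@1bb@2ccc"

# Capturing group: the digit matched by (\d) is included in the result.
with_group = str.split(/@(\d)/)

# No capturing group: the whole delimiter, digit included, is discarded.
without_group = str.split(/@\d/)

p with_group     # => ["a", "1", "bb", "2", "ccc"]
p without_group  # => ["a", "bb", "ccc"]
```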