UTF-8 aware chop for 1.8?

okkezSS · November 3, 2010, 3:09pm

Hello,

Is there an easy way to chop (as in String#chop) a string that can
potentially contain UTF-8 in ruby 1.8? Or should I roll my own?

Thanks,
Ammar

ammar · November 3, 2010, 4:39pm

Ended up making my own. Posting it here for the benefit of others, and
maybe some feedback.

UTF-8 aware string chop · GitHub

Regards,
Ammar

ammar · November 3, 2010, 4:58pm

On Nov 3, 2010, at 9:08 AM, Ammar A. wrote:

Is there an easy way to chop (as in String#chop) a string that can
potentially contain UTF-8 in ruby 1.8? Or should I roll my own?

Well, it should be this simple:

str.gsub(/.\z/mu, "")

James Edward G. II

ammar · November 3, 2010, 5:33pm

On Wed, Nov 3, 2010 at 5:57 PM, James Edward G. II
wrote:

Well, it should be this simple:

str.gsub(/.\z/mu, "")

On Wed, Nov 3, 2010 at 6:04 PM, Adam P.
wrote:

s.gsub(/^(.+)./u) { $1 }
=> "one two thre"

Beautiful. Thank you both.

It was a good exercise for me, so I don’t necessarily feel that I
wasted 30 minutes of my life.

By the way, the m option seems superfluous in James’ version. I get
the same results without it.

Thanks again,
Ammar

ammar · November 3, 2010, 5:40pm

On Nov 3, 2010, at 11:33 AM, Ammar A. wrote:

Beautiful. Thank you both.

It was a good exercise for me, so I don’t necessarily feel that I
wasted 30 minutes of my life.

By the way, the m option seems superfluous in James’ version. I get
the same results without it.

It’s not:

"\n".sub(/.\z/u, "")
=> "\n"

"\n".sub(/.\z/mu, "")
=> ""

Using gsub() over sub() was a dumb mistake on my part though. sub() is
all you need, since it can only match once.

James Edward G. II
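
To make the effect of the m option concrete, here is a small sketch that puts the two patterns side by side. It assumes Ruby 1.9 or later, where string literals accept \u escapes; on 1.8 you would write the UTF-8 bytes directly and rely on the u flag. The variable names are only for illustration.

```ruby
# /u makes the pattern UTF-8 aware; /m additionally lets "." match "\n".
with_m    = /.\z/mu
without_m = /.\z/u

word = "caf\u00e9" # "cafe" with an accented e, which is two bytes in UTF-8

# The whole two-byte character is removed, not just its last byte.
puts word.sub(with_m, "")

# Without /m, "." cannot match the trailing newline, so nothing is removed.
puts "abc\n".sub(without_m, "").inspect

# With /m, the trailing newline counts as the last character and is chopped.
puts "abc\n".sub(with_m, "").inspect
```

Because the pattern is anchored with \z it can match at most once per string, which is why sub() is enough.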

ammar · November 3, 2010, 5:05pm

I was going to say

$KCODE="U"
=> "U"

s = "one two three"
=> "one two three"

s.gsub(/^(.+)./u) { $1 }
=> "one two thre"

I guess I overthought it, huh!
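
One caveat with the ^(.+). form, sketched below: ^ matches at the start of every line, and (.+) needs at least one character in front of the one being removed, so it does not behave like String#chop on multi-line or one-character strings. The lambda name is hypothetical and only wraps the expression from the post above.

```ruby
# Wraps the gsub expression from the post so it can be tried on several inputs.
chop_each_line = lambda { |s| s.gsub(/^(.+)./u) { $1 } }

# Single line: the last character is dropped, as intended.
puts chop_each_line.call("one two three").inspect

# Two lines: gsub matches once per line, so each line loses a character.
puts chop_each_line.call("one\ntwo").inspect

# One character: the pattern needs at least two, so nothing is removed.
puts chop_each_line.call("a").inspect
```

The \z-anchored pattern avoids both cases because it only ever looks at the very end of the string.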

ammar · November 3, 2010, 5:57pm

On Wed, Nov 3, 2010 at 6:38 PM, James Edward G. II
wrote:

Using gsub() over sub() was a dumb mistake on my part though. sub() is all you
need, since it can only match once.

Thanks for the clarification.

My method now looks like:

def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

  [lead, last]
end

Short and sweet.

Cheers,
Ammar

ammar · November 3, 2010, 6:00pm

On Nov 3, 2010, at 11:56 AM, Ammar A. wrote:

My method now looks like:

def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

The two lines above can be replaced with the more efficient:

  last = s[/.\z/mu] || ''

  [lead, last]
end

James Edward G. II
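
Put together, the method with that suggestion applied might look like the sketch below. The \u escapes in the usage lines assume Ruby 1.9 or later; the method body itself uses only what was posted in this thread.

```ruby
# Returns [everything but the last character, the last character].
# Returns nil for nil input, and ["", ""] for an empty string.
def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")   # string without its final character
  last = s[/.\z/mu] || ""     # the final character, or "" if there is none

  [lead, last]
end

p chop_utf8("caf\u00e9")
p chop_utf8("")
p chop_utf8(nil)
```

String#[] with a regex returns the matched text or nil, which is what makes the || fallback work.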

ammar · November 3, 2010, 6:26pm

On Wed, Nov 3, 2010 at 7:00 PM, James Edward G. II
wrote:

The two lines above can be replaced with the more efficient:

last = s[/.\z/mu] || ''

At this rate the method is going to disappear.

I updated the gist accordingly:

UTF-8 aware string chop. (the first gist was posted as anonymous) · GitHub

Thanks again,
Ammar

ammar · November 4, 2010, 3:19am

On Thu, Nov 4, 2010 at 1:25 AM, Ammar A. wrote:

On Wed, Nov 3, 2010 at 7:00 PM, James Edward G. II

last = s[/.\z/mu] || ''
I updated the gist accordingly:
UTF-8 aware string chop. (the first gist was posted as anonymous) · GitHub

Can we make that one pass?

str =~ /.\z/mu
[$`,$&]

best regards -botp
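
A sketch of that one-pass idea as a method, assuming the same return shape as chop_utf8. It uses MatchData instead of the $` and $& globals, because both globals are nil when the match fails (for example on an empty string). The method name is made up for this example.

```ruby
# One regex match yields both parts: pre_match is the text before the last
# character, and m[0] is the last character itself.
def chop_utf8_one_pass(str)
  return unless str

  m = /.\z/mu.match(str)
  m ? [m.pre_match, m[0]] : [str.dup, ""]
end

p chop_utf8_one_pass("caf\u00e9")
p chop_utf8_one_pass("abc\n")
p chop_utf8_one_pass("")
```

The fallback branch keeps the empty-string case returning ["", ""] rather than [nil, nil].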

ammar · November 4, 2010, 3:35pm

Ammar A. wrote in post #959047:

By the way, the m option seems superfluous in James’ version. I get
the same results without it.

foo = "abc\n"
=> "abc\n"

foo.sub(/.\z/mu, '')
=> "abc"

foo.sub(/.\z/u, '')
=> "abc\n"

ammar · November 4, 2010, 3:53pm

On Thu, Nov 4, 2010 at 4:37 PM, Brian C.
wrote:

Ammar A. wrote in post #959047:

By the way, the m option seems superfluous in James’ version. I get
the same results without it.

foo = "abc\n"
=> "abc\n"
foo.sub(/.\z/mu, '')
=> "abc"
foo.sub(/.\z/u, '')
=> "abc\n"

James clarified this earlier. But thanks for chiming in nonetheless.

Cheers,
Ammar