Ruby 1.9 string slicing and StringScanner pointers

kch · December 14, 2009, 2:50pm

Hi all,

Earlier today I posted this question to Stack Overflow:

ruby 1.9: how do I get a byte-index-based slice of a String? - Stack Overflow

Basically, it boils down to this:

I’m working with UTF-8 strings in Ruby 1.9. I need to get a slice using byte-based indexes, not char-based, because StringScanner’s internal pointers are byte-based.

While I welcome answers to that question here, I’m posting to ask
something else:

Should this be considered a bug in StringScanner? Wouldn’t it make more
sense for it to use character indexes?

kch · December 14, 2009, 3:04pm

Caio C. wrote:

Should this be considered a bug in StringScanner? Wouldn’t it make more
sense for it to use character indexes?

I suspect the reason it does it this way is because it’s very expensive
in ruby 1.9 to jump to the Nth character. So if you were scanning a
large string, it would get slower and slower as you scanned further
along, calling #scan each time.

I think what you’re doing is the only option: tag the string as a
single-byte encoding (“ASCII-8BIT” would be better than “US-ASCII”),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.
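For readers following along, the retagging approach described above could look roughly like this. It is only a sketch: `byte_slice` is a hypothetical helper name, the sample text is made up, and it assumes both offsets come from StringScanner#pos and therefore fall on character boundaries.

```ruby
require 'strscan'

# Hypothetical helper (not part of strscan): returns the bytes in
# [from, to) of str, tagged again with str's original encoding.
# Assumes both offsets sit on character boundaries.
def byte_slice(str, from, to)
  original = str.encoding
  # Work on a copy so the caller's string keeps its encoding tag.
  bytes = str.dup.force_encoding(Encoding::ASCII_8BIT)
  bytes[from...to].force_encoding(original)
end

# "héllo wörld, ça va?" written with escapes so the bytes are unambiguous.
text = "h\u00E9llo w\u00F6rld, \u00E7a va?"

scanner = StringScanner.new(text)
scanner.scan_until(/ /)
start = scanner.pos   # byte offset just past the first space
scanner.scan_until(/,/)
stop = scanner.pos    # byte offset just past the comma

between = byte_slice(text, start, stop)   # the text between the two positions
by_chars = text[start...stop]             # same numbers used as character indexes
```

Here `between` and `by_chars` differ, because the multibyte characters before each position push the byte offsets ahead of the character offsets.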

kch · December 14, 2009, 4:49pm

Robert K. wrote:

I don’t know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.

Even if it did, I think the point remains that StringScanner#pos
wouldn’t be of much use if it gave a character offset, since str[n...m]
is an expensive operation in ruby 1.9.

If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn’t it? But I think the OP wanted to get the
buffer between two arbitrary match positions.
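As a small illustration of the point about #post_match, here is a sketch with a made-up sample string. It also shows the mismatch that started the thread: #pos counts bytes while String#length counts characters.

```ruby
require 'strscan'

# "naïve café" written with escapes so the bytes are unambiguous.
scanner = StringScanner.new("na\u00EFve caf\u00E9")

scanner.scan(/\S+/)                # matches the first word
word = scanner.matched
remainder = scanner.post_match     # everything after the match
byte_pos = scanner.pos             # byte offset of the scan pointer
char_count = word.length           # character count of the match
```

The first word is 5 characters but 6 bytes in UTF-8, so `byte_pos` cannot be used directly as a character index into the original string.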

kch · December 14, 2009, 4:26pm

2009/12/14 Brian C.:

Caio C. wrote:

Should this be considered a bug in StringScanner? Wouldn’t it make more
sense for it to use character indexes?

It would seem so.

I suspect the reason it does it this way is because it’s very expensive
in ruby 1.9 to jump to the Nth character. So if you were scanning a
large string, it would get slower and slower as you scanned further
along, calling #scan each time.

I don’t know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.

I think what you’re doing is the only option: tag the string as a
single-byte encoding (“ASCII-8BIT” would be better than “US-ASCII”),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.

Or use String#scan or another matching option, if that is possible.

Kind regards

robert

kch · December 14, 2009, 5:56pm

2009/12/14 Brian C.:

If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn’t it? But I think the OP wanted to get the
buffer between two arbitrary match positions.

You’re right. Frankly, I did not read the stackoverflow question
initially but from that it is obvious.

Still this leaves an awkward feeling: you have a String which can
informally be defined as a sequence of characters. Now, for most
application cases accessing the nth character seems to be a more
natural operation than accessing the nth byte. I know the internal
reasons for the fact that accessing the nth character is expensive
(variable length encodings) but from an interface perspective this is
not good IMHO.

Java did solve this with a specialized character type so you can have
arrays of char, but from what I recall about Matz’s comments the Java
model is flawed because it does not work well with non-Western
languages, namely Asian languages.

Btw, although UTF-16 is a fixed length encoding, char based accesses
are really slow:

#!ruby19

require 'benchmark'

REP = 100

s1 = "abcdeABCDE" * 1_000_000

encodings = ["ASCII", "UTF-8", "UTF-16BE", "UTF-16LE"]
strings = {}

# Build one copy of the test string per encoding.
encodings.each do |enc|
  strings[enc] = s1.encode(enc)
end

# Character index close to the end of the string.
idx = s1.length - 10

Benchmark.bmbm 30 do |b|
  encodings.each do |enc|
    str = strings[enc]
    # The UTF-16 variants get fewer repetitions because they are so slow.
    rep = /16/ =~ enc ? REP : REP * 1000

    b.report "enc %-10s rep %11d" % [enc, rep] do
      rep.times do
        s = str[idx..-1]
      end
    end
  end
end

Cheers

robert

kch · December 14, 2009, 6:07pm

On Dec 14, 2009, at 10:55 AM, Robert K. wrote:

Btw, although UTF-16 is a fixed length encoding,

UTF-16 is not a fixed length encoding. A UTF-16 character may be
encoded in two or four bytes.

I believe UTF-32 is fixed length, but even then it would not be cheap to
index due to Unicode’s use of “combining characters.” With them
multiple codepoints may represent a single index.

James Edward G. II
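Both points are easy to check in irb. The following is just an illustrative sketch with sample characters picked for the purpose: one character inside the Basic Multilingual Plane, one outside it, and one base letter followed by a combining accent.

```ruby
# U+00E9 (e with acute) is in the BMP: one 16-bit unit in UTF-16.
bmp = "\u00E9".encode("UTF-16BE")

# U+1D11E (musical G clef) is outside the BMP: a surrogate pair in UTF-16.
astral = "\u{1D11E}".encode("UTF-16BE")

bmp_bytes = bmp.bytesize
astral_bytes = astral.bytesize

# "e" followed by U+0301 (combining acute accent) displays as one
# accented letter but consists of two codepoints.
combined = "e\u0301"
codepoints = combined.length

# UTF-32 uses 4 bytes per codepoint, so the pair still takes 8 bytes.
utf32_bytes = combined.encode("UTF-32BE").bytesize
```

So a fixed-width encoding gives cheap indexing by codepoint, but a codepoint index is still not the same thing as the index of what a reader sees as one character.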

kch · December 14, 2009, 10:11pm

On 14.12.2009 18:07, James Edward G. II wrote:

On Dec 14, 2009, at 10:55 AM, Robert K. wrote:

Btw, although UTF-16 is a fixed length encoding,

UTF-16 is not a fixed length encoding. A UTF-16 character may be encoded in two or four bytes.

I believe UTF-32 is fixed length, but even then it would not be cheap to index due to Unicode’s use of “combining characters.” With them multiple codepoints may represent a single index.

Thanks for the education, James! I would have sworn UTF-16 is fixed
length…

For reference of other readers:

Combining characters - oh what a mess. I18n is really a minefield.

Kind regards

robert