I’m working with UTF-8 strings in Ruby 1.9. I need to get a slice using byte-based indexes, not char-based, because StringScanner’s internal pointers are byte-based.
While I welcome answers to that question here, I’m posting to ask
something else:
Should this be considered a bug in StringScanner? Wouldn’t it make more
sense for it to use character indexes?
I suspect the reason it does it this way is because it’s very expensive
in ruby 1.9 to jump to the Nth character. So if you were scanning a
large string, it would get slower and slower as you scanned further
along, calling #scan each time.
I think what you’re doing is the only option: tag the string as a
single-byte encoding (“ASCII-8BIT” would be better than “US-ASCII”),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.
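A minimal sketch of that workaround, assuming the offsets come from StringScanner#pos and therefore fall on character boundaries. The helper name `byte_slice` and the sample text are made up for illustration:

```ruby
require 'strscan'

# Hypothetical helper: slice `str` between two byte offsets and
# restore the original encoding afterwards.
def byte_slice(str, from, to)
  original = str.encoding
  # Retag a copy as binary so that [] indexes bytes, not characters.
  bytes = str.dup.force_encoding(Encoding::ASCII_8BIT)[from...to]
  # Retag the result; this is only valid if both offsets sit on
  # character boundaries, which holds for positions taken from #pos.
  bytes.force_encoding(original)
end

text = "h\u00e9llo w\u00f6rld"   # "héllo wörld", UTF-8
scanner = StringScanner.new(text)

scanner.scan(/\S+/)       # consumes "héllo"
scanner.scan(/\s+/)       # consumes the space
start = scanner.pos       # byte offset of "wörld"
scanner.scan(/\S+/)       # consumes "wörld"
finish = scanner.pos

word = byte_slice(text, start, finish)    # slices by bytes
wrong = text[start...finish]              # slices by characters
```

Here `word` is the second word, while `wrong` starts one character too late because `é` takes two bytes. On Ruby 1.9.3 and later, `String#byteslice` should do the same job without retagging.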
I don’t know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.
Even if it did, I think the point remains that StringScanner#pos
wouldn’t be of much use if it gave a character offset, since str[n...m]
is an expensive operation in ruby 1.9.
If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn’t it? But I think the OP wanted to get the
buffer between two arbitrary match positions.
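To make the distinction concrete, here is a small sketch with a made-up sample string. It shows #post_match returning the remainder, and how a character offset could be derived from the byte offset, which means decoding everything before it (the cost discussed above). It assumes Ruby 1.9.3 or later for `String#byteslice`:

```ruby
require 'strscan'

scanner = StringScanner.new("na\u00efve caf\u00e9 menu")  # "naïve café menu"

scanner.scan(/\S+\s+/)          # consumes "naïve "
rest = scanner.post_match       # everything after the last match
same_rest = scanner.rest        # same text, taken from the scan pointer

byte_pos = scanner.pos          # bytes consumed so far
# Deriving a character offset means walking all the preceding bytes,
# so it gets more expensive the further into the string you are.
char_pos = scanner.string.byteslice(0, byte_pos).length
```

With this input `byte_pos` is 7 and `char_pos` is 6, because `ï` occupies two bytes in UTF-8.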
Should this be considered a bug in StringScanner? Wouldn’t it make more
sense for it to use character indexes?
It would seem so.
I think what you’re doing is the only option: tag the string as a
single-byte encoding (“ASCII-8BIT” would be better than “US-ASCII”),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.
Or use String#scan or another matching option, if that is possible.
If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn’t it? But I think the OP wanted to get the
buffer between two arbitrary match positions.
You’re right. Frankly, I did not read the stackoverflow question
initially but from that it is obvious.
Still this leaves an awkward feeling: you have a String which can
informally be defined as a sequence of characters. Now, for most
application cases accessing the nth character seems to be a more
natural operation than accessing the nth byte. I know the internal
reasons for the fact that accessing the nth character is expensive
(variable length encodings) but from an interface perspective this is
not good IMHO.
Java did solve this with a specialized character type so you can have
arrays of char, but from what I recall about Matz’s comments the Java
model is flawed because it does not work well with non western
languages, namely Asian languages.
Btw, although UTF-16 is a fixed length encoding, char based accesses
are really slow:
UTF-16 is not a fixed length encoding. A UTF-16 character may be
encoded in two or four bytes.
I believe UTF-32 is fixed length, but even then it would not be cheap to
index due to Unicode’s use of “combining characters.” With them
multiple codepoints may represent a single index.
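Both points can be checked from Ruby itself. This is just an illustrative sketch; the sample characters are arbitrary choices, not anything from the original question:

```ruby
bmp    = "\u00e9"       # U+00E9, inside the Basic Multilingual Plane
astral = "\u{1D11E}"    # U+1D11E MUSICAL SYMBOL G CLEF, outside it

# UTF-16: one 16-bit unit for BMP characters, a surrogate pair otherwise.
bmp_utf16    = bmp.encode("UTF-16LE").bytesize
astral_utf16 = astral.encode("UTF-16LE").bytesize

# UTF-32: always four bytes per code point.
bmp_utf32    = bmp.encode("UTF-32LE").bytesize
astral_utf32 = astral.encode("UTF-32LE").bytesize

# Combining characters: "e" followed by COMBINING ACUTE ACCENT renders
# as one accented letter but is still two code points, even in UTF-32.
combined      = "e\u0301".encode("UTF-32LE")
combined_len  = combined.length     # counts code points
combined_size = combined.bytesize
```

The UTF-16 sizes come out as 2 and 4 bytes, the UTF-32 sizes as 4 and 4, and the combined sequence has length 2 (8 bytes) despite displaying as a single letter.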
Thanks for the education, James! I would have sworn UTF-16 is fixed
length…
For reference of other readers:
Combining characters - oh what a mess. I18n is really a minefield.
Kind regards
robert