Daniel M. wrote:
I’ll point you at my solution to ruby quiz #83: (short but unique)
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197973
How would you write the method string_similarity without access to
each character? (This method computes the length of the longest
common substring.)
How would you compute the Levenshtein distance (edit distance) between
two strings without access to each character?
I’ll grant that I don’t have enough imagination and that there are cases
where you want character access. But it seems to me that the main use case
is for something like this:
str = "cogito <b>ergo</b> sum"
i = str.index("<b>") + 3
j = str.index("</b>", i)
str[i...j]
=> "ergo"
and for that common case, regexes are far more appropriate:
str.match(/<b>(.*?)<\/b>/)[1]
=> "ergo"
Advocating regexes-only for character manipulation is certainly extreme. I’m
just saying that byte access and character access need to have different
semantics. If you look at the current ruby String API, bytes are accessed
through integer positions and characters are accessed through regexes. The
byte and char APIs are quite distinct, it’s just that everybody is using the
byte API and expecting to get characters as a result.
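To make the byte/char split concrete, here is a rough sketch (not a reference implementation) that gets characters through a regex and then runs an edit-distance computation over them. The helper names chars_of and levenshtein are made up for illustration, and the /u flag assumes the strings are UTF-8.

```ruby
# Sketch only: `chars_of` and `levenshtein` are hypothetical helper names.

# Character API: split through a regex, one array element per character.
def chars_of(str)
  str.scan(/./mu)
end

# Edit distance computed over character arrays, keeping only two rows
# of the usual dynamic-programming table.
def levenshtein(a, b)
  s = chars_of(a)
  t = chars_of(b)
  prev = (0..t.length).to_a
  s.each_with_index do |sc, i|
    curr = [i + 1]
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

word = "caf\u00e9"               # "café", 4 characters, 5 bytes in UTF-8
p word.unpack("C*").length       # byte API      => 5
p chars_of(word).length          # character API => 4
p levenshtein(word, "cafe")      # => 1
```

The algorithm itself never touches an integer position in the string; it only needs the characters as a sequence, which the regex already provides.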
From what I understand (and please correct me if I’m wrong), Ruby 2 will fix
that by changing the API so that integer positions represent characters
instead of bytes. For binary strings, those two concepts map to the same
reality, so it won’t be such a backward-incompatible change. I just wonder
what the behavior of str[0] will be. Will it return a 0..255 integer in the
case of a binary string and a 1-character string in the case of a string with
an encoding set? Now that would be an API nightmare.
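As far as I know, Ruby 1.9 and later settled this by making str[0] always return a one-element String, with String#getbyte as the separate integer accessor. A small sketch of that behavior, where first_char_and_byte is a hypothetical helper name used only for this example:

```ruby
# Sketch only: `first_char_and_byte` is a hypothetical helper name.
# Returns what str[0] and str.getbyte(0) give for the same string.
def first_char_and_byte(str)
  [str[0], str.getbyte(0)]
end

binary = [255, 0].pack("C*")     # binary (ASCII-8BIT) string of two bytes
text   = "\u00e9t\u00e9"         # "été" as a UTF-8 string

# Binary string: a one-byte String and the Integer 255.
p first_char_and_byte(binary)

# UTF-8 string: the one-character String "é" and the Integer 195,
# which is the first byte of its two-byte encoding.
p first_char_and_byte(text)
```

So the return type of str[0] does not depend on the encoding; only how many bytes make up that first element does.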
How would you pull strings out of a file with fixed-width fields?
With regular expressions? Really? What if you had a hundred fields?
Hmm, fixed width records and fields were created for the purpose of fast
access to data, i.e. seek to position recnum*reclength and extract reclength
bytes; they only make sense in the case of single-byte characters. So this is
more a case of byte access.
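A minimal sketch of that kind of byte access, using seek plus String#unpack rather than regexes. The record layout (name 10 bytes, city 8 bytes, zip 5 bytes) and the names RECORD_FORMAT, RECORD_LENGTH and read_record are invented for the example:

```ruby
require "stringio"

# Hypothetical layout: name (10 bytes), city (8 bytes), zip (5 bytes).
RECORD_FORMAT = "A10A8A5"   # "A" = space-padded field, padding stripped
RECORD_LENGTH = 23

# Seek straight to record number recnum and split it into fields.
# Returns nil when recnum is past the end of the data.
def read_record(io, recnum)
  io.seek(recnum * RECORD_LENGTH)
  bytes = io.read(RECORD_LENGTH)
  bytes && bytes.unpack(RECORD_FORMAT)
end

# StringIO stands in for a real file opened in binary mode.
data = StringIO.new("Alice     Paris   75001" + "Bob       Berlin  10115")
p read_record(data, 1)   # => ["Bob", "Berlin", "10115"]
```

With a hundred fields the format string just gets longer; the positions are still byte offsets, not character offsets.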
Daniel
