Byte-stream parsing in Ruby

So, I have a problem. I’m using ncurses (or possibly not; I might just
use STDIN.read(1) or something, we’ll see) to grab byte-level input from
the terminal. The purpose is to catch and handle control characters in a
text-mode application, such as “meta-3” or “control-c.”

Currently, I have a really ugly method that manually parses UTF-8 and
ASCII directly in my Ruby source; however, this is extremely slow, and
seems quite a bit like overkill. After all, with 1.9’s wonderfully
robust Encoding support, it seems silly to duplicate all that
byte-parsing work that must be going on somewhere in Ruby already.

Here’s my current method (forgive the horrendous code, please! I fully
intended to get rid of it right from the start, so…):

The goal is to devise some method by which I can:

  1. Determine whether or not an Array of so-far-received bytes is, yet,
     a valid String of a given Encoding (I can get the intended input
     Encoding by way of a simple Encoding.find(:locale), so we’re always
     in the know as to which Encoding the incoming bytes are intended to
     be)
  2. Once we know the Array instance containing the relevant bytes
     pertains to a valid String, convert that into a String and further
     store/cache/process it in some way.
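The two steps above can be sketched with Array#pack and String#valid_encoding? (a minimal sketch of my own; UTF-8 is assumed here purely for illustration, where the real program would use Encoding.find("locale")):

```ruby
# Step 1: pack the so-far-received bytes into a String and test whether
# they already form a valid character in the target Encoding.
bytes = [195, 188]                              # the two bytes of "ü"
candidate = bytes.pack("C*").force_encoding("UTF-8")

# Step 2: once valid, the candidate is the finished character.
if candidate.valid_encoding?
  p candidate                                   # => "ü"
end
```

With only the first byte ([195]) in the Array, valid_encoding? returns false, which is exactly the "wait for more bytes" signal wanted here.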

Yes, this means that the String will almost always be one character
long; I am uninterested in parsing lengths of characters out of the
input stream, I can deal with that later. At the moment, I very simply
want to ensure that I can retrieve, in real time, the latest character
entered at the terminal, as a String, in any Encoding.

Any help would be much appreciated; I’ve been banging my head against
this on and off for weeks! (-:

Elliott Cable wrote:

The goal is to devise some method by which I can:

  1. Determine whether or not an Array of so-far-received bytes is, yet,
     a valid String of a given Encoding

"über".bytes.to_a
=> [195, 188, 98, 101, 114]

a = "\xc3".force_encoding("UTF-8")
=> "\xC3"

a.valid_encoding?
=> false

a << "\xbc"
=> "ü"

a.valid_encoding?
=> true

Brian C. wrote:

Elliott Cable wrote:

The goal is to devise some method by which I can:

  1. Determine whether or not an Array of so-far-received bytes is, yet,
     a valid String of a given Encoding

"über".bytes.to_a
=> [195, 188, 98, 101, 114]

a = "\xc3".force_encoding("UTF-8")
=> "\xC3"

a.valid_encoding?
=> false

a << "\xbc"
=> "ü"

a.valid_encoding?
=> true

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(Fixnum) bytes onto the string?

7stud – wrote:

Elliott Cable wrote:

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(Fixnum) bytes onto the string?

hex_str = '\x%x' % 195
puts hex_str

--output:--
\xc3

That is not exactly ideal. Is there a cleaner way?

Elliott Cable wrote:

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(Fixnum) bytes onto the string?

hex_str = '\x%x' % 195
puts hex_str

--output:--
\xc3

Elliott Cable wrote:

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(Fixnum) bytes onto the string?

If you’re doing STDIN.read(1), you get a String. Just use << to
concatenate, or the 2-argument form of read() where you supply a buffer
(note that read() replaces the buffer’s contents rather than appending).

If you are forced to use Fixnum, then try Integer#chr:

str = ""
=> ""

str << 255.chr
=> "\xFF"
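If the bytes are already collected in an Array (as in the original question), Array#pack sidesteps the per-byte chr calls entirely (a sketch of mine, not from the thread):

```ruby
# Pack a whole Array of Fixnum byte values into a String in one call.
bytes = [195, 188, 98, 101, 114]    # "über" as raw bytes
str = bytes.pack("C*")              # binary (ASCII-8BIT) String
str.force_encoding("UTF-8")         # relabel, no conversion
str.valid_encoding?                 # => true
```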

Warning: you have to deal with all the (undocumented) ruby-1.9 encoding
stupidity. Have fun guessing the behaviour of each of the methods. e.g.

str = "hello"
=> "hello"

str.encoding
=> #<Encoding:UTF-8>

str << 255.chr
=> "hello\xFF"

str.encoding
=> #<Encoding:ASCII-8BIT>

Surprised that the encoding changed? This means that:

str.valid_encoding?
=> true

until you do:

str.force_encoding("UTF-8")
=> "hello\xFF"

str.valid_encoding?
=> false

Now have a guess what happens if you try to append another byte. Go on.

str.encoding
=> #<Encoding:UTF-8>

str << 250.chr
Encoding::CompatibilityError: incompatible character encodings: UTF-8
and ASCII-8BIT
	from (irb):24
	from /usr/local/bin/irb19:12:in `<main>'

Haha, fooled you. You thought it was safe to append a non-UTF8 character
to a UTF8 string (after all, you did before quite happily), but this
time you get an exception. So now you have to do:

str.force_encoding("ASCII-8BIT")
=> "hello\xFF"

str << 250.chr
=> "hello\xFF\xFA"

str.force_encoding("UTF-8")
=> "hello\xFF\xFA"

str.valid_encoding?
=> false

This is why I hate ruby 1.9.

Regards,

Brian.

P.S. The above example was with ruby 1.9.2 r23158 under Linux with UTF8
locale. Behaviour may or may not be different with other 1.9.x versions
and/or under different locale settings.

Incidentally, I needed to do something similar in ruby-1.8 recently, and
it was very straightforward.

require 'iconv'

def is_utf8?(str)
  Iconv.iconv('UTF-8', 'UTF-8', str)
  true
rescue Iconv::IllegalSequence
  false
end
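A 1.9-style analogue of that Iconv check, using #valid_encoding? instead, might look like this (my own sketch, not from the thread; note the dup, since #force_encoding mutates its receiver):

```ruby
# Check UTF-8 validity without Iconv (Ruby 1.9+).
def is_utf8?(str)
  str.dup.force_encoding("UTF-8").valid_encoding?
end

is_utf8?("über")      # => true
is_utf8?("\xFF\xFA")  # => false
```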

On Jul 22, 2009, at 00:46, Brian C. wrote:

=> "hello\xFF\xFA"

str.valid_encoding?
=> false

This is why I hate ruby 1.9.

I don’t think that’s a valid UTF-8 byte sequence…

Incidentally, I needed to do something similar in ruby-1.8 recently, and
it was very straightforward.

def is_utf8?(str)
  Iconv.iconv('UTF-8', 'UTF-8', str)
  true
rescue Iconv::IllegalSequence
  false
end

Oh, I see there’s another tool; let’s try it!

$ cat conv.rb
str = "\xFF\xFA"

require 'iconv'

converted = Iconv.iconv 'UTF-8', 'UTF-8', str

puts converted
$ ruby -v conv.rb
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]
conv.rb:6:in `iconv': "\377\372" (Iconv::IllegalSequence)
	from conv.rb:6

Ok, so it’s not valid. Let’s get a valid byte sequence…

$ cat conv.rb
str = "\xE2\x98\x83"

require 'iconv'

converted = Iconv.iconv 'UTF-8', 'UTF-8', str

puts converted
$ ruby conv.rb
☃

Ok, so that works!

Now let’s use 1.9’s built-in encoding stuff with our valid byte
sequence:

$ cat conv.rb
# encoding: utf-8

str = "hello "
p :encoding => str.encoding
str << 0xE2.chr
str << 0x98.chr
str << 0x83.chr

puts str
$ ruby19 conv.rb
{:encoding=>#<Encoding:UTF-8>}
hello ☃
hello ☃

huh, it worked fine.

So you’re mad that Ruby doesn’t let you shoot yourself in the foot?

On Jul 23, 2009, at 2:47 AM, Brian C. wrote:

That’s the whole point. The OP wanted to append bytes to a string, and
detect whether the resulting string was a valid set of complete UTF-8
codepoints, or whether it was necessary to wait for more byte(s) for
it
to become complete.

Ruby 1.9’s valid_encoding? method seems to do that for you - except
that
all the automagical and undocumented mutation of Strings gets in the
way.

I’m pretty sure I document all the behavior we’ve seen in this thread
(and much more), in this single article on my blog:

http://blog.grayproductions.net/articles/ruby_19s_string

I’m really not sure why you seem totally unwilling to count my
articles as a valid source of information after all this time. They
continually explain what you say is unexplained. I’ve asked you in
the past to list what they don’t cover, but aside from the C API side
of things (which I admit I don’t cover) you’re just all out of
excuses. I assume you simply have no desire to read them. Fair
enough, but hopefully others do. I feel that means we should list
them as an available resource.

I’m not sure what “automagical” means in this context either, but I
don’t feel it’s a good description. I assume “auto” is for
“automatic.” Is Ruby automatically changing the Encoding? I don’t
think so. The programmer is asking Ruby to add two Strings with
different Encodings. Ruby could just say no, but in this case there
is a way it can be done, so it makes the choice, assuming that’s what
you wanted.

I guess “magical” may just mean you don’t understand what’s happening
here. I do though, so there’s certainly a process we can break down
and understand.

to the end. This shows that the string’s encoding has magically
mutated
without a by-your-leave.

That’s not true. You asked Ruby to combine those Strings of differing
content. You gave your permission.

Argh, you need to mutate it back to ASCII-8BIT first.

As always, you are just not explaining what these examples show. The
str variable contains some UTF-8 content. There is another String
involved here though and we should examine its Encoding:

0xFF.chr.encoding
=> #<Encoding:ASCII-8BIT>

So what you are really asking Ruby to do is to combine data in two
different Encodings. There is a way to do that here, thanks to Ruby’s
concept of compatible Encodings. Given that, the conversion is made.
If you had wanted to keep that data in UTF-8, you should have added
more UTF-8 bytes to it:

("abc".force_encoding("UTF-8") <<
0xFF.chr.force_encoding("UTF-8")).encoding
=> #<Encoding:UTF-8>

There’s no magic here. It’s a process. We can explain it. I have.

James Edward G. II

I’ve briefly read sections 8 to 11 again.

Where does it say that String#<< can now raise an exception, and under
what circumstances? Ah, I finally found it, right at the end of the
comments at the bottom of section 8, added a month after initial
publication. (+)

Where does it say that the encoding of a String can change when you
concatenate another string onto it?

By “undocumented” I mean: I expect to type “ri String#<<” and see an
accurate description of what String#<< does, including which
combinations of inputs are valid and which are not, and which attributes
of the String may mutate based on the input supplied.

Regards,

Brian.

(+) There is a warning in the string comparisons section saying that,
basically, the rules are too complicated to understand, so you should
always ensure that two strings are in the same encoding before comparing
them. Arguably you could say the same applies to any other operation
which takes two strings.

But this to me shows the whole exercise is futile. If, in order to write
a valid program, you need to ensure that all strings are in the same
encoding, then there should be a global flag which sets the encoding. If
I cannot predict what will happen when string A (encoding X) encounters
string B (encoding Y), and I have to keep forcing the encodings to X,
then there’s no benefit in having the capability for strings to carry
about their own encodings.

And in many apps, the encoding information is carried “out of band”
anyway: for example: in HTTP or MIME, the encoding info is in a
Content-Type: header.

Eric H. wrote:

On Jul 22, 2009, at 00:46, Brian C. wrote:

=> "hello\xFF\xFA"

str.valid_encoding?
=> false

This is why I hate ruby 1.9.

I don’t think that’s a valid UTF-8 byte sequence…

That’s the whole point. The OP wanted to append bytes to a string, and
detect whether the resulting string was a valid set of complete UTF-8
codepoints, or whether it was necessary to wait for more byte(s) for it
to become complete.

Ruby 1.9’s valid_encoding? method seems to do that for you - except that
all the automagical and undocumented mutation of Strings gets in the
way. Sometimes, ruby lets you concatenate an arbitrary byte to a UTF-8
string without an exception; sometimes it does not. It appears this is
something to do with the concept of “compatible encodings”.

Now let’s use 1.9’s built-in encoding stuff with our valid byte
sequence:

$ cat conv.rb
# encoding: utf-8

str = "hello "
p :encoding => str.encoding
str << 0xE2.chr
str << 0x98.chr
str << 0x83.chr

puts str
$ ruby19 conv.rb
{:encoding=>#<Encoding:UTF-8>}
hello ☃

huh, it worked fine.

Yes, but you forgot to add another

p :encoding => str.encoding

to the end. This shows that the string’s encoding has magically mutated
without a by-your-leave.

So now to test whether the encoding is valid or not, you have to mutate
the string back again:

str.force_encoding("UTF-8")
puts "is valid" if str.valid_encoding?

OK, then what happens if you concatenate another byte?

str << 0xFF.chr # boom

Argh, you need to mutate it back to ASCII-8BIT first.

So you’re mad that Ruby doesn’t let you shoot yourself in the foot?

I’m mad that Ruby has behaviour which is (a) undocumented, and (b) IMO
just plain stupid, and you have to expend ridiculous effort both to
understand it and to work around it.

I’m actually attempting to document it in my spare time, in the form of
a Test::Unit script. It looks like I’m going to have over 200
assertions. This is time I should probably have spent migrating code to
Erlang - which incidentally has a very sensible proposal for Unicode
handling.

Thank goodness for those people maintaining 1.8.6 and related forks like
Ruby Enterprise Edition.

Regards,

Brian.

On Jul 23, 2009, at 10:02, Brian C. wrote:

If I cannot predict what will happen when string A (encoding X)
encounters
string B (encoding Y), and I have to keep forcing the encodings to X,
then there’s no benefit in having the capability for strings to carry
about their own encodings.

I think you have a misconception about what #force_encoding does. It
does not do any conversion. Use Encoding::Converter for that.

While #force_encoding does approximately what you want in the examples
you’ve shown (ASCII, binary data and UTF-8 encodings) it won’t work
when you’re reading one multibyte encoding (say, Shift-JIS from an IO)
and adding it to another multibyte encoding (say, a UTF-8 String).
You’ll only end up with garbage if you don’t use a converter.
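To illustrate that distinction (a sketch of my own; the byte pair below is the Shift_JIS encoding of the hiragana character "あ"):

```ruby
# force_encoding merely relabels the bytes; encode actually transcodes.
sjis = "\x82\xA0".force_encoding("Shift_JIS")   # "あ" in Shift_JIS
utf8 = sjis.encode("UTF-8")                     # real conversion
utf8.bytes.to_a                                 # => [227, 129, 130]

# Merely relabeling the same bytes produces invalid UTF-8 garbage:
"\x82\xA0".force_encoding("UTF-8").valid_encoding?  # => false
```

String#encode is the convenience front end; Encoding::Converter offers the same transcoding with finer control over partial input and error handling.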

For 1.9, I don’t think io.read(1) is correct. #getc is better since
it’ll read what you want:

$ cat file
π
$ irb19
irb(main):001:0> open 'file' do |io| p io.getc end
"π"
=> "π"
irb(main):002:0> open 'file' do |io| io.set_encoding 'binary'; p
io.getc end
"\xCF"
=> "\xCF"

Even for control characters:

$ ruby19 -e 'p $stdin.getc'
^I
"\t"
$

Thanks to everybody involved here, I now have a great solution that
works really well. I also ended up using EventMachine to get the
individual bytes from the keyboard; it’s a lot more efficient. Here’s my
final solution, in case anybody’s interested:

require 'eventmachine'

module Handler
  def initialize
    @buffer = ""
  end

  def receive_data byte
    byte.force_encoding Encoding.find('locale')
    @buffer << byte
    check_buffer
  end

  private
    def check_buffer
      if @buffer.valid_encoding?
        p @buffer
        @buffer = ""
      end
    end
end

EM.run{ EM.open_keyboard Handler }
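For reference, the same buffer-and-check technique works without EventMachine; here is a sketch (each_char_from is my own illustrative name, not part of any library), with StringIO standing in for the terminal:

```ruby
require 'stringio'  # only needed for the usage example below

# Read an IO byte-by-byte, yielding each complete character once the
# buffered bytes form a valid string in the given Encoding.
def each_char_from(io, enc = Encoding.find("locale"))
  buffer = "".force_encoding(enc)
  while (byte = io.read(1))
    buffer << byte.force_encoding(enc)
    if buffer.valid_encoding?
      yield buffer
      buffer = "".force_encoding(enc)
    end
  end
end

# Usage: yields "ü", "b", "e", "r" one character at a time.
each_char_from(StringIO.new("über"), Encoding::UTF_8) { |c| p c }
```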

On Jul 23, 2009, at 12:02 PM, Brian C. wrote:

Where does it say that String#<< can now raise an exception, and under
what circumstances?

Quoting from the page I linked to in my last message:

It’s probably worth mentioning that it is possible for a transcoding
operation to fail with an error. For example:

$ cat transcode.rb
# encoding: UTF-8
utf8 = "Résumé…"
latin1 = utf8.encode("ISO-8859-1")
$ ruby transcode.rb
transcode.rb:3:in `encode': "\xE2\x80\xA6" from UTF-8 to ISO-8859-1
(Encoding::UndefinedConversionError)
	from transcode.rb:3:in `<main>'

Naturally this fails because “…” is not a valid character in Latin-1.

Ah, I finally found it, right at the end of the
comments at the bottom of section 8, added a month after initial
publication. (+)

What does how long it took me to write the content have to do with
anything? I added that comment to cover some items you had mentioned
I had overlooked. Now it’s invalid because it took me a while???

Where does it say that the encoding of a String can change when you
concatenate another string onto it?

Quoting from the same page:

One thing that may help a little in normalizing your data is Ruby’s
concept of compatible Encodings. Here’s an example of checking and
taking advantage of compatible Encodings:

# data in two different Encodings
p ascii_my # >> "My "
puts ascii_my.encoding.name # >> US-ASCII
p utf8_resume # >> "Résumé"
puts utf8_resume.encoding.name # >> UTF-8

# check compatibility
p Encoding.compatible?(ascii_my, utf8_resume) # >> #<Encoding:UTF-8>

# combine compatible data
my_resume = ascii_my + utf8_resume
p my_resume # >> "My Résumé"
puts my_resume.encoding.name # >> UTF-8
In this example I had data in two different Encodings, US-ASCII and
UTF-8. I asked Ruby if the two pieces of data were compatible?(). Ruby
can respond to that question in one of two ways. If it returns false,
the data is not compatible and you will probably need to transcode at
least one piece of it to work with the other. If an Encoding is
returned, the data is compatible and can be concatenated resulting in
data with the returned Encoding. You can see how that played out when
I combined these Strings.
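That return-value convention is easy to check directly (a small sketch of my own, mirroring the quoted example with self-contained data):

```ruby
ascii = "My ".force_encoding("US-ASCII")
utf8  = "Résumé"                         # UTF-8 in a UTF-8 source file

# Compatible: ASCII-only data can join UTF-8 data; the result's
# Encoding is returned.
p Encoding.compatible?(ascii, utf8)      # => #<Encoding:UTF-8>

# Not compatible: returns nil, so one side must be transcoded first.
binary = "\xFF".force_encoding("ASCII-8BIT")
p Encoding.compatible?(utf8, binary)     # => nil
```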

(+) There is a warning in the string comparisons section saying
that,
basically, the rules are too complicated to understand, so you should
always ensure that two strings are in the same encoding before
comparing
them. Arguably you could say the same applies to any other operation
which takes two strings.

But this to me shows the whole exercise is futile.

But you should be doing the exact same thing in Ruby 1.8, which I
understand you believe to be a superior system. If you are going to
have two pieces of data interact, it just makes sense that they will
pretty much always need to be the same kinds of data.

If, in order to write a valid program, you need to ensure that all
strings are in the same encoding, then there should be a global flag
which sets the encoding.

Like -E and -U in Ruby 1.9?

And in many apps, the encoding information is carried “out of band”
anyway: for example: in HTTP or MIME, the encoding info is in a
Content-Type: header.

Yeah, that’s why a global switch won’t really save you from doing your
job. You need to read that header, and treat the content accordingly.

James Edward G. II
