Forum: Ruby how to read utf-16 with Open3.popen3

James French (Guest)
on 2014-06-18 17:08
(Received via mailing list)
Hi group,

I have a Windows executable that writes to stdout using UTF-16
encoding. When I read it using Open3.popen3 it comes in with what look
like spaces between the letters, and I can't figure out how to get it
converted to UTF-8. The docs for IO.pipe say that it can convert
between external and internal encodings. I've tried changing the
source code of popen3 to e.g. IO.pipe('utf-16:utf-8') and various
other things, but have had no luck. Any help greatly appreciated.

Cheers,
James
7stud -- (7stud)
on 2014-06-20 02:00
my_str = "hello world"
puts my_str.encoding  #=>US-ASCII
puts my_str.bytesize   #=>11

x = my_str.encode('UTF-16')
puts x.encoding   #=>UTF-16
puts x.bytesize   #=>24
p x   #=>"\uFEFFhello world"


result = x.encode("UTF-8", x.encoding)
puts result.encoding  #=>UTF-8
p result   #=>"hello world"


So you can do this:

Open3.popen3('some_program') do |stdin, stdout, stderr, wait_thr|
  data = stdout.read()
  result = data.encode("UTF-8", "UTF-16")
  #Or more generally: result = data.encode("UTF-8", data.encoding)
end
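If you'd rather not depend on the label that read() puts on the data
at all, you can drop to raw bytes first. A sketch, assuming the child
really emits UTF-16LE without a BOM (the `ruby -e` child here just
stands in for the real program):

```ruby
require 'open3'

# Child process stands in for the Windows executable; it emits UTF-16LE bytes
cmd = %q{ruby -e 'STDOUT.binmode; STDOUT.write("hello".encode(Encoding::UTF_16LE))'}

result = Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
  stdout.binmode   # read raw bytes; no encoding label to mistrust
  stdout.read.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end
# result == "hello"
```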

For what it's worth, if I use the following program as the target of
popen3():

my_str = "hello world"
#puts my_str.encoding
#puts my_str.bytesize

x = my_str.encode('UTF-16')
#puts x.encoding
#puts x.bytesize
#p x

puts x


And then I read from stdout:

#!/usr/bin/env ruby

require 'open3'

Open3.popen3('ruby do_stuff.rb') do |stdin, stdout, stderr, wait_thr|
  data = stdout.read()
  p data.encoding
end


The encoding is "UTF-8".  I'm not sure how that works.
Abinoam Jr. (abinoampraxedes_m)
on 2014-06-20 04:34
(Received via mailing list)
Hi James French,

The main source of problems I've had with encodings is a string whose
bytes are in one encoding but which is "marked" as another.

If my default encoding is utf-8, and I open a utf-16 file, Ruby will
not "auto-detect" it as utf-16: the strings coming from that file will
be marked as utf-8, but their byte representation will be utf-16.

If I try this:

str.encode(Encoding::UTF_8)

it will do NOTHING, because Ruby thinks the string is already
Encoding::UTF_8.
So I have to do something like

str.encode(Encoding::UTF_8, Encoding::UTF_16)

or

str.force_encoding(Encoding::UTF_16).encode(Encoding::UTF_8)

to inform the "source" encoding.

Another source of error (at least for me) is that some UTF-16 streams
just lack the BOM (Byte Order Mark), so Ruby cannot guess whether they
are little endian or big endian. You can "hint" it with
Encoding::UTF_16BE or Encoding::UTF_16LE.
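For instance, a contrived sketch (the bytes in `raw` are hand-built
here; in real life they would come from the file or pipe):

```ruby
# UTF-16LE bytes for "hi", but Ruby has labeled the string UTF-8
raw = "h\x00i\x00".dup
raw.encoding                   # => #<Encoding:UTF-8> -- wrong label
raw.encode(Encoding::UTF_8)    # no-op: Ruby trusts the label

# Fix the label first, then transcode for real
fixed = raw.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
# fixed == "hi"
```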

I created a gist with my IRB session (for clarifying)
https://gist.github.com/abinoam/224cbffd5cae7f591b10

I hope it helps, and sorry if it is not related to your problem.

Best regards,
Abinoam Jr.

Matthew Kerwin (mattyk)
on 2014-06-20 05:55
(Received via mailing list)
On 20 June 2014 12:33, Abinoam Jr. <abinoam@gmail.com> wrote:

>
> Another source of error (at least for me) is that some UTF_16 streams
> just lacks that "BOM" comand (Byte Order Mark), so it cannot guess if
> it is little endian or big endian. So you could "hint" it with
> Encoding::UTF_16BE or Encoding::UTF_16LE.
>
This is the most common problem, AFAIK. If your data doesn't have a
BOM (the first bytes = 0xFFFE or 0xFEFF) then the system can't know
which UTF-16 encoding you're using.
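A minimal sketch of sniffing those first bytes yourself, assuming the
stream does start with a BOM (`utf16_encoding_from_bom` is a made-up
helper name, not a library method):

```ruby
# Return the matching UTF-16 encoding for a leading BOM, or nil if absent
def utf16_encoding_from_bom(data)
  case data.byteslice(0, 2).bytes
  when [0xFF, 0xFE] then Encoding::UTF_16LE
  when [0xFE, 0xFF] then Encoding::UTF_16BE
  end
end

bom_le = "\xFF\xFEh\x00i\x00".b
utf16_encoding_from_bom(bom_le)  # => #<Encoding:UTF-16LE>
```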

Here's a much tinier gist than Abinoam's, which does some inter-process
communications: https://gist.github.com/phluid61/40b9d5b32c9edabaae07

Cheers
James French (Guest)
on 2014-06-20 16:15
(Received via mailing list)
Thank you Abinoam and Matthew. Incredibly useful. It doesn't appear
that you can pass the encoding in up front with Open3.popen3, but I
was able to change my IO-processing code to

      chunk = io.readpartial(4096)
      chunk.encode!(Encoding::UTF_8, external_encoding) if external_encoding
      buffer << chunk

where external_encoding was Encoding::UTF_16LE, and it fixed my
problem.

Very grateful for that great help!

7stud -- (7stud)
on 2014-06-21 00:03
Abinoam Jr. wrote in post #1150267:
> Hi James French,
>
> The main source of problems I've had with encodings is a string
> whose bytes are in one encoding but which is "marked" as another.
>
> If my default encoding is utf-8, and I open a utf-16 file, Ruby will
> not "auto-detect" it as utf-16: the strings coming from that file
> will be marked as utf-8, but their byte representation will be
> utf-16.

Ah hah.  That is what I'm seeing in the example I posted above.  If I
add a line to inspect the string read in by popen3():

require 'open3'

Open3.popen3('ruby do_stuff.rb') do |stdin, stdout, stderr, wait_thr|
  data = stdout.read
  p data.encoding
  p data
end

--output:--
#<Encoding:UTF-8>
"\xFE\xFF\u0000h\u0000e\u0000l\u0000l\u0000o\u0000 \u0000w\u0000o\u0000r\u0000l\u0000d\n"

Clearly, that is not a UTF-8 encoding (ASCII characters occupy 1 byte in
UTF-8).

But where does UTF-8 come from?

$ ruby -v
ruby 1.9.3p547 (2014-05-14 revision 45962) [x86_64-darwin10.8.0]

Let's check with James Edward Gray II....

==
There's another way Strings are commonly created and that's by reading
from some IO object. It doesn't make sense to give those Strings the
source Encoding [by default US-ASCII in Ruby 1.9.3] because the
external data doesn't have to be related to your source code.

The external Encoding is the Encoding the data is in inside the IO
object.

The default external Encoding is pulled from your environment, much like
the source Encoding is for code given on the command-line. Have a look:

$ echo $LC_CTYPE
==
 http://graysoftinc.com/character-encodings/ruby-19...

I get a blank line for that last command (on a Mac), but I can do
this:

$ echo $LANG
en_US.UTF-8
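On such a system the pieces line up like this (a trivial check):

```ruby
# default_external is the label given to strings read from IO objects;
# on a machine with LANG=en_US.UTF-8 this prints UTF-8
puts Encoding.default_external.name

# source-code literals get the script encoding instead
puts "literal".encoding.name
```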


> It doesn’t appear that
> you can pass the encoding in up front with Open3.popen3

No, but after looking through the Ruby docs for a while, you can do
it `post festum`:


do_stuff.rb:

#!/usr/bin/env ruby

my_str = "hello world"
#puts my_str.encoding
#puts my_str.bytesize

x = my_str.encode('UTF-16')
#puts x.encoding
#puts x.bytesize
#p x
print x


my_prog.rb:

require 'open3'

Open3.popen3('ruby do_stuff.rb') do |stdin, stdout, stderr, wait_thr|
  puts Encoding.default_external.name
  puts "popen3_stdout external: #{stdout.external_encoding.name}"

  stdout.set_encoding 'UTF-16:UTF-8'   #<---HERE***
  #read() data as UTF-16 and convert to UTF-8

  puts "popen3_stdout external: #{stdout.external_encoding.name}"

  data = stdout.read
  puts "data says it's encoded with: #{data.encoding}"
  puts "Let's see if that's true:"
  p data
end

--output:--
UTF-8
popen3_stdout external: UTF-8
popen3_stdout external: UTF-16
data says it's encoded with: UTF-8
Let's see if that's true:
"hello world"
Matthew Kerwin (mattyk)
on 2014-06-21 03:06
(Received via mailing list)
On 21 June 2014 00:14, James French <James.French@naturalmotion.com>
wrote:

>       buffer << chunk
>
> Where external_encoding was UTF-16LE and it fixed my problem.
The only issue to watch for here is: what happens if you have a
surrogate pair split over that 4096-byte boundary? The chunk.encode!
line will probably fail. Granted, it's not likely if your output is
normal text from the BMP, but it's still something to watch for.
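One way to guard against that (a sketch, not battle-tested, and
`split_complete_utf16le` is a name I just made up) is to hold back any
trailing bytes that don't yet form a complete character and prepend
them to the next chunk:

```ruby
require 'stringio'

# Split a byte string into [complete UTF-16LE prefix, leftover bytes]
def split_complete_utf16le(bytes)
  len = bytes.bytesize - (bytes.bytesize % 2)       # drop a trailing odd byte
  if len >= 2
    unit = bytes.byteslice(len - 2, 2).unpack1('v') # last 16-bit unit, little-endian
    len -= 2 if unit.between?(0xD800, 0xDBFF)       # hold back a lone high surrogate
  end
  [bytes.byteslice(0, len), bytes.byteslice(len..).b]
end

# StringIO stands in for the pipe; tiny chunks force a split mid-character
io     = StringIO.new("a\u{1F600}b".encode(Encoding::UTF_16LE).b)
carry  = "".b
buffer = +""
until io.eof?
  chunk = carry + io.readpartial(3).b
  complete, carry = split_complete_utf16le(chunk)
  buffer << complete.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end
# buffer == "a\u{1F600}b", even though the surrogate pair straddled a read
```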


> Very grateful for that great help!

No worries.