Forum: Ruby how to read utf-16 with Open3.popen3

James French (Guest)
on 2014-06-18 17:08
(Received via mailing list)
Hi group,

I have a Windows executable that writes to stdout using UTF-16
encoding. When I read it using Open3.popen3 it comes in with what look
like spaces between the letters, and I can't figure out how to get it
converted to UTF-8. The docs for IO.pipe say that it can convert
between external and internal encodings. I've tried changing the
source code of popen3 to e.g. IO.pipe('utf-16:utf-8') and various
other things, but have had no luck. Any help greatly appreciated.

Cheers,
James
7stud -- (7stud)
on 2014-06-20 02:00
my_str = "hello world"
puts my_str.encoding  #=>US-ASCII
puts my_str.bytesize   #=>11

x = my_str.encode('UTF-16')
puts x.encoding   #=>UTF-16
puts x.bytesize   #=>24
p x   #=>"\uFEFFhello world"


result = x.encode("UTF-8", x.encoding)
puts result.encoding  #=>UTF-8
p result   #=>"hello world"


So you can do this:

Open3.popen3('some_program') do |stdin, stdout, stderr, wait_thr|
  data = stdout.read()
  result = data.encode("UTF-8", "UTF-16")
  #Or more generally: result = data.encode("UTF-8", data.encoding)
end
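If you'd rather not depend on the label that read() puts on the data
at all, you can drop to raw bytes first. A sketch, assuming the child
really emits UTF-16LE without a BOM (the `ruby -e` child here just
stands in for the real program):

```ruby
require 'open3'

# Child process stands in for the Windows executable; it emits UTF-16LE bytes
cmd = %q{ruby -e 'STDOUT.binmode; STDOUT.write("hello".encode(Encoding::UTF_16LE))'}

result = Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
  stdout.binmode   # read raw bytes; no encoding label to mistrust
  stdout.read.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end
# result == "hello"
```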

For what it's worth, if I use the following program as the target of
popen3():

my_str = "hello world"
#puts my_str.encoding
#puts my_str.bytesize

x = my_str.encode('UTF-16')
#puts x.encoding
#puts x.bytesize
#p x

puts x


And then I read from stdout:

#!/usr/bin/env ruby

require 'open3'

Open3.popen3('ruby do_stuff.rb') do |stdin, stdout, stderr, wait_thr|
  data = stdout.read()
  p data.encoding
end


The encoding is "UTF-8".  I'm not sure how that works.
Abinoam Jr. (abinoampraxedes_m)
on 2014-06-20 04:34
(Received via mailing list)
Hi James French,

The main source of problems I've had with encodings is a string whose
bytes are in one encoding but which is "marked" as another.

If my default encoding is utf-8, and I open a utf-16 file, Ruby will
not "auto-detect" it as utf-16: the strings coming from that file will
be marked as utf-8, but their byte representation will be utf-16.

If I try this:

str.encode(Encoding::UTF_8)

it will do NOTHING, because Ruby thinks the string is already
Encoding::UTF_8.
So I have to do something like

str.encode(Encoding::UTF_8, Encoding::UTF_16)

or

str.force_encoding(Encoding::UTF_16).encode(Encoding::UTF_8)

to inform the "source" encoding.

Another source of error (at least for me) is that some UTF-16 streams
just lack the BOM (Byte Order Mark), so Ruby cannot guess whether they
are little endian or big endian. You can "hint" it with
Encoding::UTF_16BE or Encoding::UTF_16LE.
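For instance, a contrived sketch (the bytes in `raw` are hand-built
here; in real life they would come from the file or pipe):

```ruby
# UTF-16LE bytes for "hi", but Ruby has labeled the string UTF-8
raw = "h\x00i\x00".dup
raw.encoding                   # => #<Encoding:UTF-8> -- wrong label
raw.encode(Encoding::UTF_8)    # no-op: Ruby trusts the label

# Fix the label first, then transcode for real
fixed = raw.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
# fixed == "hi"
```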

I created a gist with my IRB session (for clarifying)
https://gist.github.com/abinoam/224cbffd5cae7f591b10

I hope it helps, and sorry if it is not related to your problem.

Best regards,
Abinoam Jr.

Matthew Kerwin (mattyk)
on 2014-06-20 05:55
(Received via mailing list)
On 20 June 2014 12:33, Abinoam Jr. <abinoam@gmail.com> wrote:

>
> Another source of error (at least for me) is that some UTF_16 streams
> just lacks that "BOM" comand (Byte Order Mark), so it cannot guess if
> it is little endian or big endian. So you could "hint" it with
> Encoding::UTF_16BE or Encoding::UTF_16LE.
>
This is the most common problem, AFAIK. If your data doesn't have a
BOM (the first bytes = 0xFFFE or 0xFEFF) then the system can't know
which UTF-16 encoding you're using.
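A minimal sketch of sniffing those first bytes yourself, assuming the
stream does start with a BOM (`utf16_encoding_from_bom` is a made-up
helper name, not a library method):

```ruby
# Return the matching UTF-16 encoding for a leading BOM, or nil if absent
def utf16_encoding_from_bom(data)
  case data.byteslice(0, 2).bytes
  when [0xFF, 0xFE] then Encoding::UTF_16LE
  when [0xFE, 0xFF] then Encoding::UTF_16BE
  end
end

bom_le = "\xFF\xFEh\x00i\x00".b
utf16_encoding_from_bom(bom_le)  # => #<Encoding:UTF-16LE>
```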

Here's a much tinier gist than Abinoam's, which does some inter-process
communications: https://gist.github.com/phluid61/40b9d5b32c9edabaae07

Cheers
James French (Guest)
on 2014-06-20 16:15
(Received via mailing list)
Thank you Abinoam and Matthew. Incredibly useful. It doesn't appear
that you can pass the encoding in up front with Open3.popen3, but I
was able to change my IO-processing code to

      chunk = io.readpartial(4096)
      chunk.encode!(Encoding::UTF_8, external_encoding) if external_encoding
      buffer << chunk

where external_encoding was Encoding::UTF_16LE, and it fixed my
problem.

Very grateful for that great help!

7stud -- (7stud)
on 2014-06-21 00:03
Abinoam Jr. wrote in post #1150267:
> Hi James French,
>
> The main source of problems I've had with encodings is a string
> whose bytes are in one encoding but which is "marked" as another.
>
> If my default encoding is utf-8, and I open a utf-16 file, Ruby will
> not "auto-detect" it as utf-16: the strings coming from that file
> will be marked as utf-8, but their byte representation will be
> utf-16.

Ah hah.  That is what I'm seeing in the example I posted above.  If I
add a line to inspect the string read in by popen3():

require 'open3'

Open3.popen3('ruby do_stuff.rb') do |stdin, stdout, stderr, wait_thr|
  data = stdout.read
  p data.encoding
  p data
end

--output:--
#<Encoding:UTF-8>
"\xFE\xFF\u0000h\u0000e\u0000l\u0000l\u0000o\u0000 \u0000w\u0000o\u0000r\u0000l\u0000d\n"

Clearly, that is not a UTF-8 encoding (ASCII characters occupy 1 byte in
UTF-8).

But where does UTF-8 come from?

$ ruby -v
ruby 1.9.3p547 (2014-05-14 revision 45962) [x86_64-darwin10.8.0]

Let's check with James Edward Gray II....

==
There's another way Strings are commonly created and that's by reading
from some IO object. It doesn't make sense to give those Strings the
source Encoding [by default US-ASCII in Ruby 1.9.3] because the
external data doesn't have to be related to your source code.

The external Encoding is the Encoding the data is in inside the IO
object.

The default external Encoding is pulled from your environment, much like
the source Encoding is for code given on the command-line. Have a look:

$ echo $LC_CTYPE
==
 http://graysoftinc.com/character-encodings/ruby-19...

I get a blank line for that last command (on a Mac), but I can do
this:

$ echo $LANG
en_US.UTF-8
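On such a system the pieces line up like this (a trivial check):

```ruby
# default_external is the label given to strings read from IO objects;
# on a machine with LANG=en_US.UTF-8 this prints UTF-8
puts Encoding.default_external.name

# source-code literals get the script encoding instead
puts "literal".encoding.name
```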


> It doesn’t appear that
> you can pass the encoding in up front with Open3.popen3

No, but after looking through the Ruby docs for a while, you can do
it `post festum`:


do_stuff.rb:

#!/usr/bin/env ruby

my_str = "hello world"
#puts my_str.encoding
#puts my_str.bytesize

x = my_str.encode('UTF-16')
#puts x.encoding
#puts x.bytesize
#p x
print x


my_prog.rb:

require 'open3'

Open3.popen3('ruby do_stuff.rb') do |stdin, stdout, stderr, wait_thr|
  puts Encoding.default_external.name
  puts "popen3_stdout external: #{stdout.external_encoding.name}"

  stdout.set_encoding 'UTF-16:UTF-8'   #<---HERE***
  #read() data as UTF-16 and convert to UTF-8

  puts "popen3_stdout external: #{stdout.external_encoding.name}"

  data = stdout.read
  puts "data says it's encoded with: #{data.encoding}"
  puts "Let's see if that's true:"
  p data
end

--output:--
UTF-8
popen3_stdout external: UTF-8
popen3_stdout external: UTF-16
data says it's encoded with: UTF-8
Let's see if that's true:
"hello world"
Matthew Kerwin (mattyk)
on 2014-06-21 03:06
(Received via mailing list)
On 21 June 2014 00:14, James French <James.French@naturalmotion.com>
wrote:

>       buffer << chunk
>
> Where external_encoding was UTF-16LE and it fixed my problem.
The only issue to watch for here is: what happens if you have a
surrogate pair split over that 4096-byte boundary? The chunk.encode!
line will probably fail. Granted, it's not likely if your output is
normal text from the BMP, but it's still something to watch for.
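One way to guard against that (a sketch, not battle-tested, and
`split_complete_utf16le` is a name I just made up) is to hold back any
trailing bytes that don't yet form a complete character and prepend
them to the next chunk:

```ruby
require 'stringio'

# Split a byte string into [complete UTF-16LE prefix, leftover bytes]
def split_complete_utf16le(bytes)
  len = bytes.bytesize - (bytes.bytesize % 2)       # drop a trailing odd byte
  if len >= 2
    unit = bytes.byteslice(len - 2, 2).unpack1('v') # last 16-bit unit, little-endian
    len -= 2 if unit.between?(0xD800, 0xDBFF)       # hold back a lone high surrogate
  end
  [bytes.byteslice(0, len), bytes.byteslice(len..).b]
end

# StringIO stands in for the pipe; tiny chunks force a split mid-character
io     = StringIO.new("a\u{1F600}b".encode(Encoding::UTF_16LE).b)
carry  = "".b
buffer = +""
until io.eof?
  chunk = carry + io.readpartial(3).b
  complete, carry = split_complete_utf16le(chunk)
  buffer << complete.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end
# buffer == "a\u{1F600}b", even though the surrogate pair straddled a read
```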


> Very grateful for that great help!

No worries.