Ascii representation of unicode string?

darren_kirby · June 22, 2006, 10:35pm

Hello all.

I am unpacking some unicode strings from a binary file. I have a string like:

"W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"

and I need to turn it into:

"WM/TrackNumber"

When I 'puts' the string it prints fine but I need to assign it to a variable,
and when I try something like:

require 'jcode'
$KCODE = 'UTF8'
s.each_char { |ch| print ch }

it will print each char but return the original unicode string. And when I try:

n = ""
s.each_char { |ch| n += ch }

The entire unicode char is being added to n.

So can I extract an ascii representation of this string? I will admit I don't
know the first thing about unicode and I may be totally lost here…

Thanks for consideration.

-d

darren_kirby · June 22, 2006, 11:00pm

On 6/23/06, darren kirby wrote:

When I ‘puts’ the string it prints fine but I need to assign it to a variable,
s.each_char { |ch| n += ch }

The entire unicode char is being added to n.

So can I extract an ascii representation of this string? I will admit I don’t
know the first thing about unicode and I may be totally lost here…

Thanks for consideration.

That's not UTF-8, that's UTF-16 little endian without a BOM. If you
know the string is pure ASCII that just happens to be UTF-16 encoded,
you can simply do s.gsub(/\000/,''), but this will break any
non-7-bit characters.
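
A minimal sketch of that NUL-stripping shortcut, assuming a current Ruby (2.0 or later, for `String#b`) rather than the 1.8 interpreter used in this thread. The variable names and the "café" sample are made up for illustration. The second case shows what goes wrong once a character outside 7-bit ASCII appears:

```ruby
# UTF-16LE bytes for "WM/TrackNumber": every second byte is NUL.
ascii_only = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000".b

# Stripping the NUL bytes leaves plain ASCII.
stripped = ascii_only.gsub(/\000/, '')
puts stripped

# "café" in UTF-16LE: the é is stored as the two bytes E9 00.
with_accent = "c\000a\000f\000\xE9\000".b
broken = with_accent.gsub(/\000/, '')

# The lone byte 0xE9 (233) is left behind, which is neither ASCII nor valid UTF-8.
p broken.bytes
```

So the shortcut is only safe when every character in the input is known to be ASCII.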

darren_kirby · June 22, 2006, 11:12pm

darren kirby wrote:

Hello all.

I am unpacking some unicode strings from a binary file. I have a string
like:

"W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"

and I need to turn it into:

"WM/TrackNumber"

could try:

myAscii = s.unpack('U'*s.length).select{|x| x >46}.collect{|x| x.chr}.to_s

darren_kirby · June 22, 2006, 11:34pm

quoth the Chris H.:

"WM/TrackNumber"

could try:

myAscii = s.unpack('U'*s.length).select{|x| x >46}.collect{|x| x.chr}.to_s

Hello,

This is fairly close to something I was trying earlier, except more elegant.
This seems to be munging my spaces though, as I take it the space code is
less than 46? Have to look for a chart.

Is unpacking as UTF8 going to be problematic if the string is actually UTF16
as Philip points out?

I will play some more, thanks guys…
-d

darren_kirby · June 22, 2006, 11:44pm

quoth the Phillip H.:

Spaces are 32 if I recall my ASCII. UTF-16 with no non-ASCII
characters is essentially an ASCII string with NULLs every other byte,
it’s quite obvious. UTF-8 with only ASCII just looks like ASCII.

Thanks Philip,

I changed the 46 to 32 in Chris' code and it seems to be working fine for my
test files here. Will have to do more testing to see if it will be a suitable
permanent solution…

Thanks again guys,
-d
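
A rough sketch of the filter with the threshold moved down to the space character, assuming a current Ruby (2.0 or later). It differs from Chris' line in three ways that are my own choices: it reads raw bytes with 'C*' instead of 'U' because the data is not UTF-8, it uses `>= 32` because `x > 32` would still drop the space itself, and it rebuilds the string with `pack` because `Array#to_s` no longer concatenates on Ruby 1.9 and later:

```ruby
s = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003".b

# Keep only printable ASCII bytes (space through tilde); NULs and the
# trailing \003 fall outside that range and are discarded.
printable = s.unpack('C*').select { |x| x >= 32 && x <= 126 }.pack('C*')
puts printable

# A made-up UTF-16LE sample with a space in it, to show the space survives.
spaced = "A\000 \000B\000".b
kept = spaced.unpack('C*').select { |x| x >= 32 && x <= 126 }.pack('C*')
puts kept
```

Like the NUL-stripping approach, this throws away anything that is not plain ASCII, so it only suits fields that are known to be ASCII.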

darren_kirby · June 23, 2006, 3:22am

Hi,

At Fri, 23 Jun 2006 06:12:21 +0900,
Chris H. wrote in [ruby-talk:198599]:

"WM/TrackNumber"

could try:

myAscii = s.unpack('U'*s.length).select{|x| x >46}.collect{|x| x.chr}.to_s

Shorter:

s.unpack("v*").pack("U*") #=> "WM/TrackNumber\000"

Or to keep trailing odd byte,

(s+"\0").unpack("v*").pack("U*") #=> "WM/TrackNumber\000\003"

Note this isn’t aware of surrogate pairs.
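
A small sketch of the unpack/pack round trip, assuming a current Ruby (2.0 or later) and with the trailing NUL removed afterwards. The "café" sample and the variable names are invented for illustration. Unlike stripping NUL bytes, this keeps non-ASCII characters from the Basic Multilingual Plane intact:

```ruby
s = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003".b

# "v*" reads unsigned 16-bit little-endian integers; the odd trailing
# byte (\003) does not make up a full unit and is ignored.
units = s.unpack("v*")

# "U*" writes each integer back out as a UTF-8 encoded character.
utf8 = units.pack("U*")

# Drop the NUL terminator that came along from the file.
name = utf8.delete("\0")
puts name

# "café" in UTF-16LE: the é (U+00E9) survives the round trip as UTF-8.
accented = "c\000a\000f\000\xE9\000".b.unpack("v*").pack("U*")
puts accented
```

Characters above U+FFFF are stored in UTF-16 as two 16-bit units, and this approach converts each unit separately, which is the surrogate-pair limitation noted above.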

darren_kirby · June 23, 2006, 1:34pm

On 22/06/06, darren kirby wrote:

I am unpacking some unicode strings from a binary file. I have a string like:

"W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"

and I need to turn it into:

"WM/TrackNumber"
…
So can I extract an ascii representation of this string? I will admit I don’t
know the first thing about unicode and I may be totally lost here…

Here’s a reliable way to do it with Iconv:

require 'iconv'
s = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"
ic = Iconv.new("US-ASCII//IGNORE", "UTF-16LE")
p (ic.iconv(s+' '))[0..-2] # => "WM/TrackNumber"

Paul.

darren_kirby · June 22, 2006, 11:37pm

less than 46? Have to look for a chart.

Is unpacking as UTF8 going to be problematic if the string is actually UTF16
as Philip points out?

I will play some more, thanks guys…
-d

Spaces are 32 if I recall my ASCII. UTF-16 with no non-ASCII
characters is essentially an ASCII string with NULLs every other byte,
it’s quite obvious. UTF-8 with only ASCII just looks like ASCII.

darren_kirby · June 23, 2006, 4:24pm

quoth the Paul B.:

p (ic.iconv(s+' '))[0..-2] # => "WM/TrackNumber"

Hi Paul,

This seems to be working quite nicely, after playing around for a bit. A few
of my test files were throwing "Iconv::InvalidCharacter" errors on some
strings, but when I change the "(s+' ')" to "(s)" it works fine. Then, of
course, the strings that originally worked start throwing the error. So, I do
this:

begin
  textString = @ic.iconv(data+' ')[0..-2]
rescue
  textString = @ic.iconv(data)[0..-2]
end

Yessir, I am really just mashing code together until I see the results I am
looking for…

I wonder though, the docs lead me to believe the iconv library is UNIX only.
Is this true? I really need a cross-platform solution, but don't have a win32
box to try on…

Thanks very much,

Paul.

-d

darren_kirby · June 23, 2006, 5:38pm

On 22-jun-2006, at 22:33, darren kirby wrote:

Hello all.

I am unpacking some unicode strings from a binary file. I have a
string like:

"W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"

What you have looks like UTF-16. You are best off just pushing it through
Iconv and converting it to UTF-8.

darren_kirby · June 24, 2006, 12:17am

On 23/06/06, darren kirby wrote:

This seems to be working quite nicely, after playing around for a bit. A few
of my test files were throwing "Iconv::InvalidCharacter" errors on some
strings, but when I change the "(s+' ')" to "(s)" it works fine. Then, of
course, the strings that originally worked start throwing the error. So, I do
this:

Sorry, I translated that code from somewhere else, but forgot that
UTF-16 needs an even number of bytes. The fact that it worked as
advertised was serendipity rather than good judgement! The trouble
with Iconv’s //IGNORE flag is that it doesn’t ignore trailing errors;
you can get around this by adding a valid codepoint at the end, and
removing it after conversion. Adding a valid byte <128 gets around
this for UTF-8 input, but only worked for your example as it had an
odd number of input bytes. For UTF-16 (LE or BE) without surrogates,
this will work:

t = ic.iconv(s[0,s.length/2*2])

although a more general solution that should also handle surrogates is
this:

t = ic.iconv(s[0,s.length/2*2]+"\000\000")[0..-2]

Finally, your input string has a trailing null; a regexp-based
solution is probably the most reliable way to remove this:

t.sub!(/\x00$/, '')
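
Iconv was removed from Ruby's standard library in 2.0, so for readers on a current interpreter here is a rough sketch of the same three steps (truncate to an even number of bytes, convert from UTF-16LE while dropping whatever ASCII cannot represent, strip the trailing NUL) using `String#encode` instead. This is a swapped-in technique, not the Iconv code from this thread, and `utf16le_to_ascii` is a made-up helper name:

```ruby
# Hypothetical helper, not from the thread: convert UTF-16LE bytes to ASCII.
def utf16le_to_ascii(bytes)
  # UTF-16 needs an even number of bytes, so drop a trailing odd byte.
  even = bytes.byteslice(0, bytes.bytesize / 2 * 2)
  text = even.force_encoding("UTF-16LE")

  # Drop invalid sequences and characters US-ASCII cannot represent,
  # similar in spirit to Iconv's //IGNORE.
  ascii = text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")

  # Remove a single trailing NUL terminator, if present.
  ascii.sub(/\x00\z/, "")
end

s = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003".b
track = utf16le_to_ascii(s)
puts track
```

The sketch anchors the final pattern with `\z` (end of string) rather than `$` (end of line), so a NUL sitting just before an embedded newline is left alone.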

I wonder though, the docs lead me to believe the iconv library is UNIX only.
Is this true? I really need a cross-platform solution, but don’t have a win32
box to try on…

It’s definitely possible to use iconv on Windows, but it wasn’t in the
one-click installer until 1.8.4, I believe.

Paul.

darren_kirby · June 24, 2006, 10:37pm

quoth the Paul B.:

t = ic.iconv(s[0,s.length/2*2])

although a more general solution that should also handle surrogates is
this:

t = ic.iconv(s[0,s.length/2*2]+"\000\000")[0..-2]

This ^^^ is working perfectly for all my test files now…thank you.

one-click installer until 1.8.4, I believe.

Ok, good. I can live with that.

Thank you very much for the help,

Paul.
-d