Reading from and writing to a Unicode encoded file

Gurdipe_D · May 18, 2012, 9:48am

Hi,

I made a script to read from a Unicode encoded file and also to write
something back. The problem is that the stuff that gets written back is
turned into gibberish.

Is there any way of solving this other than manually changing the encoding
of the file to UTF-8?

thank you
regards,
seba

sebastjan_h · May 18, 2012, 10:19am

Sebastjan H. wrote in post #1061256:

Hi,

I made a script to read from a Unicode encoded file and also to write
something back. The problem is that the stuff that gets written back is
turned into gibberish.

Is there any way of solving this other than manually changing the encoding
of the file to UTF-8?

thank you
regards,
seba

I refer to this post: Array of strings - finding letter combinations - Ruby - Ruby-Forum

regards,
seba

sebastjan_h · May 18, 2012, 11:22am

Hi Sebastjan H,

You can use Ruby’s Iconv standard library (method name: conv), which
helps you convert the encoding of a string or file.

Please refer: Class: Iconv (Ruby 1.9.2)

Regards,
Vimal Raj

sebastjan_h · May 18, 2012, 12:52pm

Am 18.05.2012 11:22, schrieb Vimal Selvam:

Hi Sebastjan H,

You can use Ruby’s Iconv standard library (method name: conv), which
helps you convert the encoding of a string or file.

Iconv is deprecated and will be removed. Ruby has built-in encoding
facilities, namely String#encode.

Please refer:
Class: Iconv (Ruby 1.9.2)

Regards,
Vimal Raj

Vale,
Marvin
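
As a minimal sketch of the String#encode approach mentioned above (the sample bytes are an assumption for illustration, not taken from the original file): the bytes are first tagged with the encoding they are actually in, then transcoded to UTF-8.

```ruby
# Hypothetical sample: the bytes of "čš" as they would be stored in a
# UTF-16LE file without a byte order mark (U+010D, U+0161).
utf16_bytes = "\x0D\x01\x61\x01".b

# Tag the raw bytes with their real encoding (no bytes change here) ...
utf16_text = utf16_bytes.force_encoding("UTF-16LE")

# ... then let String#encode do the actual conversion.
utf8_text = utf16_text.encode("UTF-8")

puts utf8_text           # prints the two characters as UTF-8 text
puts utf8_text.encoding  # => UTF-8
```

No extra library is needed for this; String#encode is part of core Ruby from 1.9 onward.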

sebastjan_h · May 18, 2012, 12:14pm

Hello,
tested on windows:

open("data.txt", "rb:UTF-16LE") { |fin|
  open("odata.txt", "wb:UTF-8") { |fout|
    fout.write(fin.read())
  }
}

Regards,

sebastjan_h · May 18, 2012, 12:55pm

Regis d’Aubarede wrote in post #1061272:

Hello,
tested on windows:

open("data.txt", "rb:UTF-16LE") { |fin|
  open("odata.txt", "wb:UTF-8") { |fout|
    fout.write(fin.read())
  }
}

Regards,

thx, it works for me too. However, I wanted to include it in the script
referred to above, so I tried this modification based on your model:

file = ARGV[0]

File.open(file, "Unicode") { |fin|
  File.open(file, "wb:UTF-8") { |fout|
    fout.write(fin.read())
  }
}

and I get an error.

If possible, I want to run this file conversion prior to my other code,
but the file should be named the same and the content untouched. I know
that my version above would overwrite it.

Something like this:

1. Convert the file.
2. Reopen the file.
3. Read the content.
4. Run some code on the content.
5. Write something back to the file.

No. 1 is giving me the headache, the rest is in place:)

regards,
seba

sebastjan_h · May 18, 2012, 1:06pm

Sebastjan H. wrote in post #1061276:

File.open(file, "Unicode") { |fin|
  File.open(file, "wb:UTF-8") { |fout|
    fout.write(fin.read())
  }
}

regards,
seba

data=nil
open("data.txt", "rb:UTF-16LE") {|fin| data=fin.read() }
open("data.txt", "wb:UTF-8") { |fout| fout.write(data) } if data
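
The read has to finish before the same file is reopened for writing, because opening a file in "wb" mode truncates it immediately; a nested read/write on one path therefore finds nothing left to read. Separately, "Unicode" is not a valid mode string, which explains the error from the earlier attempt. Below is a minimal sketch of the same read-then-write idea wrapped in a helper. It assumes the input is UTF-16LE, possibly starting with a byte order mark; the name convert_to_utf8! and the throwaway demo file are illustrative only.

```ruby
require "tmpdir"

# Hypothetical helper: rewrite a UTF-16LE file in place as UTF-8.
def convert_to_utf8!(path)
  # Step 1: read everything; the block closes the file when it ends.
  text = File.open(path, "rb:UTF-16LE") { |fin| fin.read }

  # Step 2: transcode, dropping a leading byte order mark if present.
  utf8 = text.encode("UTF-8").delete_prefix("\uFEFF")

  # Step 3: only now reopen the same path for writing (this truncates it).
  File.open(path, "wb:UTF-8") { |fout| fout.write(utf8) }
  utf8
end

# Demo on a throwaway file so that no real data is touched.
Dir.mktmpdir do |dir|
  path = File.join(dir, "data.txt")
  File.binwrite(path, "\uFEFFa\t\u010D\r\n".encode("UTF-16LE"))
  convert_to_utf8!(path)
  puts File.binread(path).force_encoding("UTF-8").inspect
end
```

With a real script, the path would come from ARGV[0] instead of the temporary directory.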

sebastjan_h · May 18, 2012, 1:30pm

Regis d’Aubarede wrote in post #1061277:

Sebastjan H. wrote in post #1061276:

File.open(file, "Unicode") { |fin|
  File.open(file, "wb:UTF-8") { |fout|
    fout.write(fin.read())
  }
}

regards,
seba

data=nil
open("data.txt", "rb:UTF-16LE") {|fin| data=fin.read() }
open("data.txt", "wb:UTF-8") { |fout| fout.write(data) } if data

Thank you very much, works like a charm. I’ve replaced the actual
filename with a variable, so I can use ARGV.

kind regards,
seba

sebastjan_h · May 21, 2012, 12:41pm

Hi,

something goes wrong with this block of code when I use Shoes to package
an app:

data=nil
open("data.txt", "rb:UTF-16LE") {|fin| data=fin.read() }
open("data.txt", "wb:UTF-8") { |fout| fout.write(data) } if data

I’ve attached the entire script. If I comment out this block, the
gibberish appears in the file again. However, if I let this block of
code run, the content is deleted and the alert is not displayed.

Could someone take a look at where I went wrong? I am still learning Shoes
too:)

kind regards,
Seba

sebastjan_h · May 21, 2012, 2:28pm

Hi Sebastjan,

What platform and Shoes are you using?

I downloaded your code (dup_app.rb) and replaced the following two
lines.
Then it worked with Shoes 3 (0.r1514) on my Windows 7.

 open(file, "r") {|fin| data=fin.read() }
 open(file, "w:UTF-8")  { |fout| fout.write(data) } if data

ashbb

sebastjan_h · May 21, 2012, 1:55pm

Sebastjan H. wrote in post #1061483:

Hi,

something goes wrong with this block of code when I use Shoes to package
an app:

data=nil
open("data.txt", "rb:UTF-16LE") {|fin| data=fin.read() }
open("data.txt", "wb:UTF-8") { |fout| fout.write(data) } if data

I’ve attached the entire script. If I comment out this block, the
gibberish appears in the file again. However, if I let this block of
code run, the content is deleted and the alert is not displayed.

Could someone take a look at where I went wrong? I am still learning Shoes
too:)

kind regards,
Seba

One more note: the same code as attached (without the Shoes elements)
runs OK as a plain script from the command line.

sebastjan_h · May 21, 2012, 4:32pm

Hi Sebastjan,

Ah, sorry. Try out the following again:

  data=nil
  #open(file, "r") {|fin| data=fin.read() }
  data = IO.read(file).force_encoding("UTF-8")
  open(file, "w:UTF-8")  { |fout| fout.write(data) } if data

ashbb
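
One caveat with the snippet above, in case the chosen file really is UTF-16: force_encoding only changes the label on the string and leaves the bytes alone, so UTF-16 data relabelled as UTF-8 is written back as the same UTF-16 bytes. A small sketch of the difference, using made-up sample text rather than the real file:

```ruby
expected = "a\u010D"                  # "a" followed by "č"
raw = expected.encode("UTF-16LE").b   # hypothetical bytes as stored on disk

# Relabelling: same bytes, new label, still not the intended text.
relabelled = raw.dup.force_encoding("UTF-8")
puts relabelled.bytes == raw.bytes    # => true
puts relabelled == expected           # => false

# Transcoding: tag with the real encoding first, then convert.
converted = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")
puts converted == expected            # => true
```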

sebastjan_h · May 21, 2012, 2:45pm

ashbb shoeser wrote in post #1061501:

Hi Sebastjan,

What platform and Shoes are you using?

I downloaded your code (dup_app.rb) and replaced the following two
lines.
Then it worked with Shoes 3 (0.r1514) on my Windows 7.
 open(file, "r") {|fin| data=fin.read() }
 open(file, "w:UTF-8")  { |fout| fout.write(data) } if data
ashbb

Hi ashbb,

I’m using Shoes 3 on Win7. I’ve just tried the above two lines and the
result is still the same:

the content of the chosen file is removed
the encoding is changed to ANSI
the alert box is not displayed

seba

sebastjan_h · May 21, 2012, 11:40pm

Hi Sebastjan,

Now the process runs through
Good!

the stuff that is written in the file is mixed with
the legacy content
Me too.
But your code re-opens the file in ‘a+’ mode.
So, I think this is normal behavior.

the new content is again gibberish.
Ah,… what does that mean?

I got a file with the following mixed in:

Here are the unused characters:
“&a”, “&b”, “&c”, “&d”, “&e”, …

Do you mean that this is gibberish?

Sorry, I don’t correctly understand what you want to do.

ashbb

sebastjan_h · May 21, 2012, 4:57pm

ashbb shoeser wrote in post #1061510:

Hi Sebastjan,

Ah, sorry. Try out the following again:

  data=nil
  #open(file, "r") {|fin| data=fin.read() }
  data = IO.read(file).force_encoding("UTF-8")
  open(file, "w:UTF-8")  { |fout| fout.write(data) } if data

ashbb

hi,

it’s still not working. Now the process runs through, but the stuff that
is written in the file is mixed with the legacy content and the new
content is again gibberish.

I am afraid I can’t pinpoint the issue being a complete beginner. It
works fine without Shoes.

Does the code work for you?

regards,
seba

sebastjan_h · May 22, 2012, 10:15am

ashbb shoeser wrote in post #1061553:

Hi Sebastjan,

Now the process runs through
Good!

the stuff that is written in the file is mixed with
the legacy content
Me too.
But your code re-opens the file in ‘a+’ mode.
So, I think this is normal behavior.

the new content is again gibberish.
Ah,… what does that mean?

Actually, the new content is not even written to the file, but the file
is still encoded as Unicode, so some special characters in my language (č
and š) are not displayed correctly. For example, “č” is printed out like
栀攀挀甀爀爀攀渀琀昀漀爀洀甀氀愀⸀ऀ匀欀爀
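
Characters like these have code points of the form U+XX00, which is what ordinary ASCII text looks like when UTF-16 data is read with the two bytes of each character swapped, or shifted by one byte. A sketch that undoes the swap for two code points, U+6800 and U+6500, which are assumed to match the first two characters of the sample above:

```ruby
# Assumption: these two code points correspond to the first two
# characters of the garbled sample quoted above.
garbled = "\u6800\u6500"

# Serialise with the high byte first, then reinterpret the same bytes
# with the low byte first. This swaps the two bytes of every character.
swapped = garbled.encode("UTF-16BE").force_encoding("UTF-16LE")

puts swapped.encode("UTF-8")   # => he
```

If the rest of the sample decodes to readable text the same way, the data itself is intact and only the byte alignment or byte order is off.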

I got the file mixed the following:

Here are the unused characters:
“&a”, “&b”, “&c”, “&d”, “&e”, …

If this is on the end of your file, then this is correct. I don’t get
any of the added content written anywhere in the file.

Do you mean that this is gibberish?

Gibberish: 甀爀爀攀渀琀昀漀爀洀甀氀愀⸀ऀ匀欀爀

Sorry, I don’t understand what you want to do correctly.

Input file: two column tab delimited and Unicode encoded
Replace the first column with “”
Run the rest of the code (finding duplicates, used and unused
characters)
Write the unused characters to the input file

I’ve attached the code which is compiled as *shy app again.

I know the main issue is, that my input file is Unicode encoded, but I
get that from another program that supports only Unicode.

Thank you for your patience:)

Two more notes:

the *shy app is about 420 MB in size. Is that normal?
the *shy app takes quite some time to load. Is that normal?

regards,
seba

sebastjan_h · May 22, 2012, 8:06pm

ashbb shoeser wrote in post #1061656:

Hi Sebastjan,

I know the main issue is, that my input file is Unicode encoded
Oh, I see.
Why not use nkf?

Try out the following:

require 'nkf'
Shoes.app do
  extend NKF
  file = ask_open_file
  data = IO.read file
  para nkf('-W16w', data)
end

I’ve tried incorporating this into my script, but I guess my knowledge
isn’t sufficient:)

Furthermore, I’ve tried replicating the whole thing on Ubuntu and the
app is not as large and it loads extremely fast. However, the code still
doesn’t run properly with Shoes. And the files are automatically
converted to UTF-8 as soon as they are stored on Ubuntu, so I can’t
really replicate anything:(

kind regards,
seba

sebastjan_h · May 22, 2012, 3:16pm

Hi Sebastjan,

I know the main issue is, that my input file is Unicode encoded
Oh, I see.
Why not use nkf?

Try out the following:

require 'nkf'
Shoes.app do
  extend NKF
  file = ask_open_file
  data = IO.read file
  para nkf('-W16w', data)
end

In my case with Shoes 3 (0.r1514), I can see some special characters in
your language (č and š) on the Shoes window.

Two more notes:

the *shy app is about 420 MB in size. Is that normal?

the *shy app takes quite some time to load. Is that normal?
Umm,… I’m not sure,… but I don’t think they are normal…

ashbb

sebastjan_h · May 23, 2012, 12:20am

Hi Sebastjan,

Umm,…
Can you move to Shoes-ML (http://librelist.com/browser/shoes/)?
You’ll get help from other Shoesers.

ashbb
