Reading from and writing to a Unicode encoded file

Hi,

I wrote a script that reads from a Unicode-encoded file and also writes
something back. The problem is that whatever gets written back turns
into gibberish.

Is there any way of solving this other than manually changing the
encoding of the file to UTF-8?

thank you
regards,
seba

I refer to this post: http://www.ruby-forum.com/topic/4191662#new

regards,
seba

Hi Sebastjan H,

You can use Ruby’s Iconv standard library (method name: conv), which
helps you convert the encoding of a string or file.

Please refer:
http://ruby-doc.org/stdlib-1.9.2/libdoc/iconv/rdoc/Iconv.html#method-c-conv

Regards,
Vimal Raj

On 18.05.2012 at 11:22, Vimal Selvam wrote:

Hi Sebastjan H,

You can use Iconv standard library of Ruby (Method Name: conv) which
help you to convert the unicode of the string or file.

Iconv is deprecated and will be removed. Ruby has built-in encoding
facilities, namely String#encode.
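
A minimal sketch of the String#encode route, assuming the file’s “Unicode” is UTF-16LE (the usual meaning on Windows). The helper name utf16le_to_utf8 is made up for this illustration:

```ruby
# Hypothetical helper: transcode raw UTF-16LE bytes to a UTF-8 string
# with Ruby's built-in String#encode instead of Iconv.
def utf16le_to_utf8(raw_bytes)
  # dup so the caller's string is not relabeled in place
  raw_bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
end

# Simulate bytes as they would come from a UTF-16LE file on disk.
sample = "\u010D and \u0161".encode("UTF-16LE").b
puts utf16le_to_utf8(sample)
```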

Vale,
Marvin

Hello,
Tested on Windows:

open("data.txt", "rb:UTF-16LE") { |fin|
  open("odata.txt", "wb:UTF-8") { |fout|
    fout.write(fin.read)
  }
}

Regards,

Regis d’Aubarede wrote in post #1061272:

Hello,
tested on windows:

open("data.txt", "rb:UTF-16LE") { |fin|
  open("odata.txt", "wb:UTF-8") { |fout|
    fout.write(fin.read)
  }
}

Regards,

Thanks, it works for me too. However, I wanted to include it in the script
referred to above, so I tried this modification based on your model:

file = ARGV[0]

File.open(file, "Unicode") { |fin|
  File.open(file, "wb:UTF-8") { |fout|
    fout.write(fin.read)
  }
}

and I get an error.

If possible, I want to run this file conversion prior to my other code,
but the file should be named the same and the content untouched. I know
that my version above would overwrite it.

Something like this:

  1. Convert the file.
  2. Reopen the file.
  3. Read the content.
  4. Run some code on the content.
  5. Write something back to the file.

No. 1 is giving me the headache, the rest is in place:)
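
The five steps above could be sketched roughly as follows. This assumes the input is UTF-16LE; convert_in_place and process_file are hypothetical names, and the line count merely stands in for the real processing in step 4:

```ruby
# Step 1: convert the file from UTF-16LE to UTF-8 under the same name.
def convert_in_place(path)
  # Read everything first; the read handle is closed before the
  # file is reopened (and truncated) for writing.
  text = File.open(path, "rb:UTF-16LE") { |f| f.read }
  utf8 = text.encode("UTF-8")
  File.open(path, "wb:UTF-8") { |f| f.write(utf8) }
  utf8
end

def process_file(path)
  content = convert_in_place(path)             # steps 1 to 3
  summary = "lines: #{content.lines.count}\n"  # step 4 (placeholder)
  # Step 5: append the result to the now UTF-8 file.
  File.open(path, "ab:UTF-8") { |f| f.write(summary) }
end
```

Called as process_file(ARGV[0]), this reads the whole file before reopening it for writing, which is why the content survives the conversion. A byte order mark at the start of the file, if present, is carried over into the UTF-8 output.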

regards,
seba

Sebastjan H. wrote in post #1061276:

File.open(file, "Unicode") { |fin|
  File.open(file, "wb:UTF-8") { |fout|
    fout.write(fin.read)
  }
}

regards,
seba

data = nil
# Read the whole file first, then reopen the same file for writing.
open("data.txt", "rb:UTF-16LE") { |fin| data = fin.read }
open("data.txt", "wb:UTF-8") { |fout| fout.write(data) } if data

Thank you very much, works like a charm. I’ve replaced the actual
filename with a variable, so I can use ARGV.

kind regards,
seba

Hi,

Something goes wrong with this block of code when I use Shoes to package
an app:

data = nil
open("data.txt", "rb:UTF-16LE") { |fin| data = fin.read }
open("data.txt", "wb:UTF-8") { |fout| fout.write(data) } if data

I’ve attached the entire script. If I comment out this block, the
gibberish appears in the file again. However, if I let this block run,
the content is deleted and the alert is not displayed.

Could someone take a look at where I went wrong? I am still learning Shoes
too :)

kind regards,
Seba

Hi Sebastjan,

What platform and Shoes are you using?

I downloaded your code (dup_app.rb) and replaced the following two
lines.
Then it worked with Shoes 3 (0.r1514) on my Windows 7.

 open(file, "r") {|fin| data=fin.read() }
 open(file, "w:UTF-8")  { |fout| fout.write(data) } if data

ashbb

One more note on my post #1061483: the same code as attached (without the
Shoes elements) runs fine as a plain script from the command line.

Hi Sebastjan,

Ah, sorry. Try out the following again:

  data=nil
  #open(file, "r") {|fin| data=fin.read() }
  data = IO.read(file).force_encoding("UTF-8")
  open(file, "w:UTF-8")  { |fout| fout.write(data) } if data
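
For comparison, here is a small sketch of the difference between force_encoding and encode, assuming the bytes on disk are UTF-16LE. force_encoding only relabels the bytes, while encode actually converts them; the helper names are hypothetical:

```ruby
# Relabel only: the bytes stay exactly as they were on disk.
def relabel_as_utf8(raw_bytes)
  raw_bytes.dup.force_encoding("UTF-8")
end

# Convert: declare what the bytes really are, then transcode.
def transcode_to_utf8(raw_bytes)
  raw_bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
end

raw = "\u010D".encode("UTF-16LE").b   # "č" as two UTF-16LE bytes

p relabel_as_utf8(raw).bytes     # => [13, 1]    (unchanged, not "č")
p transcode_to_utf8(raw).bytes   # => [196, 141] (UTF-8 for "č")
```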

ashbb

Hi ashbb,

I’m using Shoes 3 on Win7. I’ve just tried the above two lines and the
result is still the same:

  • the content of the chosen file is removed
  • the encoding is changed to ANSI
  • the alert box is not displayed

seba

Hi Sebastjan,

> Now the process runs through

Good!

> the stuff that is written in the file is mixed with
> the legacy content

Me too.
But your code reopens the file in ‘a+’ mode,
so I think this is normal behavior.

> the new content is again gibberish.

Ah, … what does that mean?

My file ended up with the following mixed in:

Here are the unused characters:
“&a”, “&b”, “&c”, “&d”, “&e”, …

Do you mean that this is the gibberish?

Sorry, I don’t quite understand what you want to do.

ashbb

Hi,

it’s still not working. The process now runs through, but what is
written to the file is mixed with the legacy content, and the new
content is gibberish again.

I am afraid I can’t pinpoint the issue, being a complete beginner. It
works fine without Shoes.

Does the code work for you?

regards,
seba

ashbb shoeser wrote in post #1061553:

> Hi Sebastjan,
>
>> Now the process runs through
>
> Good!
>
>> the stuff that is written in the file is mixed with
>> the legacy content
>
> Me too.
> But your code re-open the file with ‘a+’ mode.
> So, I think this is a normal behavior.
>
>> the new content is again jibberish.
>
> Ah,… what does that mean?

Actually, the new content is not even written to the file, but the file
is still encoded as Unicode, so some special characters in my language (č
and š) are not displayed correctly. For example, “č” is printed out like
栀攀 挀甀爀爀攀渀琀 昀漀爀洀甀氀愀⸀ऀ匀欀爀
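
One way text like this can arise (an assumption, the thread does not confirm the cause) is a one-byte misalignment in a UTF-16LE file, for example after single-byte text has been written into it. A sketch, with misaligned_view as a hypothetical helper:

```ruby
# Hypothetical helper: show how ASCII text stored as UTF-16LE looks
# when decoding starts one byte too late.
def misaligned_view(text)
  bytes = text.encode("UTF-16LE").b
  # Skip the first byte and drop the last one so the length stays even.
  shifted = bytes.byteslice(1, bytes.bytesize - 2)
  shifted.force_encoding("UTF-16LE").encode("UTF-8")
end

puts misaligned_view("The current")   # prints CJK-looking characters
```

For “The”, the result is U+6800 U+6500, the same two characters the sample above begins with.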

> I got the file mixed the following:
>
> Here are the unused characters:
> “&a”, “&b”, “&c”, “&d”, “&e”, …

If this is at the end of your file, then this is correct. I don’t get
any of the added content written anywhere in the file.

> Do you mean that this is jibberish?

Jibberish: 甀爀爀攀渀琀 昀漀爀洀甀氀愀⸀ऀ匀欀爀

> Sorry, I don’t understand what you want to do correctly.

  1. Input file: two-column, tab-delimited, and Unicode encoded
  2. Replace the first column with “”
  3. Run the rest of the code (finding duplicates, used and unused
    characters)
  4. Write the unused characters to the input file

I’ve attached the code, which is compiled as a *shy app again.

I know the main issue is, that my input file is Unicode encoded, but I
get that from another program that supports only Unicode.
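
If the file has to stay in its original encoding for the other program, one option is to append in that encoding instead of converting the file. A sketch, assuming UTF-16LE and a hypothetical helper append_utf16le:

```ruby
# Hypothetical helper: append text to a UTF-16LE file without changing
# the file's encoding.
def append_utf16le(path, text)
  # "ab" = append, binary: no newline translation, bytes written as-is.
  File.open(path, "ab") { |f| f.write(text.encode("UTF-16LE")) }
end
```

For example, append_utf16le(file, "Here are the unused characters:\n") would add a line that is stored in the same encoding as the rest of the file.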

Thank you for your patience:)

Two more notes:

  • the *shy app is about 420 MB in size. Is that normal?
  • the *shy app takes quite some time to load. Is that normal?

regards,
seba

ashbb shoeser wrote in post #1061656:

> Hi Sebastjan,
>
>> I know the main issue is, that my input file is Unicode encoded
>
> Oh, I see.
> Why not using nkf?
>
> Try out the following:
>
>   require 'nkf'
>   Shoes.app do
>     extend NKF
>     file = ask_open_file
>     data = IO.read file
>     para nkf('-W16w', data)
>   end

I’ve tried incorporating this into my script, but I guess my knowledge
isn’t sufficient :)

Furthermore, I’ve tried replicating the whole thing on Ubuntu, and there
the app is not as large and loads extremely fast. However, the code still
doesn’t run properly with Shoes. And the files are automatically
converted to UTF-8 as soon as they are stored on Ubuntu, so I can’t
really replicate anything :(

kind regards,
seba

Hi Sebastjan,

> I know the main issue is, that my input file is Unicode encoded

Oh, I see.
Why not use nkf?

Try out the following:

  require 'nkf'
  Shoes.app do
    extend NKF
    file = ask_open_file
    data = IO.read file
    # -W16: input is UTF-16, -w: output UTF-8
    para nkf('-W16w', data)
  end

In my case with Shoes 3 (0.r1514), I can see some special characters in
your language (č and š) on the Shoes window.

> Two more notes:
>
>   • the *shy app is about 420 MB in size. Is that normal?
>   • the *shy app takes quite some time to load. Is that normal?

Umm, … I’m not sure, … but I don’t think they are normal…

ashbb

Hi Sebastjan,

Umm, …
Could you move to the Shoes mailing list (http://librelist.com/browser/shoes/)?
You’ll get help from other Shoesers there.

ashbb