File.new and encoding


#1

Hi,

I’m still quite new to Ruby, but I have written a simple code generator.
The generator opens some files and combines them into a new one. The
resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
a UTF-8 marker at the beginning of the file. Is that possible?

How can I tell Ruby which encoding to use when I write to text files?

Any pointers to documentation are welcome; I didn’t find anything
useful using Google.

regards,
Achim


#2

Achim D. (SyynX Solutions GmbH) wrote:

Hi,

I’m still quite new to Ruby, but I have written a simple code generator.
The generator opens some files and combines them into a new one. The
resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
a UTF-8 marker at the beginning of the file. Is that possible?

What’s a UTF-8 marker? I only know of the two-byte UTF-16 marker, but
AFAIK there is no marker for UTF-8. Did I miss something?

How can I tell Ruby which encoding to use when I write to text files?

Any pointers to documentation are welcome; I didn’t find anything
useful using Google.

Encoding is not an easy issue with Ruby. I guess that by default it uses
the default encoding of your environment, but you can specify certain
(Japanese) encodings with the command line option -K. HTH

Kind regards

robert
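
On the question of choosing the encoding when writing: Ruby 1.9 and later (which postdate this thread) accept an encoding in the mode string passed to File.open. The following is only a minimal sketch under that assumption; the file name and the sample text are made up for illustration.

```ruby
require "tmpdir"

# Hypothetical output file, created in a temporary directory.
path = File.join(Dir.mktmpdir, "target.txt")

# "w:ISO-8859-1" sets the external encoding of the file. The UTF-8
# string below is transcoded to ISO-8859-1 as it is written.
File.open(path, "w:ISO-8859-1") do |target|
  target.write("Gr\u00FC\u00DFe")
end

# Read the raw bytes back to see what actually landed on disk.
bytes = File.binread(path).bytes
```

Each of the two non-ASCII characters ends up as a single byte, which is what one would expect from ISO-8859-1 output, and no BOM is written.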

#3

Hi,

At Wed, 30 Nov 2005 00:17:29 +0900,
Robert K. wrote in [ruby-talk:167988]:

I’m still quite new to Ruby, but I have written a simple code generator.
The generator opens some files and combines them into a new one. The
resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
a UTF-8 marker at the beginning of the file. Is that possible?

What’s a UTF-8 marker? I only know of the two-byte UTF-16 marker, but
AFAIK there is no marker for UTF-8. Did I miss something?

It would be a UTF-8 encoded BOM, but Ruby itself never writes it
automatically.

How can I tell Ruby which encoding to use when I write to text files?

Can’t you show the code?
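
For reference, the UTF-8 encoded BOM mentioned above is U+FEFF, which UTF-8 encodes as the three bytes EF BB BF. A quick way to check whether a file starts with it might look like the sketch below; the helper name and the sample files are hypothetical and only serve as illustration.

```ruby
require "tmpdir"

# Hypothetical helper: true if the file at `path` starts with the
# UTF-8 encoded BOM (bytes EF BB BF).
def starts_with_utf8_bom?(path)
  File.binread(path, 3) == "\xEF\xBB\xBF".b
end

# Two sample files, one with a leading BOM and one without.
dir = Dir.mktmpdir
with_bom_path = File.join(dir, "with_bom.txt")
plain_path = File.join(dir, "plain.txt")
File.binwrite(with_bom_path, "\xEF\xBB\xBFhello\n")
File.binwrite(plain_path, "hello\n")

with_bom = starts_with_utf8_bom?(with_bom_path)
without_bom = starts_with_utf8_bom?(plain_path)
```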


#4

removed_email_address@domain.invalid wrote:

It would be a UTF-8 encoded BOM, but Ruby itself never writes it
automatically.
[…]
Can’t you show the code?

While trying to reproduce the problem in a smaller example, I figured
out that I’m reading the BOM from one of my source files. Sorry for the
confusion. I’m doing something like:

File.open("target", "w") do |target|
  File.open("source", "r") do |source|
    source.each_line do |line|
      # … some processing …
      target.write(line)
    end
  end
end

The source file seems to contain the BOM, and it is written to the
target. Any hint on how to strip the BOM?

regards,
Achim
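
One possible way to strip the BOM, assuming Ruby 1.9.2 or later (newer than this thread) and a UTF-8 source file: the "BOM|UTF-8" prefix in the open mode makes Ruby skip a leading UTF-8 BOM if one is present. This is only a sketch; the file names and sample content are made up.

```ruby
require "tmpdir"

dir = Dir.mktmpdir
source_path = File.join(dir, "source") # hypothetical input file
target_path = File.join(dir, "target") # hypothetical output file

# Sample input: a UTF-8 file that starts with a BOM (EF BB BF).
File.binwrite(source_path, "\xEF\xBB\xBFfirst line\nsecond line\n")

File.open(target_path, "w") do |target|
  # "r:BOM|UTF-8" reads the file as UTF-8 and drops a leading BOM.
  File.open(source_path, "r:BOM|UTF-8") do |source|
    source.each_line do |line|
      # ... some processing ...
      target.write(line)
    end
  end
end

# Raw bytes of the result, to confirm the BOM is gone.
result = File.binread(target_path)
```

If the source has no BOM, the same mode string reads it unchanged, so the loop does not need a special case.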


#5

I’m doing something like:

File.open("target", "w") do |target|
  File.open("source", "r") do |source|
    source.each_line do |line|
      # … some processing …
      target.write(line)
    end
  end
end

Have you looked at ‘iconv’ in the standard library?

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html

Assuming all your input files are ISO-8859-1 and you want your output
file in UTF-8, your example might look something like this (untested):

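require 'iconv'

File.open("target", "w") do |target|
  # Convert from ISO-8859-1 (second argument) to UTF-8 (first argument).
  Iconv.open('UTF-8', 'ISO-8859-1') do |converter|
    File.open("source", "r") do |source|
      source.each_line do |line|
        # … some processing …
        target.write(converter.iconv(line))
      end
    end
    # Flush any state the converter is still holding.
    target << converter.iconv(nil)
  end
end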

Iconv should deal with BOMs, stripping them out or adding them in where
necessary. I’m not sure whether it will complain if it finds a BOM
mid-stream (as you open your second and subsequent input files). If it
does, you could just instantiate a new Iconv for each input.

HTH
alex
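
Iconv was removed from the standard library in Ruby 2.0. On Ruby 1.9 and later, the same ISO-8859-1 to UTF-8 conversion can probably be sketched with String#encode instead. This is an assumption-laden illustration and not the code from the post above; the file names and sample data are made up.

```ruby
require "tmpdir"

dir = Dir.mktmpdir
source_path = File.join(dir, "source") # hypothetical input file
target_path = File.join(dir, "target") # hypothetical output file

# Sample ISO-8859-1 input: byte E9 is "é" in that encoding.
File.binwrite(source_path, "caf\xE9\n".b)

File.open(target_path, "w:UTF-8") do |target|
  # "r:ISO-8859-1" tags every line read with the source encoding.
  File.open(source_path, "r:ISO-8859-1") do |source|
    source.each_line do |line|
      # ... some processing ...
      target.write(line.encode("UTF-8"))
    end
  end
end

# Raw bytes of the converted file.
converted = File.binread(target_path)
```

Because each line is converted on its own, there is no converter state to flush at the end, and a new input file needs no new converter object.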