Forum: Ruby File.new and encoding

achim.domma (Guest)
on 2005-11-29 17:09
(Received via mailing list)
Hi,

I'm still quite new to Ruby, but I have written a simple code generator.
The generator opens some files and combines them into a new one. The
resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
a UTF-8 marker at the beginning of the file. Is that possible?

How can I tell Ruby which encoding to use when I write to text files?

Any pointers to documentation are welcome; I didn't find anything
useful using Google.

regards,
Achim
bob.news (Guest)
on 2005-11-29 17:21
(Received via mailing list)
Achim D. (SyynX Solutions GmbH) wrote:
> Hi,
>
> I'm still quite new to Ruby, but I have written a simple code generator.
> The generator opens some files and combines them into a new one. The
> resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
> a UTF-8 marker at the beginning of the file. Is that possible?

What's a UTF-8 marker? I only know of the two-byte UTF-16 marker, and
AFAIK there is no marker for UTF-8. Did I miss something?

> How can I tell Ruby which encoding to use when I write to text files?
>
> Any pointers to documentation are welcome; I didn't find anything
> useful using Google.

Encoding is not an easy issue with Ruby. I guess that by default it uses
the default encoding of your environment, but you can specify certain
(Japanese) encodings with the command line option -K. HTH
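
For readers on a newer interpreter, here is a minimal sketch of how the environment default can be inspected and how a string can be transcoded explicitly. It assumes Ruby 1.9 or later, where Encoding and String#encode exist (they were not available when this thread was written). The helper name to_latin1 is made up for illustration.

```ruby
# Assumption: Ruby 1.9 or later (Encoding and String#encode exist there).

# The encoding Ruby assumes for file contents unless told otherwise.
# It is derived from the environment, so the value differs between machines.
puts Encoding.default_external

# Hypothetical helper: transcode a string to ISO-8859-1.
# Raises Encoding::UndefinedConversionError for characters that
# ISO-8859-1 cannot represent.
def to_latin1(text)
  text.encode("ISO-8859-1")
end

utf8_text = "Gr\u00FC\u00DFe\n"   # UTF-8 string, two non-ASCII characters
latin1_text = to_latin1(utf8_text)
puts latin1_text.encoding          # the encoding tag after transcoding
```

In ISO-8859-1 every character is a single byte, so the transcoded string is shorter in bytes than its UTF-8 original whenever non-ASCII characters are present.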

Kind regards

    robert
nobu (Guest)
on 2005-11-29 17:37
(Received via mailing list)
Hi,

At Wed, 30 Nov 2005 00:17:29 +0900,
Robert K. wrote in [ruby-talk:167988]:
> > I'm still quite new to Ruby, but I have written a simple code generator.
> > The generator opens some files and combines them into a new one. The
> > resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
> > a UTF-8 marker at the beginning of the file. Is that possible?
>
> What's a UTF-8 marker? I only know of the two-byte UTF-16 marker, and
> AFAIK there is no marker for UTF-8. Did I miss something?

It would be the UTF-8 encoded BOM, but Ruby itself never writes it
automatically.
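
For reference, the UTF-8 encoding of the byte order mark (U+FEFF) is the three bytes EF BB BF. A minimal sketch of how one might check whether a file starts with them follows. The helper name utf8_bom? is made up for illustration, and String#b assumes Ruby 2.0 or later.

```ruby
# The UTF-8 encoded BOM (U+FEFF) as a binary string.
# Assumption: Ruby 2.0 or later, where String#b is available.
UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: true if the file at +path+ begins with the UTF-8 BOM.
# The file is opened in binary mode so the bytes are compared untouched.
# For an empty file, read(3) returns nil and the comparison is false.
def utf8_bom?(path)
  File.open(path, "rb") { |f| f.read(3) } == UTF8_BOM
end
```

Running such a check over each input file would show which of them carries the marker.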

> > How can I tell Ruby which encoding to use when I write to text files?

Can't you show the code?
achim.domma (Guest)
on 2005-11-29 20:56
(Received via mailing list)
removed_email_address@domain.invalid wrote:

> It would be the UTF-8 encoded BOM, but Ruby itself never writes it
> automatically.
[...]
> Can't you show the code?

While trying to reproduce the problem in a smaller example, I figured out
that I'm reading the BOM from one of my source files. Sorry for the
confusion. I'm doing something like:

File.open("target", "w") do |target|
  File.open("source", "r") do |source|
    source.each_line do |line|
      # ... some processing ...
      target.write(line)
    end
  end
end


The source seems to contain the BOM, and it is written to the target. Any
hints on how to strip the BOM?

regards,
Achim
alex (Guest)
on 2005-11-29 21:36
(Received via mailing list)
> I'm doing something like:
>
> File.open("target", "w") do |target|
>   File.open("source", "r") do |source|
>     source.each_line do |line|
>       # ... some processing ...
>       target.write(line)
>     end
>   end
> end

Have you looked at 'iconv' in the standard library?

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/c...

Assuming all your input files were ISO-8859-1, and you wanted your
output file in UTF-8, your example might look something like (untested):

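# Iconv was part of the standard library in Ruby 1.8 and 1.9;
# from Ruby 2.0 on it is only available as the separate iconv gem.
require 'iconv'

File.open("target", "w") do |target|
  # Iconv.open takes the target encoding first, then the source encoding.
  Iconv.open('UTF-8', 'ISO-8859-1') do |converter|
    File.open("source", "r") do |source|
      source.each_line do |line|
        # ... some processing ...
        target.write(converter.iconv(line))
      end
    end
    # Passing nil flushes any state the converter is still holding.
    target << converter.iconv(nil)
  end
end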

Iconv should deal with BOMs, stripping them out or adding them in where
necessary. I'm not sure whether it will complain if it finds a BOM
mid-stream (as you open your second and subsequent input files). If so,
you could just instantiate a new Iconv to deal with each input.

HTH
alex
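
On newer interpreters the loop posted earlier in the thread can drop the BOM without Iconv. The following is only a sketch under stated assumptions: Ruby 1.9.2 or later (where the "BOM|UTF-8" read mode exists), every source file is UTF-8, and the target should be ISO-8859-1 as in the original question. The method name combine_sources is made up for illustration.

```ruby
# Sketch only. Assumptions: Ruby 1.9.2 or later, all sources are UTF-8,
# and the combined output should be ISO-8859-1.
# combine_sources is a hypothetical helper, not part of any library.
def combine_sources(source_paths, target_path)
  # "w:ISO-8859-1" makes the IO transcode every written string to ISO-8859-1.
  File.open(target_path, "w:ISO-8859-1") do |target|
    source_paths.each do |path|
      # "r:BOM|UTF-8" reads the file as UTF-8 and skips a leading BOM
      # if one is present. Each source is opened separately, so a BOM at
      # the start of a later file never reaches the output.
      File.open(path, "r:BOM|UTF-8") do |source|
        source.each_line do |line|
          # ... some processing ...
          target.write(line)
        end
      end
    end
  end
end
```

It could be called as combine_sources(["a.txt", "b.txt"], "target"), with the file names standing in for real paths. A character that has no ISO-8859-1 equivalent raises Encoding::UndefinedConversionError on write, so input beyond Latin-1 would need explicit handling.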