Forum: Ruby read text and binary files

Eric Peterson (Guest)
on 2014-06-30 19:16
(Received via mailing list)
Someone just wrote: "we're hungry for some actual Ruby discussion."  So
here's my question worthy of 2¢ of all y'all's time (but not much more).


I was playing around and thinking I'd like to look at files and see what
sort of Unicode characters are in each and how many.  I was using this to
validate some data files that were sent to us before I attempted to
upload them to the database.

Using Perl I was able to do this


\\\\\\ perl incomplete snippet /////
use open IO => ':utf8'; # all I/O in utf8
no warnings 'utf8'; # but ignore utf-8 warnings
binmode( STDIN, ":utf8" );
binmode( STDOUT, ":utf8" );
binmode( STDERR, ":utf8" );
use Unicode::UCD 'charinfo';

open( my $fh, '<', $file ) or die "Unable to open $file - $!\n";
while ( $line = <$fh> ) {
  my @chars = split( //, $line );
  foreach my $char ( @chars )
...
    $info->{code}
    $info->{name}
...
///// perl incomplete snippet \\\\\\


\\\\\\ perl output /////
         Dec Hex    Letter   Count Desc

     1     9 0x0009 [HT]         2 C0 Control Character - Horizontal Tabulation (^I \t)
     2    10 0x000A [LF]       332 C0 Control Character - Line Feed (^J \n)
     3    32 0x0020 [SP]     1,821 Space
     4    33 0x0021 [!]          7 EXCLAMATION MARK
     5    34 0x0022 ["]         42 QUOTATION MARK
///// perl output \\\\\\\


OK, so now I want to try the same in Ruby.  Where the Perl script above
can read text and binary files, the Ruby snippet below can only do text
files.  The reason I'd like it to read binary files: occasionally we are
sent bad files containing characters that mark them as binary.  I'd like
to respond to the vendor with which extraneous characters they have
included in the file and which lines they are on.

I thought of doing "rb:utf-8:-" on the File.open, but that didn't work
either.

Any ideas?



\\\\\\ ruby /////
#!/usr/bin/env ruby
# -*- encoding: utf-8 -*-
require "unicode_utils"
File.open( fn, "r:utf-8:-" ) do |input|

  input.each_line do |line|
    line.each_char do |c|

      puts UnicodeUtils.char_name( c )
...
///// ruby \\\\\\\
7stud -- (7stud)
on 2014-07-01 02:34
> Where the perl script above
> can read text and binary files,

While true, I don't think it does so correctly.  split() becomes Unicode
aware when a string has been decoded from UTF-8, and you've told Perl to
automatically decode every line it reads as UTF-8.  That means split()
is splitting on characters, not bytes.  So if your binary data happens
to contain the sequence:

\x{E2}
\x{82}
\x{AC}

then that sequence will not be split into three bytes, because it is the
UTF-8 encoding of a Euro Sign; all three bytes will be split off as one
character.  Many combinations of random bytes happen to form a valid
UTF-8 character.
Proof:

    use strict;
    use warnings;
    use 5.016;

    my $fname =  'data.txt';

    #Write 3 bytes to file: ---------

    open my $OUTFILE, '>', $fname
        or die "Couldn't open $fname: $!\n";

    #my $str = "\N{EURO SIGN}";   #UTF-8 encoding: E2 82 AC
    print {$OUTFILE} "\x{E2}\x{82}\x{AC}";

    close $OUTFILE;

    #---------------------------


    use open IO => ':utf8';   #Now read file as utf8

    open my $INFILE, '<', $fname
        or die "Couldn't open $fname: $!\n";

    while ( my $line = <$INFILE> ) {
      my @chars = split //, $line;
      say scalar @chars;   #=> 1
    }

    close $INFILE;


As for ruby, Strings have the following methods:

each_byte()
each_char()

And you can specify the encoding of the file that you are reading when
you create the filehandle:


    #encoding: UTF-8

    #The previous comment line is so the string on line 9 will be
    #encoded with UTF-8.  The encoding in the comment applies
    #only to Strings in the source file.

    fname = 'data.txt'

    File.open(fname, 'w') do |f|
      f.write("\u20AC")  #LINE 9, Euro Sign
    end

    File.open(fname, 'r', external_encoding: 'UTF-8') do |f|
      f.each_line do |line|

        line.each_char do |char|   #UTF-8 chars
          puts char
        end

        line.each_byte do |byte|
          printf "%x \n", byte
        end

      end
    end

    --output:--
    €  (I see a Euro Sign)
    e2
    82
    ac
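
For the counting part of the original question, here is a minimal sketch of a per-character tally built on each_char. The helper name tally_chars and the sample string are made up for illustration, the columns only loosely mimic the Perl report earlier in the thread, and it assumes the text is valid UTF-8 (String#ord raises on an invalid character).

```ruby
# Count how often each character occurs in a string.
# `tally_chars` is a hypothetical helper name, not something from the thread.
def tally_chars(text)
  counts = Hash.new(0)
  text.each_char { |char| counts[char] += 1 }
  counts
end

sample = "a\u20ACa\n"   # 'a', Euro Sign, 'a', line feed

# Print decimal code point, hex code point, the character and its count,
# ordered by code point.
tally_chars(sample).sort_by { |char, _| char.ord }.each do |char, count|
  printf "%6d 0x%04X %-8s %d\n", char.ord, char.ord, char.inspect, count
end
```

Character names are left out here; the unicode_utils gem from the original snippet could supply them.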
Addis Aden (Guest)
on 2014-07-01 11:07
(Received via mailing list)
Hi,

I am not so familiar with Unicode, but the difference between binary and
text files is that in binary mode every byte that is not ASCII is
presented as \x.., so Unicode characters are also presented as two or
more \x.. escapes.

Maybe you can read the string first as binary and use the method
force_encoding (http://ruby-doc.org/core-2.0/String.html#method-i-...)
to set it to UTF-8.
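
A rough sketch of that idea, assuming the goal is to tell the vendor which lines contain bytes that are not valid UTF-8: open the file with "rb", relabel each line with force_encoding, and collect the characters for which valid_encoding? is false. The helper name invalid_utf8_lines and the report format are invented for illustration.

```ruby
# Return [line_number, [hex strings of offending bytes]] for every line
# that is not valid UTF-8.
# `invalid_utf8_lines` is a hypothetical helper name, not from the thread.
def invalid_utf8_lines(path)
  report = []
  File.open(path, "rb") do |io|               # "rb": bytes are read as-is
    io.each_line.with_index(1) do |line, lineno|
      line.force_encoding(Encoding::UTF_8)    # relabel only; bytes unchanged
      next if line.valid_encoding?

      # each_char yields broken byte sequences as their own "characters",
      # so the invalid ones can be picked out individually.
      bad = line.each_char.reject(&:valid_encoding?)
      report << [lineno, bad.map { |c| c.bytes.map { |b| "%02X" % b }.join(" ") }]
    end
  end
  report
end

# Usage: ruby this_script.rb some_file [more_files...]
ARGV.each do |path|
  invalid_utf8_lines(path).each do |lineno, bytes|
    puts "#{path}:#{lineno}: invalid bytes #{bytes.join(', ')}"
  end
end
```

Lines are still split on "\n", so line numbers match what a text editor would show as long as the file uses LF or CRLF line endings.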

How many files do you have to examine?

best regards
adrian



2014-06-30 19:14 GMT+02:00 Eric Peterson <epeterson@rhapsody.com>:
Robert Klemme (robert_k78)
on 2014-07-01 11:44
(Received via mailing list)
On Mon, Jun 30, 2014 at 7:14 PM, Eric Peterson <epeterson@rhapsody.com>
wrote:

> I thought of doing "rb:utf-8:-" on the File.open, but that didn't work
> either.
>
> Any ideas?

You were pretty close:

$ ruby -e 'File.open("xx", "rb") {|io| p io.external_encoding}'
#<Encoding:ASCII-8BIT>

ASCII-8BIT is binary:

$ ruby -e 'p Encoding::BINARY'
#<Encoding:ASCII-8BIT>

For comparison

$ ruby -e 'File.open("xx", "r") {|io| p io.external_encoding}'
#<Encoding:UTF-8>

Does that help?

Kind regards

robert
7stud -- (7stud)
on 2014-07-02 03:30
Robert Klemme wrote in post #1151291:

> For comparison
>
> $ ruby -e 'File.open("xx", "r") {|io| p io.external_encoding}'
> #<Encoding:UTF-8>
>
> Does that help?
>

Well, he certainly can't rely on that.

===
The default external Encoding is pulled from your environment, much like
the source Encoding is for code given on the command-line
===
(James Edward Gray II)

And has been discussed recently here:

https://www.ruby-forum.com/topic/4980931#new

...the default external_encoding that ruby slaps on Strings read in from
IO objects does not mean those Strings are actually encoded in that
encoding.
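
A small illustration of that point, as a sketch rather than anything from the posts above: the encoding on a String is only a label, and valid_encoding? is what actually checks the bytes. The helper name label_as_utf8 is made up, and String#scrub assumes Ruby 2.1 or later.

```ruby
# Tag raw bytes as UTF-8 without checking or converting them.
# `label_as_utf8` is a hypothetical helper name for this example.
def label_as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# Three bytes that form a Euro Sign in UTF-8, followed by one stray byte.
raw = "\xE2\x82\xAC\xFF".b       # .b returns a copy tagged ASCII-8BIT

labelled = label_as_utf8(raw)
p labelled.encoding              # the label now says UTF-8 ...
p labelled.valid_encoding?       # => false, the bytes are not valid UTF-8

# String#scrub replaces each invalid byte sequence with the given string.
puts labelled.scrub("?")
```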