Test if file is binary?

RebhanS_Gilbert · August 21, 2007, 8:05am

Hi,

how can I test whether a file is binary or not?

There is nothing like File.binary? built in:
NoMethodError: undefined method `binary?' for File:Class

Any ideas or libraries available?

Regards, Gilbert

RebhanS_Gilbert · August 21, 2007, 8:51am

On Aug 21, 8:04 am, “Rebhan, Gilbert” wrote:

Hi ,

how to test if a file is binary or not ?

There ain’t something like File.binary =
NoMethodError: undefined method `binary?’ for File:Class

Any ideas or libraries available ?

Regards, Gilbert

What do you need to achieve with this is_binary? method?
All files are just a collection of bytes, so from one perspective they
are all binary. We interpret them as suits our needs.

RebhanS_Gilbert · August 21, 2007, 8:58am

Hi,

RebhanS_Gilbert · August 21, 2007, 9:06am

2007/8/21, Rebhan, Gilbert:

Hi ,

how to test if a file is binary or not ?

There ain’t something like File.binary =
NoMethodError: undefined method `binary?’ for File:Class

Any ideas or libraries available ?

If I really needed it, I’d probably use a heuristic based on the
distribution of byte values across an initial portion of the file.
Something like this:

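class File
  # Heuristic: classify the first 1024 bytes and compare the counts of
  # control bytes and high (non-ASCII) bytes to the count of ASCII bytes.
  def self.binary?(name)
    ascii = control = binary = 0

    # read returns nil for an empty file, hence the to_s
    File.open(name, "rb") { |io| io.read(1024) }.to_s.each_byte do |bt|
      case bt
      when 0...32
        control += 1
      when 32...128
        ascii += 1
      else
        binary += 1
      end
    end

    control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
  end
end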

Kind regards

robert

RebhanS_Gilbert · August 21, 2007, 9:13am

Hi,

RebhanS_Gilbert · August 21, 2007, 9:23am

On 21 Aug 2007, at 15:57, Rebhan, Gilbert wrote:

All files are just a collection of bytes, so from one perspective they
are all binary. We interpret them as suits our needs.

For example this information is needed to decide whether
cvs should handle that file / that fileextension as binary or ascii

Regards, Gilbert

One simple approach is this:

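class File
  # True if more than a third of the first 1024 bytes are NUL or >= 128.
  def is_binary?
    odd = 0     # bytes that are NUL or outside 7-bit ASCII
    total = 0
    # to_s guards against the nil that read returns at end of file
    self.read(1024).to_s.each_byte { |c| total += 1; odd += 1 if c >= 128 || c == 0 }
    odd.to_f / total.to_f > 0.33 ? true : false
  end
end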

You can tweak the 0.33 value if you like. There are probably better
(i.e. more robust) ways out there, though.

Alex G.

Bioinformatics Center
Kyoto University

RebhanS_Gilbert · August 21, 2007, 9:42am

2007/8/21, Alex G.:

Sorry for the duplicate! Robert is too fast for me.

It’s always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic. Might be fun to let both approaches
test a large number of files and compare their results (probably also
with output from “file”).
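A rough sketch of such a comparison, for anyone who wants to try it. The two lambdas are simplified stand-ins for the ideas in this thread, not the posted code, and `HEURISTICS` and `compare_heuristics` are hypothetical names:

```ruby
# Each heuristic takes the first chunk of a file as a String and
# returns true for "binary". Thresholds are assumptions.
HEURISTICS = {
  # stand-in for the byte-distribution idea (tab, LF, CR count as text)
  distribution: lambda { |chunk|
    text = chunk.each_byte.count { |b| (32..126).cover?(b) || [9, 10, 13].include?(b) }
    other = chunk.bytesize - text
    text.zero? ? other > 0 : other.to_f / text > 0.1
  },
  # stand-in for the "share of NUL or high bytes" idea
  high_bytes: lambda { |chunk|
    next false if chunk.empty?
    odd = chunk.each_byte.count { |b| b >= 128 || b.zero? }
    odd.to_f / chunk.bytesize > 0.33
  }
}

# Returns a hash of path => verdicts for every file on which the
# heuristics disagree.
def compare_heuristics(paths, sample_size = 1024)
  paths.each_with_object({}) do |path, disagreements|
    chunk = File.open(path, "rb") { |io| io.read(sample_size) }.to_s
    verdicts = HEURISTICS.transform_values { |h| h.call(chunk) }
    disagreements[path] = verdicts if verdicts.values.uniq.size > 1
  end
end
```

Calling `compare_heuristics(Dir.glob("**/*").select { |p| File.file?(p) })` lists only the files where the two verdicts differ, which is where a look at the output of “file” would be most interesting.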

Btw, you should get rid of the ternary operator - it’s totally
superfluous because there is no point in converting a boolean value
into a boolean value.
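To illustrate the point about the ternary, here is a tiny sketch with made-up helper names; both methods return the same value for any input:

```ruby
# Hypothetical helpers, for illustration only.
def over_threshold_with_ternary?(odd, total)
  odd.to_f / total > 0.33 ? true : false
end

def over_threshold?(odd, total)
  odd.to_f / total > 0.33   # the comparison already yields true or false
end

[[0, 10], [4, 10], [10, 10]].each do |odd, total|
  puts over_threshold_with_ternary?(odd, total) == over_threshold?(odd, total)
end
```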

Kind regards

robert

RebhanS_Gilbert · August 21, 2007, 3:59pm

On Aug 21, 12:04 am, “Rebhan, Gilbert” wrote:

Hi ,

how to test if a file is binary or not ?

There ain’t something like File.binary =
NoMethodError: undefined method `binary?’ for File:Class

gem install ptools
require 'ptools'
File.binary?(file)

Regards,

Dan

RebhanS_Gilbert · August 21, 2007, 9:25am

Sorry for the duplicate! Robert is too fast for me.

Alex G.

Bioinformatics Center
Kyoto University

RebhanS_Gilbert · August 21, 2007, 11:08am

Hi,

On Tuesday, 21 Aug 2007, 15:57:13 +0900, Rebhan, Gilbert wrote:

What do you need to achieve with this is_binary? method?

For example this information is needed to decide whether
cvs should handle that file / that fileextension as binary or ascii

I’m impressed by the solutions of Alex and Robert. Anyway I
suppose in most cases a test on one single null character
will suffice. Something like this:

class File
  def binary?
    while (b = f.read(256)) do
      return true if b["\0"]
    end
  end
end

Still, I recommend first considering whether you want to read the
file later anyway. In that case you can abort reading when the file
fails a more sophisticated file type check.

Dividing files into “text” and “binary” is the archetypal
misdesign of the operating system you use. (Is there
anything designed well, besides Outlook, of course?) The
distinction doesn’t refer to the file’s contents but to how
the file is treated when it is read or written. In
“rb”/“wb” modes files are left as they are; in “r”/“w”
modes Windows programmers get the line ends “\r\n” translated
into “\n”, which disturbs file positions and string lengths.
I think the only purpose of this is to deter programmers
from doing anything in a non-Microsoft way.
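A minimal sketch of the binary-mode side of this, using a scratch file in a temporary directory (`read_raw` and the file name are made up for the example). On Unix-like systems text mode returns the same bytes; the translation described above is specific to Windows:

```ruby
require "tmpdir"

# Hypothetical helper: read a file with no newline translation at all.
def read_raw(path)
  File.open(path, "rb") { |f| f.read }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt")   # scratch file for the demo
  File.open(path, "wb") { |f| f.write("one\r\ntwo\r\n") }

  raw = read_raw(path)
  puts raw.bytesize == File.size(path)  # bytes exactly as stored on disk
  puts raw.include?("\r\n")             # the CR LF pairs are still there
end
```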

Bertram

RebhanS_Gilbert · August 21, 2007, 5:53pm

Hi,

On Tuesday, 21 Aug 2007, 18:06:26 +0900, Bertram S. wrote:

class File
  def binary?
    while (b = f.read(256)) do
      return true if b["\0"]
    end
  end
end

This is a blunder, of course. Some better ones:

# Variant 1: read in chunks and look for a NUL byte.
def File.binary? name
  open name, "rb" do |f|
    while (b = f.read(256)) do
      return true if b["\0"]
    end
  end
  false
end

# Variant 2: same test, byte by byte.
def File.binary? name
  open name, "rb" do |f|
    f.each_byte { |x|
      x.nonzero? or return true
    }
  end
  false
end

Just to be correct.

Bertram

RebhanS_Gilbert · August 21, 2007, 9:36pm

Robert K. (09:04) wrote:

If I’d really need it I’d probably do a heuristic based on
distribution of byte values across an initial portion of the file.

That only shows how many non-ASCII characters are used. It won’t
recognise Russian script in UTF-8 as text, or uuencoded data as binary.

What diff (and thus rcs, cvs, svn …) cares about is lines. Something
is text if it’s logically organized in short lines, and EOL characters
are used only for ending lines.
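Both cases are easy to reproduce. In this small sketch `high_byte_ratio` is a made-up helper that counts bytes the way the byte-distribution heuristics do, and the sample strings are arbitrary:

```ruby
# Hypothetical helper: share of bytes >= 128, as a byte counter sees it.
def high_byte_ratio(text)
  return 0.0 if text.empty?
  text.each_byte.count { |b| b >= 128 }.to_f / text.bytesize
end

russian = "Привет, мир"                  # plain text, encoded as UTF-8
uu      = ["\x00\x01\xFE\xFF".b * 10].pack("u")   # uuencoded binary data

puts high_byte_ratio(russian)   # most bytes are >= 128, yet it is text
puts russian.valid_encoding?    # it decodes cleanly as UTF-8
puts high_byte_ratio(uu)        # no high bytes, yet the payload is binary
```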

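class File
  # Line-oriented heuristic on the first 1024 bytes: binary if a CR is
  # not followed by LF, or if a line is longer than 1000 bytes.
  def self.binary?(name)
    cr, len, mlen = false, 0, 0
    # to_s guards against the nil that read returns for an empty file
    File.open(name, "rb") { |io| io.read(1024) }.to_s.each_byte do |bt|
      return true if cr and bt != 10   # stray CR: not a line ending
      cr = false
      case bt
      when 13
        cr = true
      when 10
        mlen = len if len > mlen
        len = 0
      else
        len += 1
      end
    end
    mlen = len if len > mlen           # count an unterminated last line
    mlen > 1000
  end
end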

I chose 1000 as the maximum line length, to fit whole paragraphs in one
line. But of course the maximum of the processing tool is what is relevant
here. That tool is the right place to do the check anyway.
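Since the limit depends on the tool that consumes the file, a variant could take it as a parameter. This is only a sketch: `line_oriented?` and `max_line_length` are hypothetical names, and unlike the code above it scans the whole file and checks line length only:

```ruby
# Hypothetical helper: true if no line exceeds max_line_length bytes
# (line ending excluded).
def line_oriented?(path, max_line_length = 1000)
  File.open(path, "rb") do |io|
    # The limit argument keeps each_line from loading a huge
    # newline-free file into memory in one piece.
    io.each_line("\n", max_line_length + 2) do |line|
      return false if line.chomp.bytesize > max_line_length
    end
  end
  true
end
```

With a file containing one 8000-character line, `line_oriented?(path)` would be false while `line_oriented?(path, 10_000)` would be true.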

mfg, simon … l

RebhanS_Gilbert · August 21, 2007, 10:43pm

Simon K. wrote:

are used only for ending lines.
[snip]

I chose 1000 as the maximum line length, to fit whole paragraphs in one
line. But of course the maximum of the processing tool is what is relevant
here. That tool is the right place to do the check anyway.

That’s why clearcase (on windows) claimed my pure-ascii xml-file was
non-text (and did refuse to check it in). One line exceeded 8000
characters.

This is on my personal list of ‘bad practices’, but it may be
appropriate to others.

My 0.02EUR

Stefan

RebhanS_Gilbert · August 22, 2007, 3:06am

Stefan M. (22:40) wrote:

That’s why clearcase (on windows) claimed my pure-ascii xml-file was
non-text (and did refuse to check it in). One line exceeded 8000 characters.

You can’t seriously treat a file with lines longer than 8000 characters
as line oriented. It’s far from being readable by a human. You declare
that file as application/xml.

One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.

This is on my personal list of ‘bad practices’, but it may be
appropriate to others.

I think it’s bad practice to declare something with huge lines as text.

mfg, simon … l

RebhanS_Gilbert · August 22, 2007, 9:27am

Simon K. wrote:

Stefan M. (22:40) wrote:

That’s why clearcase (on windows) claimed my pure-ascii xml-file was
non-text (and did refuse to check it in). One line exceeded 8000 characters.

You can’t seriously treat a file with lines longer than 8000 characters
as line oriented. It’s far from being readable by a human. You declare
that file as application/xml.

Maybe this was a bad example. You are right, the xml-file would be best
treated by clearcase as application/xml or text/xml. This did not work
(and I was bitten by this recently - so this strange behaviour was fresh
when I read your email).

But I cannot see the problem with text-files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters - why
shouldn’t this be text?

Why do you think it is not readable?

One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.

Sorry, I fail to see your point. Are we really judging whether a file is
text by how many memory pages a diff will take or how many characters a
patch has?

I couldn’t find a definition of text except that text means absence of
binary data. This is weak - so I would follow your definition - A text
file is a file which can be read by a human.

This is on my personal list of ‘bad practices’, but it may be
appropriate to others.

I think it’s bad practice to declare something with huge lines as text.

Well, I disagree.

But to get (slightly at least) on topic again: if I had to detect
whether a file is text, I would go with a combination of Robert K.’s
and Bertram S.’s solutions.
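One possible reading of such a combination, as a sketch only: a NUL byte decides immediately, otherwise the share of unusual bytes in an initial sample decides. The name `probably_binary?`, the sample size and the 0.3 threshold are all assumptions, not taken from either post:

```ruby
# Hypothetical combination of the NUL-byte test and a byte-distribution test.
def probably_binary?(path, sample_size = 1024, threshold = 0.3)
  chunk = File.open(path, "rb") { |io| io.read(sample_size) }.to_s
  return false if chunk.empty?          # nothing to judge; treat as text
  return true if chunk.include?("\0")   # a NUL byte settles it

  # Count bytes that are neither printable ASCII nor tab, LF or CR.
  odd = chunk.each_byte.count do |b|
    !(b.between?(32, 126) || [9, 10, 13].include?(b))
  end
  odd.to_f / chunk.bytesize > threshold
end
```

Like the byte-counting approaches it builds on, this would still misjudge text in non-Latin scripts, so the threshold needs tuning for the files at hand.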

Stefan

RebhanS_Gilbert · August 22, 2007, 1:36pm

Stefan M. (09:25) wrote:

Maybe this was a bad example. You are right, the xml-file would be best
treated by clearcase as application/xml or text/xml. This did not work
(and I was bitten by this recently - so this strange behaviour was fresh
when I read your email).

Note that Subversion would just treat the file as binary and process it
with its binary diff.

But I cannot see the problem with text-files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters - why
shouldn’t this be text?

If that’s really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it’s 100 lines!)

Why do you think it is not readable?

I think that an XML file that has huge lines is unreadable since a
human wouldn’t recognize any structure, when all the elements are on a
single line.

One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.

Sorry, I fail to see your point.

That’s another point. Or actually two. The hard one being the limitation
of the software used: It may have a maximum line length.

Are we really judging whether a file is text by how many memory pages
a diff will take or how many characters a patch has?

No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.

Text is usually well suited, as is anything else that is line oriented
and where typical changes affect only one or a few neighboring lines.

mfg, simon … l

RebhanS_Gilbert · August 22, 2007, 8:50pm

Simon K. wrote:

Stefan M. (09:25) wrote:

Maybe this was a bad example. You are right, the xml-file would be best
treated by clearcase as application/xml or text/xml. This did not work
(and I was bitten by this recently - so this strange behaviour was fresh
when I read your email).

Note that Subversion would just treat the file as binary and process it
with its binary diff.

I didn’t know this. Thanks for the info.

But I cannot see the problem with text-files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters - why
shouldn’t this be text?

If that’s really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it’s 100 lines!)

Agreed. But it is still text - which was the point I tried to make.

Why do you think it is not readable?

I think that an XML file that has huge lines is unreadable since a
human wouldn’t recognize any structure, when all the elements are on a
single line.

My question was directed to the 8000 char-paragraph. I even find small
xml-files unreadable - so I completely agree with you that 8000 chars of
xml-data in a single line is far from being readable by a human. Anyway,
xml is meant to be processed by machines.

But even this case I would classify as text (I’m changing my earlier
definition slightly) if it does not contain binary data. The xml in a
file is semantics. And I assume the question text or binary refers to
syntax.

One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.

Sorry, I fail to see your point.

That’s another point. Or actually two. The hard one being the limitation
of the software used: It may have a maximum line length.

If I understand the original poster correctly, he wants to
programmatically detect whether a file is “binary or text”. My point was
that he shouldn’t restrict his program artificially, but this depends on
context.

Are we really judging whether a file is text by how many memory pages
a diff will take or how many characters a patch has?

No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.

Text usually is suited as well as everything else that is line oriented
and typical changes affect only one or a few neighboring lines.

These are things I’m normally not concerned about, that’s why I couldn’t
follow that subject change.

Do I summarize correctly that depending on the purpose of the check one
could use a maximum line length - or any other of the posted approaches?

Aka ‘use the right tool for the job’ + ‘There is no single answer to
this question’?

Stefan

RebhanS_Gilbert · August 22, 2007, 3:26pm

On Aug 22, 2007, at 1:35 PM, Simon K. wrote:

Note that Subversion would just treat the file as binary and
process it
with its binary diff.

It also disables newline normalization (which may or may not be an
issue in that case).

– fxn

RebhanS_Gilbert · August 22, 2007, 10:32pm

Xavier N. (15:24) wrote:

Note that Subversion would just treat the file as binary and process
it with its binary diff.

It also disables newline normalization (which may or may not be an
issue in that case).

Which is configurable for text files, too.

mfg, simon … end of off topic

RebhanS_Gilbert · August 23, 2007, 4:35am

Stefan M. (20:46) wrote:

My question was directed to the 8000 char-paragraph. I even find small
xml-files unreadable

Well, there are a lot of XML files that I find readable, including many
that I or my software wrote.

Of course there are perversions like XMI and Microsoft’s new formats.

so I completely agree with you that 8000 chars of xml-data in a
single line is far from being readable by a human.

And thus it’s binary and not text.

Anyway - xml is meant to be processed by machines.

It’s meant to be read by an XML parser, which a regular diff isn’t. So
only special cases are well suited for diff, and other special cases are
human readable.

But even this case I would classify as text (I’m changing my earlier
definition slightly) if it does not contain binary data.

I would say it’s text if, interpreted as text/plain, it’s human
readable. Otherwise it’s binary. That is, binary = for machines only.

If I understand the original poster correctly, he wants to
programmatically detect whether a file is “binary or text”. My point was
that he shouldn’t restrict his program artificially, but this depends on
context.

Yes, in the original post he didn’t say for what purpose. If it’s for
diffing, the line structure is what matters.

Do I summarize correctly that depending on the purpose of the check one
could use a maximum line length - or any other of the posted approaches?

The other approaches are good for deciding whether a file contains text in
Latin-based scripts. That’s only a small subset of text, and they will
happily classify base64 as text.

Aka ‘use the right tool for the job’ + ‘There is no single answer to
this question’?

Yes. Probably the best approach is to use file(1).
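A hedged sketch of what calling file(1) from Ruby could look like. It assumes a `file` implementation that supports the `--mime-encoding` option and reports `binary` for non-text data; `binary_according_to_file?` is a made-up name:

```ruby
require "open3"

# Hypothetical wrapper around file(1). Returns true or false, or nil if
# the command is missing or exits with an error.
def binary_according_to_file?(path)
  out, status = Open3.capture2("file", "-b", "--mime-encoding", path)
  return nil unless status.success?
  out.strip == "binary"
rescue Errno::ENOENT
  nil   # no file(1) on this system
end
```

Returning nil rather than guessing leaves room to fall back to one of the pure-Ruby heuristics from this thread when file(1) is not installed.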

mfg, simon … l