Determining if a file is binary or text

Hi all,

I need to search text files for a given expression and flag a warning/
error if that expression does not exist. I’m going to search a large
number of files using the Linux “find” command, so I won’t know if
they are binary or text.

I realize that this can be OS-dependent and can be tricky to
determine. I was going to use the Linux “file” command which works
well in providing human-readable information about the file; however,
due to a variety of possible file types, I cannot easily determine the
file type without specifying every single possible text file format to
consider. For example, the “file” command can produce the following
(all of which are ASCII):

ASCII text
XML document text
Lisp/Scheme program text

Is there an easy way to do this in Ruby? After looking around quite a
bit, I thought about looking at the first few lines of the file and
matching against this regular expression:

Character class:

[:print:] Any printable character, including space

line.match(/^[[:print:]]+$/)

Which I believe could work. Any comments?
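
A minimal sketch of that idea, assuming we only inspect the first few
lines (the ten-line limit is an arbitrary choice here, and * is used
rather than + so that blank lines pass):

```ruby
# Sketch: treat a file as text if its first few lines contain only
# printable characters. Reading in binary mode ("rb") avoids encoding
# errors on arbitrary bytes.
def looks_like_text?(path, max_lines = 10)
  File.open(path, "rb") do |f|
    f.each_line.first(max_lines).all? do |line|
      # * instead of + so blank lines don't fail the test
      line.chomp =~ /^[[:print:]]*$/
    end
  end
end
```

Note that [[:print:]] excludes tabs, so a source file indented with
tabs would be flagged as binary by this version.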

Thanks,
-James

On 2009-09-18, James M. [email protected] wrote:

I need to search text files for a given expression and flag a warning/
error if that expression does not exist. I’m going to search a large
number of files using the Linux “find” command, so I won’t know if
they are binary or text.

This question is not well defined.

Think about UTF-8 and ISO-8859-1…

Basically, stop and think what you mean by “binary or text”. Once
you’ve articulated that more clearly, you may well have a much better
notion of what you mean.

Would you be expecting to not see this message in a “binary” file? If
so, why are they different? What about binary files makes them not need
the message (or what about text files makes them not need it…)? If
you mean “executables”, you might approximate decently by checking the
execute permission bit…

-s

On Sep 18, 5:12 pm, Seebs [email protected] wrote:

Basically, stop and think what you mean by “binary or text”. Once you’ve
articulated that more clearly, you may well have a much better notion of what
you mean.

How about a file that contains any single byte character (0-255) that
you cannot find a key for on a standard US keyboard (English)? The
[:print:] regular expression character set comprises the range of
characters 32-126, which is what I believe that I need, but I wanted
to see if there are better ways to accomplish this.

Basically I’m trying to search for the presence of a header in source
code files (which may have various extensions or no extensions at
all). The source code files are mixed with executable and
non-executable “binary” files (data files; not something that you can
read). I don’t want to flag the non-source code files as not having a
header. The scope of this problem is small so I don’t need to worry
about any character sets, etc.

I realize that this can be a complicated problem to solve, but there
are solutions to it. For example, the Linux “file” command is a
robust solution but does not meet my needs for the previously stated
reason. I also know that SVN can automatically detect binary files as
well.

Hopefully this helps clear things up…

Thanks,
-James

On 2009-09-19, James M. [email protected] wrote:

How about a file that contains any single byte character (0-255) that
you cannot find a key for on a standard US keyboard (English)? The
[:print:] regular expression character set comprises the range of
characters 32-126, which is what I believe that I need, but I wanted
to see if there are better ways to accomplish this.

Well, you probably also want tabs and newlines. :)

I would think that [:print:] might also, in some locales, get you things
like accented letters. Whether or not you want this is harder to say.

Basically I’m trying to search for the presence of a header in source
code files (which may have various extensions or no extensions at
all). The source code files are mixed with executable and
non-executable “binary” files (data files; not something that you can
read). I don’t want to flag the non-source code files as not having a
header. The scope of this problem is small so I don’t need to worry
about any character sets, etc.

I thought that until I found a dozen Makefiles with copyright symbols
embedded in them. :P

I’d say as a first approximation, just check for NUL bytes. I’m pretty
sure that the vast majority of binary files will contain at least one,
and the vast majority of text files will contain none.
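
In Ruby, that first approximation might look like the following (the
4 KB sample size is an arbitrary choice for this sketch):

```ruby
# First approximation: call a file "binary" if a leading chunk of it
# contains at least one NUL byte.
def contains_nul?(path, sample_size = 4096)
  chunk = File.open(path, "rb") { |f| f.read(sample_size) }
  !chunk.nil? && chunk.include?("\0")   # an empty file reads as nil
end
```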

-s

On Sep 18, 7:54 pm, Seebs [email protected] wrote:

Well, you probably also want tabs and newlines. :)

Ah, good point… :)

I thought that until I found a dozen Makefiles with copyright symbols
embedded in them. :P

I’d say as a first approximation, just check for NUL bytes. I’m pretty
sure that the vast majority of binary files will contain at least one,
and the vast majority of text files will contain none.

Yeah, this is another idea that I had also considered… I’m just not
sure if all of the binary files that I’m dealing with have NUL bytes,
though. But that might just be good enough.

Fortunately, I’m working with a small team of individuals who will be
authoring the files, so I do have some control over the type of text
that I’m looking for. So I might try [:print:], \n, \t, and maybe \r
(just in case) and then fall back on the NUL idea as a Plan B.

Thanks again,
-James

On 2009-09-19, James M. [email protected] wrote:

Fortunately, I’m working with a small team of individuals who will be
authoring the files, so I do have some control over the type of text
that I’m looking for. So I might try [:print:], \n, \t, and maybe \r
(just in case) and then fall back on the NUL idea as a Plan B.

How many files are you dealing with?

Hmm. Some source files (scripts, say) will be executable, so you can’t
assume executables are binaries. But… you might want to experiment
with testing a few likely heuristics and maybe making a chart. Say,
make a list of:

TEST:      .jpg   x-bit   NUL   128-255

FILE:
foo.jpg     X      -       X       X
foo.sh      -      X       -       -

and then look to see whether you can make some simple rules, like
“everything with .jpg or .gif is definitely a binary.” If you can
get a couple of simple rules that deal with 90% or so of the files,
then you can look at the remainder as a separate case and work from
there.

Don’t feel compelled to make a single perfect test when three easy tests
that handle 70% of the cases might give you a remaining pool for which
it’s much easier to write a good test.
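
A quick Ruby sketch of building such a chart; the extension list, the
512-byte sample, and the column layout are illustrative assumptions:

```ruby
# Hypothetical list of extensions assumed to be binary.
BINARY_EXTS = %w[.jpg .gif .png .o .a].freeze

# Run a few cheap heuristics against one file.
def heuristics_row(path)
  sample = File.open(path, "rb") { |f| f.read(512) } || ""
  {
    ext:  BINARY_EXTS.include?(File.extname(path).downcase),
    xbit: File.executable?(path),
    nul:  sample.include?("\0"),
    high: sample.bytes.any? { |b| b >= 128 }
  }
end

# Print one row per file, with an X wherever a heuristic fired.
def print_chart(paths)
  printf("%-20s %-5s %-5s %-5s %s\n", "FILE", "ext", "xbit", "NUL", "128-255")
  paths.each do |p|
    marks = heuristics_row(p).values.map { |v| v ? "X" : "-" }
    printf("%-20s %-5s %-5s %-5s %s\n", File.basename(p), *marks)
  end
end
```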

-s

On Sep 18, 8:47 pm, John W Higgins [email protected] wrote:

What about file -i which returns the MIME type instead of “human readable”
format. That should limit the choices it will return or at least give you
something you can work with.

Hi John - that’s a good idea - I looked over the “file” command
options over and over again today and somehow I missed this.

Evening James.

On Fri, Sep 18, 2009 at 4:15 PM, James M.
[email protected] wrote:

For example, the “file” command can produce the following

(all of which are ASCII):

ASCII text
XML document text
Lisp/Scheme program text

What about file -i which returns the MIME type instead of “human
readable” format. That should limit the choices it will return or at
least give you something you can work with.

John

On 19.09.2009 01:14, James M. wrote:

After looking around quite a bit, I thought about looking at the first
few lines of the file and matching against this regular expression:

Character class:

[:print:] Any printable character, including space

line.match(/^[[:print:]]+$/)

Which I believe could work. Any comments?

Just using a single “+” seems too unsafe to me: you need only three
matching bytes which does not seem too unlikely even for binary files.

Some more random thoughts: if you use Ruby to determine file types you
can as well use Find.find to find all files, removing the dependency
on an external program.

A completely different approach would be to define classes of bytes and
do statistics on the first n bytes of the file, e.g.

32-127, \r, \n, \t: printable
0-31 (without \n, \t, \r) and 128-255: non-printable

Then determine based on the ratio of occurrences. Of course, that
approach can also be tricky…
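
A minimal sketch of that statistic in Ruby; the 1024-byte sample and
the 0.85 threshold are my own illustrative choices, nothing settled:

```ruby
# Byte classes: 32-126 plus \t, \n, \r count as printable;
# everything else counts as non-printable.
PRINTABLE_BYTES = ((32..126).to_a + [9, 10, 13]).freeze

def printable_ratio(path, sample_size = 1024)
  bytes = (File.open(path, "rb") { |f| f.read(sample_size) } || "").bytes
  return 1.0 if bytes.empty?   # treat an empty file as text
  bytes.count { |b| PRINTABLE_BYTES.include?(b) }.fdiv(bytes.size)
end

def text_by_ratio?(path, threshold = 0.85)
  printable_ratio(path) >= threshold
end
```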

Kind regards

robert

By convention, source and object files use standardized file-type
extensions, which should help you weed out files to ignore.

As a starting point ask the developers what file-type extensions
they’re using. As a second check, run something like the following
commands at the top of the path you’ll be checking:

 find . | xargs -n1 basename | egrep '\.\w+$' | awk -F. '{print $NF}' | sort -u

to give you a list of possible extensions, then check those too.

Use “file” and “file -i” to do a best-guess once you’ve narrowed your
possibilities. Both use “magic” files which define where file should
look inside a target file to determine what type it is. They are
fallible though and you can get false positives. Do a “man magic” from
the command-line on your Linux box for more info.

Also, be careful assuming only binary files have \x00 bytes or
high-order ASCII. Old text files that have migrated from other systems
could have them, as could files where someone ALT+fat-fingered on the
keypad, as could a source file coming from a non-English-speaking
nation where the developer used variable names in his native language.
You just never know what you’ll find in those pesky source files.

FWIW Subversion flags binaries automatically.

If svn does that, I guess there’s gonna be some heuristics that work
reasonably well in practice.

Robert K. wrote:

32-127, \r, \n, \t printable
0-31 without \n, \t, \r, 128-255 non printable

I have a problem with considering 128-255 as non-printable. A lot of
these characters are printable, and can be part of text, much like I use
Alt-0xxx keys in PageMaker a lot. The other problem with saying a file
is not a text file is determining what is meant by a text file. Is it
strictly a file with only ASCII text like a log file, or does it include
formatted text like a word processor file? Word processing and
spreadsheet files contain many characters that are considered
non-printable but display as text with the correct program.

James M. wrote:

Hi all,

I need to search text files for a given expression and flag a warning/
error if that expression does not exist. I’m going to search a large
number of files using the Linux “find” command, so I won’t know if
they are binary or text.

require 'ptools'

File.binary?(your_file)

Regards,

Dan

On 20.09.2009 00:03, Michael W. Ryder wrote:


I have a problem with considering 128-255 being non-printable. A lot of
these characters are printable, and can be part of text, much like I use
Alt-0xxx keys in Pagemaker a lot.

That was just an example. Of course you can use a different
classification (for example, adding a third category for 127-255). I
assume those characters are comparatively rare in text files so the
general approach would still work.

The other problem with saying a file
is not a text file is determining what is meant by a text file. Is it
strictly a file with only ASCII text like a log file, or does it include
formatted text like a word processor file? Word processing and
spreadsheet files contain many characters that are considered
non-printable but display as text with the correct program.

I fully agree: the difficult part is in deciding: what is a text file?
If that has been clarified enough the algorithm for checking should
become much more obvious.

Cheers

robert

On Sep 20, 2:17 am, Robert K. [email protected] wrote:

I fully agree: the difficult part is in deciding: what is a text file?
If that has been clarified enough the algorithm for checking should
become much more obvious.

I agree - this is what it comes down to. BTW, I tried the following
on my project (using Find.find to get the tree) on the first 40
“lines” (which I know can theoretically be very short or long in a
“binary” file), and it seems to work for what I’m doing. This also
works well because I’m checking for the presence of a header, and I
can do both checks while the file is still open:

line.match(/^[[:print:]\t\n\r]+$/)
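
Put together, the scan described here might look like the sketch below;
HEADER is a hypothetical stand-in for the real header pattern, which is
project-specific:

```ruby
require "find"

HEADER = /Copyright/  # hypothetical placeholder for the real header

# Walk a tree and collect text-looking files whose first 40 lines
# lack the header pattern.
def scan_tree(root)
  missing = []
  Find.find(root) do |path|
    next unless File.file?(path)
    File.open(path, "rb") do |f|
      lines = f.each_line.first(40)
      textual = lines.all? { |l| l =~ /^[[:print:]\t\n\r]+$/ }
      missing << path if textual && lines.grep(HEADER).empty?
    end
  end
  missing
end
```

One caveat: an empty file vacuously passes the printable test and will
be flagged as missing the header.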

But probably a better approach would be to use a ratio of characters
that are printable against those that may traditionally be
non-printable, in the event that some “non-printable” characters are
present in a text file. This is what SVN does (I found the link in a
post from Xavier on a “ptools” website when I Googled it):

http://subversion.tigris.org/faq.html#binary-files

And it also appears to be what File.binary? is doing in ptools (I
checked the source code; thanks Dan for the pointer).
