Binary file: SAT

Hi all. I have never worked with binary files before, and I’m a bit
confused… I need to “read”, from a text file (a *.dxf file), a block of
lines encoded in a binary format (the SAT/SAB format, described here):

http://paulbourke.net/dataformats/sat/sat.pdf

How can I do it?

Alessandro B. wrote in post #994136:

Hi all. I have never worked with binary files before, and I’m a bit
confused…

Both numbers and characters are stored as integers in a file (or anywhere
else on a computer). One method of storing characters in a file is with the
ASCII encoding. For instance, in the ASCII encoding ‘a’ is stored as
the integer 97, taking up one byte total. Note that you could also
store the integer 97 in 4 bytes–the other three bytes would just be all
0’s.

You may also want to store the count of the number of banks in New York
in a file, which is 97. You could also store that in one byte. So the
question becomes, how do you know whether a 97 you read from the file is
supposed to be the count of banks or the letter ‘a’? The answer is: you
have to know how the data in the file is supposed to be interpreted.

If the integer in the first byte in a file is supposed to be the number
of banks, then you read in the integer as is; and if the integer in the
second byte in the file is supposed to be a letter, then you need to
convert the integer to a letter. In other words, you have to know ahead
of time what each byte in the file is supposed to represent.

Once you are familiar with what each byte in your file represents, you
can use String#unpack to tell ruby how many bytes each integer occupies,
and how to interpret the integer.
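As a minimal sketch of that idea, the same single byte can be read either as a count or as a letter, depending only on the unpack directive you choose. The variable names here are illustrative, not from any real file layout:

```ruby
# One byte holding the value 97, as it might come out of a file.
byte = "\x61".b

as_count  = byte.unpack("C").first  # "C": one unsigned 8-bit integer
as_letter = byte.unpack("a").first  # "a": one byte taken as a string

p as_count   # 97
p as_letter  # "a"
```

Nothing in the byte itself says which reading is right; that knowledge has to come from the file format's documentation.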

7stud – wrote in post #994163:

Once you are familiar with what each byte in your file represents, you
can use String#unpack to tell ruby how many bytes each integer occupies,
and how to interpret the integer.

But, I can’t get a simple unpack() example to work, so what do I know:

str = "\x00\x00\x00\x61"  # 0x61 is 97 in decimal, taking up 4 bytes

results = str.unpack("L")
p results

--output:--
[1627389952]

Hi,

2011/4/21 7stud – wrote:

results = str.unpack("L")
p results

--output:--
[1627389952]

It’s the correct result. L uses your system’s endianness, which seems
to be little-endian. If you force big-endian by using N instead of L,
you will get your expected 97.

ruby-1.9.2-p180 :008 > str = "\x61\x00\x00\x00"
=> "a\u0000\u0000\u0000"
ruby-1.9.2-p180 :009 > results = str.unpack("L")
=> [97]

ruby-1.9.2-p180 :011 > str = "\x00\x00\x00\x61"
=> "\u0000\u0000\u0000a"
ruby-1.9.2-p180 :012 > results = str.unpack("N")
=> [97]
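If you would rather not depend on the machine's byte order at all, a small sketch like the following may help. It uses "N" (32-bit big-endian) and "V" (32-bit little-endian), which give the same answer on any system:

```ruby
bytes = "\x00\x00\x00\x61".b

big    = bytes.unpack("N").first  # 32-bit unsigned, big-endian
little = bytes.unpack("V").first  # 32-bit unsigned, little-endian

p big                     # 97
p little                  # 1627389952
puts format("%08x", little)  # 61000000
```

The little-endian reading is the number from the earlier output: the byte 0x61 ends up in the most significant position.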

Thanks to you all. I’m beginning to understand a bit…

These are the first 20 lines of the binary-block in the file:

1
mogoo mih m o
1
_ll P/:1 [:,681 ^ 336>1<: ^ \VL ]*63;:- _nk ^ \VL mogqoo QK _mk H:; ^ /-
mo mmeogemi monn
1
n fqfffffffffffffffj:rooh n:rono
1

,27:>;:- {rn rn _nm mnmqoqoqjgmm |
1
=0;& {m rn {rn {l {rn {rn |
1
-:9@)+r:&:r>+±6= {rn rn {rn {rn {n {k {j |
1
32/ {i rn {rn {rn {h {n |
1
:&:@-:961:2:1+ {rn rn _j 8-6; n _l ±6 n _k ,
-9 o _l >;5 o _k 8->; o
_f /0,+<7:<4 o _k ,+03 oqoohilhjnjnfgklfljfh _k 1+03 lo _k ;,63 o _g
93>+1:, o _h /6’>-:> o _k 72>’ o _i 8-6;>- o _j 28-6; looo _j *8-6; o
_j )8-6; o _no :1;@96:3;, |
1
):-+:’@+:2/3>+: {rn rn l o n g |
1
-:9@)+r:&:r>+±6= {rn rn {rn {rn {l {k {j |


It consists of pairs of lines: the first is a code (always 1), the
second is the data. I think that the latter is written according to the
SAT format (well, the SAB format, since it’s binary…).
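One thing that may be worth trying before treating these lines as raw binary: the sample looks consistent with a simple character substitution, in which every printable character except a space is replaced by the character with code 159 minus its own code. That is an assumption about this sample, not something the linked SAT document states, and `decode_acis_line` is a hypothetical helper name:

```ruby
# Assumption: each printable ASCII byte (33..126) was stored as 159 - byte,
# and spaces were left alone. Other bytes are passed through unchanged.
def decode_acis_line(line)
  line.bytes.map { |b| (33..126).cover?(b) ? 159 - b : b }.pack("C*")
end

puts decode_acis_line("mogoo mih m o")  # 20800 267 2 0
puts decode_acis_line("mogqoo QK")      # 208.00 NT
```

Under that assumption the first data line becomes four plain numbers, which is what a text (SAT) header line would look like. If the rest of the block decodes into readable text the same way, the data may be disguised SAT text rather than SAB.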

ACIS supports two kinds of save files, SAT and SAB, which stand for
“Standard ACIS Text” and “Standard ACIS Binary”, respectively. Although
one is ASCII text and the other is binary data, the model data
information stored in the two formats is identical.
A SAB file has a .sab file extension. A SAB file uses delimiters
between elements and binary tags, without additional formatting.
The binary formats supported are:
int      4-byte 2s complement (as long)
long     4-byte 2s complement
double   8-byte IEEE
char     1-byte ASCII

where “byte” is eight bits, and files are considered to be byte strings.
For multi-byte data items, byte order normally just matches that of the
processor being used, but a specific order may be imposed by compiling
with the preprocessor macro BIG_ENDIAN or LITTLE_ENDIAN defined.
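The sizes in that table map directly onto unpack directives. Here is a rough sketch, assuming little-endian byte order (the spec leaves the order to the writing machine, so this is a guess you would need to verify against real data):

```ruby
# Build a sample byte string: a 4-byte int, an 8-byte double, a 1-byte char.
sample = [-42, 12.5, "x"].pack("l<Ea")

# "l<": 32-bit signed, little-endian; "E": 64-bit IEEE double,
# little-endian; "a": one byte taken as a string.
int_value, double_value, char_value = sample.unpack("l<Ea")

p sample.bytesize  # 13
p int_value        # -42
p double_value     # 12.5
p char_value       # "x"
```

For big-endian data the corresponding directives would be "l>" and "G".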

Alessandro B. wrote in post #994230:

Thanks to you all. I’m beginning to understand a bit…

These are the first 20 lines of the binary-block in the file:

Binary files aren’t human readable, i.e. they look like nonsense.

It consists of pairs of lines: the first is a code (always 1), the
second is the data.

Do not think of binary files as containing lines. A binary file is a
long continuous sequence of integers contained in a varying number of
bytes. And you have to know exactly what each sequence of bytes
represents ahead of time in order to read the data. For
instance, you have to know that the first 4 bytes are the count of banks
in New York, the next byte is a letter, the next 2 bytes are the
year, the next 2 bytes are the month, etc. If you do not know that
information, you cannot read the file.

Suppose one of the integers stored in one byte in the file happens
to be the ASCII code for a newline, so that when you print out the data
the integer gets converted to a newline. That does not mean the integer
is supposed to be interpreted as a newline. The integer might actually
be the count of the number of banks in New York! The reason you see a
newline in the output is that whatever device you used to view the
output thought it was supposed to convert that integer to a newline.

Suppose your file contains this data:

"\x00\x00\x00\x01"

Scenario 1:
The four bytes represent the number of widgets sold (=1).

Scenario 2:
The first two bytes represent the number of widgets sold (=0),
the third byte is the number of widgets in inventory (=0), and the
fourth byte is the number of widgets in transit to the factory (=1).

Unless you know ahead of time what each byte in the file is supposed to
represent, you cannot read the file correctly. If someone hands you the
file with the above data in it, and says, “Here’s your data. Get
cracking!”, and then the person walks out the door, how would you know
if Scenario 1 or Scenario 2 is the way the data is laid out?
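To make the two scenarios concrete, here is a small sketch that reads the same four bytes both ways. It assumes big-endian byte order, which is what makes Scenario 1 come out as 1:

```ruby
data = "\x00\x00\x00\x01".b

# Scenario 1: one 4-byte big-endian integer.
widgets_sold = data.unpack("N").first

# Scenario 2: a 2-byte big-endian integer followed by two 1-byte integers.
sold, in_inventory, in_transit = data.unpack("nCC")

p widgets_sold                       # 1
p [sold, in_inventory, in_transit]   # [0, 0, 1]
```

Both calls succeed without complaint, which is exactly the problem: the bytes cannot tell you which layout was intended.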

Alessandro B. wrote in post #994473:

Do not think of binary files as containing lines. A binary file is a
long continuous sequence of integers contained in a varying number of
bytes.

That’s OK, but the file I need to parse is a special text file (DXF
format) that consists of pairs of lines:

Binary files do not have lines. Until you understand that, you
cannot proceed. Binary files consist of blocks of data that are
adjacent to each other, so they form one continuous sequence of
data. Each block contains an integer. Each block may contain a
different number of bytes.

From the link you provided:

===
SAT files are ASCII text files that may be viewed with a simple text
editor. A SAT file
contains carriage returns, white space and other formatting that makes
it readable to the
human eye. A SAT file has a .sat file extension.

SAB files cannot be viewed with a simple text editor and are meant for
compactness and not
for human readability. A SAB file has a .sab file extension. A SAB file
uses delimiters
between elements and binary tags, without additional formatting.

That seems to indicate that there is some kind of delimiter between
blocks of data in a SAB file. If that is the case, then reading the
data is quite simple. You can read the whole file into a string and
then use split() with the delimiter to separate the data into an array.
In that case, you only need to know the order of the data, not the size
of each block.
See this recent thread:

http://www.ruby-forum.com/topic/1538357#new
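A sketch of the split() approach, with a made-up delimiter. The quoted text does not say which byte SAB actually uses, so `DELIMITER` and the sample contents below are purely hypothetical:

```ruby
# Hypothetical delimiter byte; the real one would come from the format spec.
DELIMITER = "\x00".b

# Stand-in for the contents of a file read in binary mode,
# e.g. with File.binread.
raw = "body\x00lump\x00shell".b

fields = raw.split(DELIMITER)
p fields  # ["body", "lump", "shell"]
```

One caution: this only works if the delimiter byte can never occur inside a field. A fixed-width binary integer or double can contain any byte value, so splitting a file that mixes delimiters with such values would cut those values in the wrong places.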

Do not think of binary files as containing lines. A binary file is a
long continuous sequence of integers contained in a varying number of
bytes.

That’s OK, but the file I need to parse is a special text file (DXF
format) that consists of pairs of lines: the 1st is a code that specifies
an object property (the colour of a line, the center of a circle, the
height of a text, etc.), and the 2nd is the value associated with it.
However, there is a special object, the 3dsolid, that has 4 or 5 pairs
like the above, followed by a long series of pairs in which the 1st line
is always 1 and the 2nd one is binary data.

Group code   Description
8            Layer name
70           Modeler format version number (currently = 1)
…            …
1            Proprietary data (multiple lines < 255 characters each)
3            Additional lines of proprietary data (if previous group 1
             string is greater than 255 characters) (optional)

For example, the following draws a line, in the layer “Walls”, from the
point (16.5, 12.5, 0.0) to (46.5, 12.5, 0.0).

0
LINE
8
Walls
10
16.5
20
12.5
30
0.0
11
46.5
21
12.5
31
0.0
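Reading this pair-of-lines layout is straightforward in Ruby. A minimal sketch, using the LINE example above as input (the variable names are illustrative):

```ruby
dxf_text = <<~DXF
  0
  LINE
  8
  Walls
  10
  16.5
  20
  12.5
  30
  0.0
  11
  46.5
  21
  12.5
  31
  0.0
DXF

# Take the lines two at a time: a numeric group code, then its value.
pairs = dxf_text.lines.map(&:strip).each_slice(2).map do |code, value|
  [code.to_i, value]
end

entity      = pairs.to_h
start_point = [10, 20, 30].map { |code| Float(entity[code]) }
end_point   = [11, 21, 31].map { |code| Float(entity[code]) }

p entity[0]    # "LINE"
p entity[8]    # "Walls"
p start_point  # [16.5, 12.5, 0.0]
p end_point    # [46.5, 12.5, 0.0]
```

Converting to a hash is fine for a LINE, where each group code appears once. For the 3dsolid, group code 1 repeats many times, so you would keep the array of pairs and collect every value whose code is 1, in order.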

My task is to “understand” the object “3dsolid”, which also has the
“Proprietary data”, i.e. the binary data. Searching on Google, I found that
this data is laid out according to the ACIS *.sab standard (the link in the
first post), so I think I should be able to read that binary… shouldn’t I?

On 2011-04-22 2:49 PM, 7stud – wrote:

Alessandro B. wrote in post #994473:

Do not think of binary files as containing lines. A binary file is a
long continuous sequence of integers contained in a varying number of
bytes.

That’s OK, but the file I need to parse is a special text file (DXF
format) that consists of pairs of lines:

Binary files do not have lines. Until you can understand that, you
cannot proceed. Binary files consist of blocks of bytes. Each block
contains some data. Each block consists of a different number of bytes.

It’s not too helpful to someone trying to deal with DXF files to make such
a strong distinction between binary and text files. I haven’t worked
with them and hope I never have to. A quick look at the Wikipedia
article and the most recent AutoCAD spec suggests that the files may be
best thought of as a mixture of binary and ASCII data. The original DXF
files were text files where each line was a key value pair with the
value generally a decimal representation of a floating point number.
There is now an optional file format that contains binary
representations of the numbers to reduce precision losses caused by
repeated conversions and save some space. Most of the 270 page
specification appears to describe the ASCII format with the binary
format introduced on page 242.

You can get a recent DXF spec at:

This may give a helpful overview:

Alessandro’s problem is to read and parse a file that contains small
fields to be interpreted as ASCII text, binary integers, floating point
numbers, etc. Just what will come next is determined by what came just
before with reference to a 270 page document which has a few
examples in Visual Basic 6.

I would proceed as follows:

  • Figure out which kinds of primitive data are expected in the files of
    interest.

  • For each kind, write and test a function to read and convert one such
    item.

  • Write a function to read the next entity record from the file. It’s
    likely that this function should return a Ruby object that
    represents the particular kind of entity.
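The second step might look something like this sketch. It assumes little-endian byte order and the field sizes from the SAB table quoted earlier in the thread; the function names are made up for illustration:

```ruby
require "stringio"

# Assumption: little-endian data; sizes taken from the quoted SAB table.
def read_int(io)
  io.read(4).unpack("l<").first   # 4-byte signed integer
end

def read_double(io)
  io.read(8).unpack("E").first    # 8-byte IEEE double
end

def read_char(io)
  io.read(1)                      # 1-byte ASCII character
end

# Test each reader against bytes whose meaning is known in advance.
io = StringIO.new([7, 2.5, "A"].pack("l<Ea"))
p read_int(io)     # 7
p read_double(io)  # 2.5
p read_char(io)    # "A"
```

Because the readers take any IO-like object, the same functions can be tested on a StringIO and later pointed at a file opened in binary mode. They do not handle running out of data; `io.read` returns nil at end of file, which a real reader would need to check.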

The ACIS spec says “The header is followed by a sequence of entity
records. Each entity record consists of a sequence number (optional), an
entity type identifier, the entity data, and a terminator.”

So to read an entity record, first read the sequence number if present,
then read the type identifier. The type identifier should be used to
select an appropriate function to read the data part of the entity
record. Then read the terminator unless it was already used to end the
entity data.
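The dispatch on the type identifier could be sketched as below. The byte layout here is entirely invented for illustration (a 1-byte name length, the type name, the data, then a "#" terminator, with no sequence number); the real SAB layout has to come from the spec and from hex dumps of actual files:

```ruby
require "stringio"

TERMINATOR = "#"  # hypothetical terminator byte for this toy layout

# Map each (hypothetical) entity type to a function that reads its data.
READERS = {
  "point" => ->(io) { io.read(24).unpack("E3") },        # three doubles
  "count" => ->(io) { io.read(4).unpack("l<").first }    # one 4-byte int
}

def read_entity(io)
  name_length = io.read(1).unpack("C").first
  type        = io.read(name_length)
  reader      = READERS.fetch(type) do
    raise ArgumentError, "unknown entity type: #{type}"
  end
  data = reader.call(io)
  raise "missing terminator" unless io.read(1) == TERMINATOR
  { type: type, data: data }
end

record = [5, "point", 1.0, 2.0, 3.0, "#"].pack("Ca5E3a")
entity = read_entity(StringIO.new(record))
p entity[:type]
p entity[:data]
```

Keeping the readers in a table means that supporting a new entity type is a one-line addition, and an unknown type fails loudly instead of silently misreading the bytes that follow.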

Essential tools:

Something to examine and print pieces of the data in hexadecimal. Use
this to explore the data and resolve questions about byte order, number
encoding, etc.

The Ruby String#pack and String#unpack methods.

Possibly an assortment of colored pencils to mark up printed hex dumps
of the data.

There may be some Ruby tools specifically intended for this kind of
work.
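For the first tool, a few lines of Ruby are enough. This is a rough sketch of a hex dump; `hex_dump` is a made-up helper, not a library function:

```ruby
# Return one formatted line per `width` bytes:
# offset, hex bytes, then printable ASCII (other bytes shown as ".").
def hex_dump(data, width = 16)
  data.bytes.each_slice(width).with_index.map do |chunk, i|
    hex  = chunk.map { |b| format("%02x", b) }.join(" ")
    text = chunk.map { |b| (32..126).cover?(b) ? b.chr : "." }.join
    format("%08x  %-#{width * 3 - 1}s  %s", i * width, hex, text)
  end
end

puts hex_dump("\x00\x00\x00\x61SAB\n".b)
```

To look at the start of a real file you could pass it something like the first few hundred bytes read in binary mode, then compare the hex column against what the spec says should be there.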

Caveat:
I may have written more than I know about some of the details but I
think the general ideas are correct.

– Bill