On 2011-04-22 2:49 PM, 7stud – wrote:
Alessandro B. wrote in post #994473:
Do not think of binary files as containing lines. A binary file is a
long continuous sequence of integers contained in a varying number of
bytes.
That’s OK. but the file I need to parse is a special txt file (DXF
format) that consist of couple-of-line:
Binary files do not have lines. Until you can understand that, you
cannot proceed. Binary files consist of blocks of bytes. Each block
contains some data. Each block consists of a different number of bytes.
It's not too helpful to someone trying to deal with DXF files to make such
a strong distinction between binary and text files. I haven’t worked
with them and hope I never have to. A quick look at the Wikipedia
article and the most recent AutoCAD spec suggests that the files may be
best thought of as a mixture of binary and ASCII data. The original DXF
files were text files made up of pairs of lines, a group code followed by
its value, with the value generally a decimal representation of a
floating point number.
There is now an optional file format that contains binary
representations of the numbers to reduce precision losses caused by
repeated conversions and save some space. Most of the 270-page
specification appears to describe the ASCII format with the binary
format introduced on page 242.
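For the ASCII flavour, here is a minimal sketch of reading the file as key-value pairs. It assumes the layout Alessandro describes, where a line holding an integer group code is followed by one line holding its value; the method name read_dxf_pairs and the sample data are made up for illustration and should be checked against the spec.

```ruby
require "stringio"

# Hypothetical helper: read an ASCII DXF stream as [group_code, value] pairs.
# Assumes each group-code line is followed by exactly one value line.
def read_dxf_pairs(io)
  pairs = []
  until io.eof?
    code_line  = io.gets
    value_line = io.gets
    break if value_line.nil?            # dangling code line at end of file
    # Group codes are decimal integers, often padded with spaces.
    # strip also removes a trailing "\r" from CRLF files.
    pairs << [Integer(code_line.strip, 10), value_line.strip]
  end
  pairs
end

# Made-up sample data, only to exercise the helper.
sample = StringIO.new("  0\nSECTION\n  2\nHEADER\n  0\nENDSEC\n  0\nEOF\n")
pairs = read_dxf_pairs(sample)
pairs.each { |code, value| puts "#{code}: #{value}" }
```

The values are left as strings here; converting each one to an integer, float, or string depending on its group code would be the next step.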
You can get a recent DXF spec at:
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
This may give a helpful overview:
http://en.wikipedia.org/wiki/Dxf
Alessandro’s problem is to read and parse a file that contains small
fields to be interpreted as ASCII text, binary integers, floating point
numbers, etc. What comes next is determined by what came just
before, as defined by a 270-page document which has a few
examples in Visual Basic 6.
I would proceed as follows:
- Figure out which kinds of primitive data are expected in the files of
  interest.
- For each kind, write and test a function to read and convert one such
  item.
- Write a function to read the next entity record from the file. It's
  likely that this function should return a Ruby object that represents
  the particular kind of entity.
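As a rough sketch of the second step, the readers for primitive items could look something like this. The method names, the choice of little-endian byte order, and the NUL-terminated string are all assumptions for illustration; the real sizes and encodings have to come from the spec and from looking at the data.

```ruby
require "stringio"

# Hypothetical helpers: each reads one primitive item from an IO that is
# positioned at the start of that item. Little-endian is an assumption.
def read_int16(io)
  io.read(2).unpack("s<").first   # signed 16-bit integer
end

def read_int32(io)
  io.read(4).unpack("l<").first   # signed 32-bit integer
end

def read_double(io)
  io.read(8).unpack("E").first    # 64-bit float, little-endian
end

def read_cstring(io)
  bytes = "".b                    # binary (ASCII-8BIT) buffer
  while (ch = io.read(1)) && ch != "\0"
    bytes << ch
  end
  bytes
end

# Build some test bytes with Array#pack and read them back in order.
data = [-2, 70_000, 1.5].pack("s<l<E") + "LINE\0"
io = StringIO.new(data)
values = [read_int16(io), read_int32(io), read_double(io), read_cstring(io)]
p values
```

Testing each reader against bytes built with Array#pack, as above, makes it easy to catch a wrong size or byte order before pointing the code at a real file.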
The ACIS spec says “The header is followed by a sequence of entity
records. Each entity record consists of a sequence number (optional), an
entity type identifier, the entity data, and a terminator.”
So to read an entity record, first read the sequence number if present,
then read the type identifier. The type identifier should be used to
select an appropriate function to read the data part of the entity
record. Then read the terminator unless it was already used to end the
entity data.
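The dispatch idea can be sketched on a toy format. Everything about the layout below is an assumption made up for the example (whitespace-separated tokens, an optional "-<n>" sequence number, "#" as the terminator, and the "point" and "circle" types); it is not taken from the spec, only shaped like the description quoted above.

```ruby
# Toy record layout (an assumption for illustration only):
#   [-<n>] <type> <data tokens...> #
Point  = Struct.new(:x, :y)
Circle = Struct.new(:cx, :cy, :radius)

# Entity type identifier => function that converts the data tokens
# into a Ruby object for that kind of entity.
READERS = {
  "point"  => ->(data) { Point.new(*data.map { |t| Float(t) }) },
  "circle" => ->(data) { Circle.new(*data.map { |t| Float(t) }) },
}

# Consume one entity record from the front of the token array.
# Returns [sequence_number_or_nil, entity], or nil when no tokens remain.
def read_entity(tokens)
  return nil if tokens.empty?
  # Optional sequence number, written as "-<n>" in this toy layout.
  sequence = Integer(tokens.shift[1..-1]) if tokens.first =~ /\A-\d+\z/
  type = tokens.shift
  data = []
  data << tokens.shift until tokens.empty? || tokens.first == "#"
  tokens.shift                       # consume the terminator
  reader = READERS.fetch(type) do
    raise ArgumentError, "unknown entity type: #{type}"
  end
  [sequence, reader.call(data)]
end

tokens = "-0 point 1.0 2.5 # circle 0 0 3 #".split
first  = read_entity(tokens)
second = read_entity(tokens)
p first, second
```

Keeping the type-to-reader mapping in a hash means supporting a new entity type is one new entry rather than another branch in a long case statement.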
Essential tools:
- Something to examine and print pieces of the data in hexadecimal. Use
  this to explore the data and resolve questions about byte order, number
  encoding, etc.
- Ruby's Array#pack and String#unpack methods.
- Possibly an assortment of colored pencils to mark up printed hex dumps
  of the data.
There may be some Ruby tools specifically intended for this kind of
work.
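If nothing better is at hand, a small hex dumper is easy to write. This is only a sketch (hexdump is a made-up helper name, not a library method), followed by a pack/unpack round trip showing how the same two bytes read under each byte-order assumption.

```ruby
# Hypothetical helper: return an offset / hex / ASCII dump of a binary string.
def hexdump(data, width = 16)
  lines = []
  data.bytes.each_slice(width).with_index do |chunk, i|
    hex   = chunk.map { |b| format("%02x", b) }.join(" ")
    # Show printable ASCII as-is and everything else as a dot.
    ascii = chunk.map { |b| (32..126).cover?(b) ? b.chr : "." }.join
    lines << format("%08x  %-#{width * 3 - 1}s  %s", i * width, hex, ascii)
  end
  lines.join("\n")
end

# A 16-bit integer 1 followed by the double 1.5, both little-endian.
bytes = [1, 1.5].pack("s<E")
puts hexdump(bytes)

# The same two bytes interpreted under each byte-order assumption.
little = bytes[0, 2].unpack("s<").first
big    = bytes[0, 2].unpack("s>").first
puts "little-endian: #{little}, big-endian: #{big}"
```

Comparing the two interpretations against a value you already know is in the file is usually the quickest way to settle the byte-order question.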
Caveat:
I may have written more than I know about some of the details but I
think the general ideas are correct.
– Bill