Read write integer in binary into a file

acemtp · October 25, 2007, 4:36pm

Hello,

I have some big files containing a lot of “unsigned int” (4-byte) numbers, and I
want to read from and write to these files.

Currently, I found this to write:

myfile << [mynum].pack("i")

and to read:

mynum = myfile.read(4).unpack("i").first

I wonder if there’s something faster or simpler that avoids wrapping the
number in an array and converting that array to a string just to
serialize it.

Thank you.

acemtp · October 25, 2007, 5:04pm

Hi,

How about Marshal?

myfile << Marshal.dump(mynum)

and

mynum = Marshal.load(myfile.read)

Regards,

Park H.

acemtp · October 25, 2007, 5:08pm

How about Marshal?

Files are filled by an external C application that does something like:
fwrite(&myint, 4, 1, fp);

So I have to use the same file format.

acemtp · October 25, 2007, 5:19pm

It seems that marshaling a number doesn’t produce 4 bytes:

irb(main):036:0> mynum
=> 56515
irb(main):037:0> [mynum].pack("i")
=> "\303\334\000\000"
irb(main):038:0> Marshal.dump(mynum)
=> "\004\bi\002\303\334"
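The size difference can be checked directly. This is only a small sketch: it uses the "l" directive (exactly 32 bits, native byte order) instead of "i" as an assumption, so the result does not depend on the platform's int size. Marshal prepends a version header and a type tag and stores the integer in a variable-length form, which is why its output cannot match the raw 4-byte layout written by the C program.

```ruby
mynum = 56515

packed     = [mynum].pack("l")    # raw 32-bit integer, native byte order
marshalled = Marshal.dump(mynum)  # version header + type tag + variable-length payload

puts packed.bytesize       # 4
puts marshalled.bytesize   # differs from 4; depends on the value being dumped

# Marshal data only round-trips through Marshal itself.
puts Marshal.load(marshalled) == mynum
```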

acemtp · October 25, 2007, 5:16pm

Vianney L. wrote:

How about Marshal?

Files are filled by an external C application that does something like:
fwrite(&myint, 4, 1, fp);

So I have to use the same file format.

What file format? I don’t see any problem with using Marshal. It doesn’t
need a file format to be specified; it’s simply a marshal dump.

acemtp · October 25, 2007, 6:09pm

I wrote a function to do this which seems slightly faster, but could
perhaps stand some optimization:

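def pack_int32(n)
  str = "\0\0\0\0".b                # 4-byte binary buffer
  str.setbyte(3, (n >> 24) & 0xFF)  # most significant byte goes last
  str.setbyte(2, (n >> 16) & 0xFF)
  str.setbyte(1, (n >> 8) & 0xFF)
  str.setbyte(0, n & 0xFF)          # least significant byte goes first
  str
end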
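Note that the shifts in pack_int32 put the least significant byte first, so the result is always little-endian, whereas "i" follows the machine's native byte order. As a rough sketch, the same layout is available from pack through the "V" directive (32-bit unsigned, little-endian). The helper name int32_le_bytes below is hypothetical and exists only for the comparison.

```ruby
# Hypothetical helper: split an integer into 4 bytes, least significant first.
def int32_le_bytes(n)
  [n, n >> 8, n >> 16, n >> 24].map { |b| b & 0xFF }
end

n = 2_000_000
manual  = int32_le_bytes(n).pack("C*")  # bytes assembled by hand
builtin = [n].pack("V")                 # pack's explicit little-endian directive

puts manual == builtin
```

If the C application runs on a little-endian machine, "V" also reads its files correctly on any host, which "i" does not guarantee.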

Here are the benchmark results vs the other methods mentioned:

                   user     system      total        real
[].pack(i):    6.234000   0.235000   6.469000 (  6.500000)
pack_int32:    5.719000   0.015000   5.734000 (  5.734000)
Marshal.dump:  6.594000   0.219000   6.813000 (  6.813000)

I included Marshal.dump for completeness, but agree that it doesn’t
appear to be meant for this sort of thing. Here’s the source to run
the benchmark:

require 'benchmark'
number = 2_000_000
n = 1_000_000
Benchmark.bm(12) do |x|
  x.report('[].pack(i):') { n.times do; [number].pack('i'); end }
  x.report('pack_int32:') { n.times do; pack_int32(number); end }
  x.report('Marshal.dump:') { n.times do; Marshal.dump(number); end }
end

Adam

acemtp · October 25, 2007, 6:36pm

Using only the number 2_000_000 seems to skew the results. I see your
results with your test, but if I change it slightly to use a variety
of integers, I get more balanced results:

require 'benchmark'
MAX = 2**30
n = 1_000_000
nums = (0...n).map{ (rand*MAX).to_i }

Benchmark.bmbm do |x|
  x.report('pack(i):') { nums.each{ |num| [num].pack('i') } }
  x.report('pack32:') { nums.each{ |num| pack_int32(num) } }
  x.report('Dump:') { nums.each{ |num| Marshal.dump(num) } }
end

Rehearsal --------------------------------------------
pack(i):   5.813000   0.109000   5.922000 (  5.984000)
pack32:    5.234000   0.000000   5.234000 (  5.281000)
Dump:      5.906000   0.125000   6.031000 (  6.063000)
---------------------------------- total: 17.187000sec

               user     system      total        real
pack(i):   5.687000   0.125000   5.812000 (  5.875000)
pack32:    5.141000   0.016000   5.157000 (  5.188000)
Dump:      6.000000   0.078000   6.078000 (  6.141000)

acemtp · October 25, 2007, 6:17pm

Do you have to deal with each number individually? Maybe you could
build up an array of numbers and then pack them all at once:

arr = []
while work_to_do do
  mynum = generate_next_number
  arr << mynum
end
myfile.write arr.pack('i*')

That way you aren’t creating a new array for each number.

Similarly, for reading the file:
data = file.read
num_array = data.unpack('i*')

The '*' in (un)pack repeats the directive for all of the remaining
data.
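Here is a minimal round-trip sketch of the bulk approach. It assumes the "L" directive (32-bit unsigned, native byte order) because the original question mentions unsigned values; the sample numbers and the temporary file are made up for illustration.

```ruby
require "tempfile"

nums = [0, 65_535, 720_850, 4_294_967_295]  # sample values for illustration
read_back = nil

Tempfile.create("uint32") do |f|
  f.binmode                      # avoid newline translation on Windows
  f.write(nums.pack("L*"))       # one pack call for the whole array
  f.rewind
  read_back = f.read.unpack("L*")
end

puts read_back == nums
```

For files too large to read in one go, the same idea works chunk by chunk, for example by reading a multiple of 4 bytes at a time and unpacking each chunk with "L*".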

acemtp · December 13, 2007, 2:26pm

irb(main):001:0> f=open('test','w')
=> #<File:test>
irb(main):002:0> f<<[65535].pack('i')
=> #<File:test>
irb(main):003:0> f.tell
=> 4
irb(main):004:0> f<<[720850].pack('i')
=> #<File:test>
irb(main):005:0> f.tell
=> 9
The integer 720850 takes 5 bytes in my file, but it should take only 4
bytes! How can I fix this? Thanks!

acemtp · December 13, 2007, 2:42pm

Wu Junchen wrote:

irb(main):001:0> f=open('test','w')
=> #<File:test>
irb(main):002:0> f<<[65535].pack('i')
=> #<File:test>
irb(main):003:0> f.tell
=> 4
irb(main):004:0> f<<[720850].pack('i')
=> #<File:test>
irb(main):005:0> f.tell
=> 9
The integer 720850 takes 5 bytes in my file, but it should take only 4
bytes! How can I fix this? Thanks!

irb
irb(main):001:0> x = [720850].pack('i')
=> "\322\377\n\000"
irb(main):002:0> x.length
=> 4

So clearly the integer 720850 is packed into 4 bytes as requested. Why
does it occupy 5 bytes in the file? See the "\n" at index 2? That
means the 3rd byte is a newline character, and on Windows, in text
mode, Ruby turns each newline into CRLF, which is 2 bytes. Since you’ve
got binary data in your file, you don’t want to write a text file, so you
must open the file with the "b" flag in addition to "w":

f = open("test", "wb")
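As a quick check of the binary-mode fix, the sketch below writes the same two integers and then inspects the file size. The temporary directory and the file name test.bin are assumptions for the demo, and "l" (exactly 32 bits) is used in place of "i" so the size does not depend on the platform.

```ruby
require "tmpdir"

path = File.join(Dir.mktmpdir, "test.bin")  # throwaway location for the demo

File.open(path, "wb") do |f|                # "b" disables newline translation
  f << [65535].pack("l")
  f << [720850].pack("l")                   # contains a "\n" byte
end

puts File.size(path)                        # 8
values = File.binread(path).unpack("l*")    # read back in binary mode too
puts values.inspect                         # [65535, 720850]
```

Reading needs the same care: opening with "rb" or using File.binread keeps Windows from translating bytes on the way in.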