Byte Arrays in Ruby

Detlef_R · December 31, 2013, 5:40pm

I have an algorithm which uses an array of boolean (True/False) data.
I would like to conserve and extend memory use by having the array
use less memory.

Is it possible in Ruby (2.1.0 >) to effectively create byte arrays
so that each index (array[n]) is a byte boolean?

Or, is there a gem that provides this feature?

jabari_z · December 31, 2013, 6:05pm

Subject: Byte Arrays in Ruby
Date: mar 31 dic 13 05:40:32 +0100

Quoting Jabari Z. ([email protected]):

I have an algorithm which uses an array of boolean (True/False) data.
I would like to conserve and extend memory use by having the array
use less memory.

Is it possible in Ruby (2.1.0 >) to effectively create byte arrays
so that each index (array[n]) is a byte boolean?

Maybe there is something that helps in that direction, but from what I
know, since in Ruby everything is an object, even if you were to limit
the disk occupation of the actual variable to one byte, it would then
need all the management infrastructure that each other object
requires. Thus, I think you would not save so much.

If you are short in memory, you may create a C-coded class that
manages blocks of these bytes - say 1k or 1M - where the storage is
kept in an allocated C array. You’ll access single values or ranges of
values through calls included in your C code. But when data enters
Ruby-land, I’d have it converted to Ruby integers.

The way to best code such a class would depend on your actual needs.

Carlo

jabari_z · December 31, 2013, 6:07pm

On Tue, Dec 31, 2013 at 11:40 AM, Jabari Z. [email protected]
wrote:

I have an algorithm which uses an array of boolean (True/False) data.
I would like to conserve and extend memory use by having the array
use less memory.

Is it possible in Ruby (2.1.0 >) to effectively create byte arrays
so that each index (array[n]) is a byte boolean?

Or, is there a gem that provides this feature?

If you search “boolean” on Ruby-Toolbox.com, it shows several hits
that will store booleans into bits in an ActiveRecord model. A
quick glance over the first several screenfuls didn’t show one to do
it generically (i.e., w/o AR, or at least not depending on a
database). So, if you want something generic, it sounds like a decent
project for you to build a gem to do.

-Dave

jabari_z · December 31, 2013, 6:11pm

Is it something like this GitHub - ingramj/bitarray: A bitarray class for Ruby, implemented as a C extension. ?

Search for bit array (not byte array). More results!

Abinoam Jr.

On Tue, Dec 31, 2013 at 1:59 PM, Dave A.

jabari_z · January 1, 2014, 11:13am

On Dec 31, 2013, at 8:40, Jabari Z. [email protected] wrote:

I have an algorithm which uses an array of boolean (True/False) data.
I would like to conserve and extend memory use by having the array
use less memory.

Why? Do you have a measurable problem?

jabari_z · December 31, 2013, 6:10pm

Subject: Re: Byte Arrays in Ruby
Date: mar 31 dic 13 06:04:05 +0100

Quoting Carlo E. Prelz ([email protected]):

Maybe there is something that helps in that direction, but from what I
know, since in Ruby everything is an object, even if you were to limit
the disk occupation of the actual variable to one byte, it would then
need all the management infrastructure that each other object
requires. Thus, I think you would not save so much.

Oops. I had not noticed that you’d like to use those data as
booleans. In Ruby, booleans are already quite optimized memory-wise.

The saving is even less in this case.

Carlo

jabari_z · January 1, 2014, 10:54pm

On Tue, Dec 31, 2013 at 6:09 PM, Carlo E. Prelz [email protected]
wrote:

Oops. I had not noticed that you’d like to use those data as
booleans. In Ruby, booleans are already quite optimized memory-wise.

The saving is even less in this case.

Actually no. If you need, say, 120 boolean flags you can store them
in 15 bytes (120 bits / 8 bits per byte). If you have a Ruby Array
with 120 entries you need 120 times the size of a single pointer (IIRC
4 bytes = 32 bit on a 32 bit Ruby). That makes 120*4 = 480 bytes.

Kind regards

robert

jabari_z · January 1, 2014, 10:46pm

Ryan D. wrote in post #1131938:

On Dec 31, 2013, at 8:40, Jabari Z. [email protected] wrote:

I have an algorithm which uses an array of boolean (True/False) data.
I would like to conserve and extend memory use by having the array
use less memory.

Why? Do you have a measurable problem?

Yes.

Ideally I want to store the boolean (True/False) data in arrays that
take the least amount of memory to enable me to run out of the cpu cache
as much as possible and to allow the algorithm to search over larger
numbers.
Currently I’m memory limited on the size of numbers I can process.

On a 32/64 bit cpu a boolean would take 4/8 bytes of memory.
Even if I could do just byte addressing I could compute up to 4/8 times
larger numbers, and with bit arrays 32/64 times larger number within the
same address space, which I could fit in cache memory.

This is done all the time with languages like C/++, et al, which allow
you to control hardware at the bit level of registers and memory.

If you chunk your sizes correctly, the performance/memory advantages can
exceed the administrative overhead versus standard arrays,

It would be nice if Ruby would include this capability natively.

I did try the gems that were suggested (Peter C. and Ingram’s) and
they do work for me. Now I want to see if I can make them faster or find
something that is.

So a grateful thanks to Abinoam Jr. for pointing me to that gem.

jabari_z · January 3, 2014, 6:58am

Robert K. wrote in post #1131967:

On Wed, Jan 1, 2014 at 10:46 PM, Jabari Z. [email protected] wrote:

Ideally I want to store the boolean (True/False) data in arrays that
take the least amount of memory to enable me to run out of the cpu cache
as much as possible and to allow the algorithm to search over larger
numbers.

What algorithm is that?

Currently I’m memory limited on the size of numbers I can process.

On a 32/64 bit cpu a boolean would take 4/8 bytes of memory.
Even if I could do just byte addressing I could compute up to 4/8 times
larger numbers, and with bit arrays 32/64 times larger number within the
same address space, which I could fit in cache memory.

This is done all the time with languages like C/++, et al, which allow
you to control hardware at the bit level of registers and memory.

Well, if your performance needs are that pressing then Ruby may not
be the proper tool.

If you chunk your sizes correctly, the performance/memory advantages can
exceed the administrative overhead versus standard arrays,

It would be nice if Ruby would include this capability natively.

There are indeed some options (see below - albeit probably not exactly
what you expect) but I am not sure whether the performance gains you
are hoping for will be realized. The reason is that if the rest of
your algorithm is implemented in Ruby and it allocates objects then
there will be overhead of Ruby’s GC plus you have no control over the
placement in memory. The effect of well aligned byte arrays in memory
may just not measurable.

I did try the gems that were suggested (Peter C. and Ingram’s) and
they do work for me. Now I want to see if I can make them faster or find
something that is.

There are two quite obvious alternatives:

use Integers

use Strings

I once cooked something together using Strings with an optimization
using chunks:
Basis for a bit set which uses String as storage. Storage is split up into multiple String instances to avoid too large allocations. · GitHub

Note sure what state that is in.

If you rarely change bits and query frequently then a simple Integer
works pretty well because operator [] is used for bit indexing:

irb(main):003:0> flag = ‘0011011010’.to_i(2)
=> 218
irb(main):004:0> flag.to_s 16
=> “da”
irb(main):005:0> 8.times {|idx| printf “%3d %1d\n”, idx, flag[idx]}
0 0
1 1
2 0
3 1
4 1
5 0
6 1
7 1
=> 8

If there are frequent updates then solutions using String are more
efficient. If you consider non Ruby solutions then of course a C
solution is even more efficient - if that difference is measurable. I
would always start with a Ruby implementation and only switch to a C
implementation if Ruby proves too slow.

Kind regards

robert

Thanks for the info on strings holding flags.

I tried it in my algorithm, and it can functionally work, but it is
waaayy, waaayy, slower than with arrays,

I need to store boolean arrays on the order of 100s of millions of bits
and greater. I recall reading that strings greater than something like
just 29 characters cause a severe processing hit in performance, so this
isn’t a practical solution for me.

What I was able to do, though, is use use Ingram’s bitarray gem (which
is written in C) and tweak my code to actually not only run faster than
the code with standard Ruby arrays, but so allow me to process at
minimum 4 times larger arrays.

More performance/features can definitely be squeezed out of the gem
since it seems development activity ceased on it in 2008. In the
comments in the source code he even stated some of the algorithms were
not optimal, and could be improved.

I believe there are a number of C/C++ libraries, or techniques from
them, that can be used to improve the code. I’m not really a C/C++ coder
though, so someone who is would probably be aware of tricks and
practices to make the code better (faster).

This would seem like a nice project for something like a Google Summer
of Code (GSoC) to create a standard Ruby BitArray library.

jabari_z · January 1, 2014, 11:07pm

On Wed, Jan 1, 2014 at 10:46 PM, Jabari Z. [email protected] wrote:

Ideally I want to store the boolean (True/False) data in arrays that
take the least amount of memory to enable me to run out of the cpu cache
as much as possible and to allow the algorithm to search over larger
numbers.

What algorithm is that?

Currently I’m memory limited on the size of numbers I can process.

On a 32/64 bit cpu a boolean would take 4/8 bytes of memory.
Even if I could do just byte addressing I could compute up to 4/8 times
larger numbers, and with bit arrays 32/64 times larger number within the
same address space, which I could fit in cache memory.

This is done all the time with languages like C/++, et al, which allow
you to control hardware at the bit level of registers and memory.

Well, if your performance needs are that pressing then Ruby may not
be the proper tool.

If you chunk your sizes correctly, the performance/memory advantages can
exceed the administrative overhead versus standard arrays,

It would be nice if Ruby would include this capability natively.

There are indeed some options (see below - albeit probably not exactly
what you expect) but I am not sure whether the performance gains you
are hoping for will be realized. The reason is that if the rest of
your algorithm is implemented in Ruby and it allocates objects then
there will be overhead of Ruby’s GC plus you have no control over the
placement in memory. The effect of well aligned byte arrays in memory
may just not measurable.

I did try the gems that were suggested (Peter C. and Ingram’s) and
they do work for me. Now I want to see if I can make them faster or find
something that is.

There are two quite obvious alternatives:

use Integers
use Strings

I once cooked something together using Strings with an optimization
using chunks:

gist.github.com

https://gist.github.com/rklemme/2775600

BitSet.rb

#!/usr/bin/ruby

# Implementation of a bit set which stores membershio in binary
# Strings managed in chunks.
class BitSet
  ZERO = "\000".encode(Encoding::BINARY).freeze
  DEFAULT_SEGMENT_SIZE = 1024 * 1024

  # Create the BitSet instance. If max_num is given, we preallocate
  # as much memory as needed.

This file has been truncated. show original

Note sure what state that is in.

If you rarely change bits and query frequently then a simple Integer
works pretty well because operator [] is used for bit indexing:

irb(main):003:0> flag = ‘0011011010’.to_i(2)
=> 218
irb(main):004:0> flag.to_s 16
=> “da”
irb(main):005:0> 8.times {|idx| printf “%3d %1d\n”, idx, flag[idx]}
0 0
1 1
2 0
3 1
4 1
5 0
6 1
7 1
=> 8

If there are frequent updates then solutions using String are more
efficient. If you consider non Ruby solutions then of course a C
solution is even more efficient - if that difference is measurable. I
would always start with a Ruby implementation and only switch to a C
implementation if Ruby proves too slow.

Kind regards

robert

jabari_z · January 3, 2014, 2:23pm

On Fri, Jan 3, 2014 at 12:10 PM, Robert K.
[email protected] wrote:

On Fri, Jan 3, 2014 at 6:58 AM, Jabari Z. [email protected] wrote:

I tried it in my algorithm, and it can functionally work, but it is
waaayy, waaayy, slower than with arrays,

Well, I guess that is the usual space vs. time tradeoff.

I noticed a potential bottleneck when expanding individual segments. I
changed that (line 57f), added preallocation and added unit tests.

gist.github.com

https://gist.github.com/rklemme/2775600

BitSet.rb

#!/usr/bin/ruby

# Implementation of a bit set which stores membershio in binary
# Strings managed in chunks.
class BitSet
  ZERO = "\000".encode(Encoding::BINARY).freeze
  DEFAULT_SEGMENT_SIZE = 1024 * 1024

  # Create the BitSet instance. If max_num is given, we preallocate
  # as much memory as needed.

This file has been truncated. show original

If you like you could redo your tests with and without preallocation
and see how much it differs.

Cheers

robert

jabari_z · January 5, 2014, 5:14pm

Robert K. wrote in post #1132117:

On Fri, Jan 3, 2014 at 12:10 PM, Robert K.
[email protected] wrote:

On Fri, Jan 3, 2014 at 6:58 AM, Jabari Z. [email protected] wrote:

I tried it in my algorithm, and it can functionally work, but it is
waaayy, waaayy, slower than with arrays,

Well, I guess that is the usual space vs. time tradeoff.

I noticed a potential bottleneck when expanding individual segments. I
changed that (line 57f), added preallocation and added unit tests.

Basis for a bit set which uses String as storage. Storage is split up into multiple String instances to avoid too large allocations. · GitHub

If you like you could redo your tests with and without preallocation
and see how much it differs.

Cheers

robert

Here is the article on Strings length performance I alluded to.

Never Create Ruby Strings Longer Than 23 Characters
http:/patshaughnessy.net/2012/1/4/never-create-ruby-strings-longer-than-23-characters

This came out in January 2012 and talked about Ruby 1.9 behavior. Would
be interesting to see if this applies to 2.0/2.1 as well.

Also besides a native BitArray class, a ByteArray class would also be
very useful. Since most cpus are byte-addressable they would be faster
than BitArrays while extending memory usage beyond regular Arrays.

Providing these array options would make Ruby much more useful for
arithmetic, statistical, scientific and engineering applications by
allowing Ruby to be more inherently able to manipulate hardware and do
boolean arithmetic in a more efficient manner.

Maybe the Ruby-Core people (or someone with an itch to scratch) can
consider these for future native (C) implementations.

jabari_z · January 5, 2014, 5:50pm

On Sun, Jan 5, 2014 at 5:14 PM, Jabari Z. [email protected] wrote:

Robert K. wrote in post #1132117:

On Fri, Jan 3, 2014 at 12:10 PM, Robert K.
[email protected] wrote:

On Fri, Jan 3, 2014 at 6:58 AM, Jabari Z. [email protected] wrote:

I noticed a potential bottleneck when expanding individual segments. I
changed that (line 57f), added preallocation and added unit tests.

Basis for a bit set which uses String as storage. Storage is split up into multiple String instances to avoid too large allocations. · GitHub

If you like you could redo your tests with and without preallocation
and see how much it differs.

Btw, did you do some testing with the changed implementation from above?

Here is the article on Strings length performance I alluded to.

Never Create Ruby Strings Longer Than 23 Characters

http:/patshaughnessy.net/2012/1/4/never-create-ruby-strings-longer-than-23-characters

This came out in January 2012 and talked about Ruby 1.9 behavior. Would
be interesting to see if this applies to 2.0/2.1 as well.

Thanks for the reference! I do believe though that the article is not
fully correct. For example, if I am not mistaken a shared string
comes into existence by operations like #dup and #clone - not by
simple assignment. Reason: assignment just copies the reference which
should be the pointer to the struct - not the struct.

I would only distinguish only two types of String: heap strings and
inline (or embedded) strings. Sharing is just one more of operation of
the string. Sharing basically means that there are more than one
String instances referencing the same character (or rather byte)
array. This makes copy-on-write necessary.

Also, when using preallocation with the BitSet implementation I
provided none of these issues are pertinent as they are only relevant
on (re)allocations. Even without preallocation efficiency completely
depends on the usage patterns.

Also besides a native BitArray class, a ByteArray class would also be
very useful. Since most cpus are byte-addressable they would be faster
than BitArrays while extending memory usage beyond regular Arrays.

There is a ByteArray class: surprisingly it’s called “String”. You
can set the encoding to ‘BINARY’ as I have done in the BitSet class
but that is not necessary. There are String#bytesize, String#getbyte
and String#setbyte and a few other methods:

irb(main):008:0> String.instance_methods.grep /byte/i
=> [:bytesize, :getbyte, :setbyte, :byteslice, :bytes, :each_byte]

Providing these array options would make Ruby much more useful for
arithmetic, statistical, scientific and engineering applications by
allowing Ruby to be more inherently able to manipulate hardware and do
boolean arithmetic in a more efficient manner.

For numeric applications I believe often NArray is used.

Cheers

robert

jabari_z · January 3, 2014, 12:10pm

On Fri, Jan 3, 2014 at 6:58 AM, Jabari Z. [email protected] wrote:

Thanks for the info on strings holding flags.

You’re welcome.

I tried it in my algorithm, and it can functionally work, but it is
waaayy, waaayy, slower than with arrays,

Well, I guess that is the usual space vs. time tradeoff.

I need to store boolean arrays on the order of 100s of millions of bits
and greater. I recall reading that strings greater than something like
just 29 characters cause a severe processing hit in performance, so this
isn’t a practical solution for me.

Do you have a reference for that? I am not aware of this but the
information would be interesting.

What I was able to do, though, is use use Ingram’s bitarray gem (which
is written in C) and tweak my code to actually not only run faster than
the code with standard Ruby arrays, but so allow me to process at
minimum 4 times larger arrays.

If you know the size beforehand, you can allocate a single String and
use that with #getbyte and #setbyte. At the moment I cannot imagine
what additional performance penalties that may cause because there is
no reallocation or GC involved in that case.

This would seem like a nice project for something like a Google Summer
of Code (GSoC) to create a standard Ruby BitArray library.

I’d call it BitSet but yes, I agree.

Kind regards

robert