Memory-efficient set of Fixnums

dubstep · May 23, 2012, 12:22pm

Hi,

I have to:

generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy
iterate through them
quickly search for the presence of a given Fixnum in the database

The Set class fulfills the speed conditions and conveniently handles
redundancy itself, but its uses up too much memory. It looks like each
entry uses up around a 100 bytes, even if I only put 4 bytes in there.
Array#include? is too slow without solving the memory problem.
Representing each Fixnum as 4 bytes in a huge String doesn’t use up much
memory at all and String#include? is fast enough, but I can’t tell it to
only search by 4 bytes increments.

Could someone help me with a solution for this problem? Thank you in
advance.

geoder101 · May 23, 2012, 3:18pm

Did you try a Hash?

geoder101 · May 23, 2012, 3:52pm

Thanks, but Hash has an even bigger memory imprint, so that won’t do
either.

geoder101 · May 23, 2012, 4:21pm

Well, how about using binary search on a sorted array? That’s the
simplest way.

geoder101 · May 23, 2012, 4:37pm

How about an array, where the the generated numbers are the indexes?
Can store a flag value, and any not set value defaults to nil
May be less memory than Hash?

geoder101 · May 23, 2012, 4:40pm

On Wed, May 23, 2012 at 12:22 PM, George D. [email protected]
wrote:

Representing each Fixnum as 4 bytes in a huge String doesn’t use up much
memory at all and String#include? is fast enough, but I can’t tell it to
only search by 4 bytes increments.

You could use a BigNum as bitset. But that might be slow. I once
created a BitSet class which uses a String internally for storage.
IIRC that was quite efficient and fast. I put up an illustrating
example here

gist.github.com

https://gist.github.com/rklemme/2775600

BitSet.rb

#!/usr/bin/ruby

# Implementation of a bit set which stores membershio in binary
# Strings managed in chunks.
class BitSet
  ZERO = "\000".encode(Encoding::BINARY).freeze
  DEFAULT_SEGMENT_SIZE = 1024 * 1024

  # Create the BitSet instance. If max_num is given, we preallocate
  # as much memory as needed.

This file has been truncated. show original

Kind regards

robert

geoder101 · May 23, 2012, 5:56pm

On 05/23/2012 06:17 AM, Chris H. wrote:

Did you try a Hash?

Set uses a Hash internally. It’s basically just a different interface to
a Hash.

-Justin

geoder101 · May 23, 2012, 5:01pm

George D. wrote in post #1061810:

generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy

iterate through them

quickly search for the presence of a given Fixnum in the database

Hi,

For memory efficiency, if you are on Windows, you can use RTensor with
type of “unsigned”. If not, you can use other memory-efficient
structure for 4-byte data (such as NArray, although its “int” is 4-byte
signed integer).

For quick search, as has been pointed to, you have to store the data in
sorted order, so that you can run binary search.

If your data are static, it is fine. If your data are dynamic (often
insert and remove), then we still need to find another data structure
(such as C++ STL Set).

Regards,

Bill

geoder101 · May 23, 2012, 5:59pm

Can you afford a fixed 512MB of RAM? Then you can just have a 2^32 entry
bitmap (2^29 bytes)

Iteration might be fast enough, you just need to search for bytes which
are not \x00

geoder101 · May 23, 2012, 6:20pm

Thank you for your help,

Sadly, it’s dynamic data. Basically, I start with half a million
integers, then I generate from this a bigger set of integers (using
millions of calls to the include? method in the process), that I must
add to the initial set while avoiding redundancy. Then rinse and repeat
until around 20 million.

I tried using a long string encoded in UTF-32BE, but the include? method
is much too slow. These numbers are out of range for NArray, and this
has to run on Linux so I can’t use RTensor either. A 512 MB Bitmap could
work. I will start by trying Robert K.'s BitSet. Thank you very
much.

Regards,

George

geoder101 · May 23, 2012, 7:28pm

On Wed, May 23, 2012 at 6:44 PM, Admin T. [email protected]
wrote:

Based on your parameter description, I think the bitmap suggested by
Brian C. is the best choice.

Just for the record: that is the same approach as I suggested earlier.

gist.github.com

https://gist.github.com/rklemme/2775600

BitSet.rb

#!/usr/bin/ruby

# Implementation of a bit set which stores membershio in binary
# Strings managed in chunks.
class BitSet
  ZERO = "\000".encode(Encoding::BINARY).freeze
  DEFAULT_SEGMENT_SIZE = 1024 * 1024

  # Create the BitSet instance. If max_num is given, we preallocate
  # as much memory as needed.

This file has been truncated. show original

Cheers

robert

geoder101 · May 23, 2012, 6:44pm

George D. wrote in post #1061849:

Sadly, it’s dynamic data. Basically, I start with half a million
integers, then I generate from this a bigger set of integers (using
millions of calls to the include? method in the process), that I must
add to the initial set while avoiding redundancy. Then rinse and repeat
until around 20 million.

Hi,

Based on your parameter description, I think the bitmap suggested by
Brian C. is the best choice.

Regards,

Bill

geoder101 · May 23, 2012, 7:29pm

Alright, using Robert K.'s BitSet class, Brian C.'s suggestion
for the full database mixed with Sets for intermediate steps (especially
iteration) did the
trick. It may be a bit too slow, but nothing unreasonable, and some time
can realistically be saved here and there.

Thank you very much for your valuable help.

Best regards,

George

geoder101 · May 23, 2012, 10:21pm

I have to:

generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy

google_hash gem may help.

geoder101 · May 23, 2012, 8:50pm

Robert K. wrote in post #1061860:

On Wed, May 23, 2012 at 6:44 PM, Admin T. [email protected]
wrote:

Based on your parameter description, I think the bitmap suggested by
Brian C. is the best choice.

Just for the record: that is the same approach as I suggested earlier.
Basis for a bit set which uses String as storage. Storage is split up into multiple String instances to avoid too large allocations. · GitHub

Cheers

robert

Hi Robert,

I am sorry that I did not look at your github.

Brian C. gave an explicit space feasibility for the problem (half
gigabyte of RAM) but he does not explicitly show how to do the bitset.
You gave the explicit code how to do it, but the code does not
explicitly say how much RAM will be needed for the problem at hand.

I am glad that the original poster finally could handle the problem.

Regards,

Bill

geoder101 · May 24, 2012, 1:09am

Ryan D. wrote in post #1061877:

Also… smells like homework.

But if not, it reminds me of this article:

Hi,

Whether it is a homework or not, it is a very realistic problem that we
may encounter in our daily programming.

The article is about using probabilistic algorithms with some level of
error. I think we all assumed that the original poster wants an error
of zero. (If the problem could not be solved using several gigabytes of
RAM, or if the computation time took too long, then probabilistic
algorithms ought to be considered.)

Regards,

Bill

geoder101 · May 24, 2012, 2:28am

On May 23, 2012, at 4:09 PM, Admin T. wrote:

Whether it is a homework or not, it is a very realistic problem that we
may encounter in our daily programming.

Absolutely! I know all my databases store off tens of millions of unique
integers that I iterate over and search against all the time. That’s
pretty much all I’ve been doing over my last 22 years of professional
development.

The article is about using probabilistic algorithms with some level of
error.

Some level of precalculated error, which is what makes this an
interesting solution.

I think we all assumed that the original poster wants an error
of zero. (If the problem could not be solved using several gigabytes of
RAM, or if the computation time took too long, then probabilistic
algorithms ought to be considered.)

Making the assumption that requirements as stated are correct is what
dooms software to be bad. Using a 512 megabyte in-memory bitmap is NOT a
good solution, it’s just that it is one of the simplest solutions that
meets all the (ridiculous) requirements.

geoder101 · May 23, 2012, 10:29pm

On May 23, 2012, at 03:22 , George D. wrote:

Representing each Fixnum as 4 bytes in a huge String doesn’t use up much
memory at all and String#include? is fast enough, but I can’t tell it to
only search by 4 bytes increments.

Could someone help me with a solution for this problem? Thank you in
advance.

This seems like one of those %w[small fast good].pick(2) problems.
Also… smells like homework.

But if not, it reminds me of this article:

and their implementation:

it’s java, so you could use it directly via jruby… or you can use unix
IO via the included shell scripts

geoder101 · May 24, 2012, 3:12am

On May 23, 2012, at 5:53 PM, Admin T. wrote:

Ryan D. wrote in post #1061894:

Making the assumption that requirements as stated are correct is what
dooms software to be bad. Using a 512 megabyte in-memory bitmap is NOT a
good solution, it’s just that it is one of the simplest solutions that
meets all the (ridiculous) requirements.

I think we just helped the original poster to handle the software
problem.

I’m not so sure that we helped.

geoder101 · May 24, 2012, 2:53am

Ryan D. wrote in post #1061894:

Making the assumption that requirements as stated are correct is what
dooms software to be bad. Using a 512 megabyte in-memory bitmap is NOT a
good solution, it’s just that it is one of the simplest solutions that
meets all the (ridiculous) requirements.

I think we just helped the original poster to handle the software
problem. Whether the requirements are solid or not, I think it is
better to leave it out to the original poster.

Regards,

Bill