Hi, I have to: 1) generate a database of a couple dozens of millions of Fixnums (ranging from 0 to 2^32 - 1), while avoiding redundancy 2) iterate through them 3) quickly search for the presence of a given Fixnum in the database The Set class fulfills the speed conditions and conveniently handles redundancy itself, but its uses up too much memory. It looks like each entry uses up around a 100 bytes, even if I only put 4 bytes in there. Array#include? is too slow without solving the memory problem. Representing each Fixnum as 4 bytes in a huge String doesn't use up much memory at all and String#include? is fast enough, but I can't tell it to only search by 4 bytes increments. Could someone help me with a solution for this problem? Thank you in advance.
on 2012-05-23 12:22
on 2012-05-23 16:37
How about an array, where the the generated numbers are the indexes? Can store a flag value, and any not set value defaults to nil May be less memory than Hash?
on 2012-05-23 16:40
On Wed, May 23, 2012 at 12:22 PM, George Dupre <lists@ruby-forum.com> wrote: > Representing each Fixnum as 4 bytes in a huge String doesn't use up much > memory at all and String#include? is fast enough, but I can't tell it to > only search by 4 bytes increments. You could use a BigNum as bitset. But that might be slow. I once created a BitSet class which uses a String internally for storage. IIRC that was quite efficient and fast. I put up an illustrating example here https://gist.github.com/2775600 Kind regards robert
on 2012-05-23 17:01
George Dupre wrote in post #1061810: > 1) generate a database of a couple dozens of millions of Fixnums > (ranging from 0 to 2^32 - 1), while avoiding redundancy > 2) iterate through them > 3) quickly search for the presence of a given Fixnum in the database Hi, For memory efficiency, if you are on Windows, you can use RTensor with type of "unsigned". If not, you can use other memory-efficient structure for 4-byte data (such as NArray, although its "int" is 4-byte signed integer). For quick search, as has been pointed to, you have to store the data in sorted order, so that you can run binary search. If your data are static, it is fine. If your data are dynamic (often insert and remove), then we still need to find another data structure (such as C++ STL Set). Regards, Bill
on 2012-05-23 17:56
On 05/23/2012 06:17 AM, Chris Hulan wrote:
> Did you try a Hash?
Set uses a Hash internally. It's basically just a different interface to
a Hash.
-Justin
on 2012-05-23 17:59
Can you afford a fixed 512MB of RAM? Then you can just have a 2^32 entry bitmap (2^29 bytes) Iteration might be fast enough, you just need to search for bytes which are not \x00
on 2012-05-23 18:20
Thank you for your help, Sadly, it's dynamic data. Basically, I start with half a million integers, then I generate from this a bigger set of integers (using millions of calls to the include? method in the process), that I must add to the initial set while avoiding redundancy. Then rinse and repeat until around 20 million. I tried using a long string encoded in UTF-32BE, but the include? method is much too slow. These numbers are out of range for NArray, and this has to run on Linux so I can't use RTensor either. A 512 MB Bitmap could work. I will start by trying Robert Klemme's BitSet. Thank you very much. Regards, George
on 2012-05-23 18:44
George Dupre wrote in post #1061849: > > Sadly, it's dynamic data. Basically, I start with half a million > integers, then I generate from this a bigger set of integers (using > millions of calls to the include? method in the process), that I must > add to the initial set while avoiding redundancy. Then rinse and repeat > until around 20 million. Hi, Based on your parameter description, I think the bitmap suggested by Brian Candler is the best choice. Regards, Bill
on 2012-05-23 19:28
On Wed, May 23, 2012 at 6:44 PM, Admin Tensor <lists@ruby-forum.com> wrote: > Based on your parameter description, I think the bitmap suggested by > Brian Candler is the best choice. Just for the record: that is the same approach as I suggested earlier. https://gist.github.com/2775600 Cheers robert
on 2012-05-23 19:29
Alright, using Robert Klemme's BitSet class, Brian Candler's suggestion for the full database mixed with Sets for intermediate steps (especially iteration) did the trick. It may be a bit too slow, but nothing unreasonable, and some time can realistically be saved here and there. Thank you very much for your valuable help. Best regards, George
on 2012-05-23 20:50
Robert Klemme wrote in post #1061860: > On Wed, May 23, 2012 at 6:44 PM, Admin Tensor <lists@ruby-forum.com> > wrote: >> Based on your parameter description, I think the bitmap suggested by >> Brian Candler is the best choice. > > Just for the record: that is the same approach as I suggested earlier. > https://gist.github.com/2775600 > > Cheers > > robert Hi Robert, I am sorry that I did not look at your github. Brian Candler gave an explicit space feasibility for the problem (half gigabyte of RAM) but he does not explicitly show how to do the bitset. You gave the explicit code how to do it, but the code does not explicitly say how much RAM will be needed for the problem at hand. I am glad that the original poster finally could handle the problem. Regards, Bill
on 2012-05-23 22:21
> I have to: > 1) generate a database of a couple dozens of millions of Fixnums > (ranging from 0 to 2^32 - 1), while avoiding redundancy google_hash gem may help.
on 2012-05-23 22:29
On May 23, 2012, at 03:22 , George Dupre wrote: > Representing each Fixnum as 4 bytes in a huge String doesn't use up much > memory at all and String#include? is fast enough, but I can't tell it to > only search by 4 bytes increments. > > Could someone help me with a solution for this problem? Thank you in > advance. This seems like one of those %w[small fast good].pick(2) problems. Also... smells like homework. But if not, it reminds me of this article: http://highscalability.com/blog/2012/4/5/big-data-... and their implementation: https://github.com/clearspring/stream-lib it's java, so you could use it directly via jruby... or you can use unix IO via the included shell scripts
on 2012-05-24 01:09
Ryan Davis wrote in post #1061877: > Also... smells like homework. > > But if not, it reminds me of this article: > > http://highscalability.com/blog/2012/4/5/big-data-... Hi, Whether it is a homework or not, it is a very realistic problem that we may encounter in our daily programming. The article is about using probabilistic algorithms with some level of error. I think we all assumed that the original poster wants an error of zero. (If the problem could not be solved using several gigabytes of RAM, or if the computation time took too long, then probabilistic algorithms ought to be considered.) Regards, Bill
on 2012-05-24 02:28
On May 23, 2012, at 4:09 PM, Admin Tensor wrote: > Whether it is a homework or not, it is a very realistic problem that we > may encounter in our daily programming. Absolutely! I know all my databases store off tens of millions of unique integers that I iterate over and search against all the time. That's pretty much all I've been doing over my last 22 years of professional development. > The article is about using probabilistic algorithms with some level of > error. Some level of _precalculated_ error, which is what makes this an interesting solution. > I think we all assumed that the original poster wants an error > of zero. (If the problem could not be solved using several gigabytes of > RAM, or if the computation time took too long, then probabilistic > algorithms ought to be considered.) Making the assumption that requirements as stated are correct is what dooms software to be bad. Using a 512 megabyte in-memory bitmap is NOT a _good_ solution, it's just that it is one of the simplest solutions that meets all the (ridiculous) requirements.
on 2012-05-24 02:53
Ryan Davis wrote in post #1061894: > Making the assumption that requirements as stated are correct is what > dooms software to be bad. Using a 512 megabyte in-memory bitmap is NOT a > _good_ solution, it's just that it is one of the simplest solutions that > meets all the (ridiculous) requirements. I think we just helped the original poster to handle the software problem. Whether the requirements are solid or not, I think it is better to leave it out to the original poster. Regards, Bill
on 2012-05-24 03:12
On May 23, 2012, at 5:53 PM, Admin Tensor wrote: > Ryan Davis wrote in post #1061894: >> Making the assumption that requirements as stated are correct is what >> dooms software to be bad. Using a 512 megabyte in-memory bitmap is NOT a >> _good_ solution, it's just that it is one of the simplest solutions that >> meets all the (ridiculous) requirements. > > I think we just helped the original poster to handle the software > problem. I'm not so sure that we helped.
on 2012-05-24 19:34
Hi, Sorry, I should have given more details. While I am doing this is an academic setting, it isn't homework in the traditional sense. This is part of my "personal initiative project", reaching outside of my field of studies (math and physics) but with outside help allowed. It revolves around a Korean 4x4 board game, which I am currently trying to solve through brute force, hence the ridiculous requirements. Each possible position of the board is matched to an integer. An error of zero is imperative, but using 512 MB of RAM isn't an issue at all: I only need to run that calculation once, in a reasonable time. I will also look into this google_hash gem, it should also be useful. Thank you once again for your help. Best regards, George
on 2012-05-24 20:22
George Dupre wrote in post #1062009: > > I will also look into this google_hash gem, it should also be useful. Hi, I think in Ruby speed is not the primary objective. So if you need the code to run faster (by about the same order of magnitude), probably you can transform the critical part into C. Regards, Bill
on 2012-05-25 21:56
Hi George, On Thu, May 24, 2012 at 7:34 PM, George Dupre <lists@ruby-forum.com> wrote: > > Thank you once again for your help. I advanced the implementation a bit - now it doesn't choke as easily because of the single large String - there are #clear and #empty? methods - resetting is more efficient - there is Enumerable functionality https://gist.github.com/2775600 Have fun! Kind regards robert
on 2012-05-26 06:34
On Thu, May 24, 2012 at 1:22 PM, Admin Tensor <lists@ruby-forum.com>
wrote:
>
+1
The Pickaxe has a good section about this (it's actually about how to
integrate with a C library, but if you know C and know how to make Ruby
talk to C, then you should be able to get it).
on 2012-05-26 16:34
>I advanced the implementation a bit Thank you very much, this is incredibly useful! >I think in Ruby speed is not the primary objective. So if you need the code to run faster (by about the same order of magnitude), probably you can transform the critical part into C. I don't know much C at all, but I did use RubyInline for a few critical, but very simple methods. With that and Robert Klemme's BitSet, the program is efficient enough, now I just have to let it run for a while. Thank you once again for your help!
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.