I'm trying to figure out a good way to extend an Array, when the items may not be added in index order, and I don't know the final highest index. Does anyone have any suggestions? What occurs to me is that I could double the length of the array whenever I want to add more than one element by doing something like: a = a + Array.new(a.length) Naturally, it would need to be a bit more complicated, as, e.g., I'd need to ensure that the resulting array actually *would* hold the proposed index. Still, something only a little more complicated should _work_, but is that the best approach?
on 2012-11-12 18:17
on 2012-11-12 18:44
push? I've used that when I didn't know (or care) about the length of the array. Wayne
on 2012-11-12 19:22
Charles Hixson wrote in post #1084111: > I'm trying to figure out a good way to extend an Array, when the items > may not be added in index order, and I don't know the final highest > index. Does anyone have any suggestions? This happens automatically. >> a = [:foo, :bar] => [:foo, :bar] >> a[10] = :baz => :baz >> a => [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz] As others have said: using a Hash may be more efficient, if there are large gaps - but then you may have to sort the keys if you want to iterate over it in order. It depends on your use case.
on 2012-11-12 22:12
Brian Candler wrote: >>> a > => [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz] > > As others have said: using a Hash may be more efficient, if there are > large gaps - but then you may have to sort the keys if you want to > iterate over it in order. It depends on your use case. > Hashes are slow compared to direct indexing. Push might be a good answer, but not if it allocates storage item by item. Allocation of memory is another expensive operation. But I *am* considering using gdbm instead of an Array. That is slow like a hash, but it eliminates needing to persist at a later time, and it conserves RAM usage. OTOH, it would be a lot faster to just loop through the array marshaling out item by item. Or perhaps just doing the entire thing in one blurt. I'm expecting this array to eventually use over a GB of storage, so efficiency considerations are somewhat significant. gdbm solves some of these, by not dealing with the entire array at once, but only with the recently active instances. But it's slower, which I was trying to avoid by using an Array. However, I don't know just how many items the Array will need to hold, it could easily be in the millions, and will at least be in the hundreds of thousands. So pre-allocating isn't very desirable, but neither is adding instance by instance. (Besides, they won't necessarily come in order. It could well be that I'll get the 100,000th entry immediately after getting the 21st. No problem for a hash, but if I'm going to slow down to hash speed, I should probably just go directly to gdbm.)
on 2012-11-13 02:03
Brian Candler wrote: >>> a > => [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz] > > As others have said: using a Hash may be more efficient, if there are > large gaps - but then you may have to sort the keys if you want to > iterate over it in order. It depends on your use case. > Rereading your post, I see I missed a crucial part of it: a[10] = :baz => a == [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz] And THAT was the part I really needed to get the first time.
on 2012-11-13 11:45
On Mon, Nov 12, 2012 at 10:12 PM, Charles Hixson <charleshixsn@earthlink.net> wrote: > Brian Candler wrote: >> As others have said: using a Hash may be more efficient, if there are >> large gaps - but then you may have to sort the keys if you want to >> iterate over it in order. It depends on your use case. >> > Hashes are slow compared to direct indexing. Did you measure it? If not, please do. It's easy to arrive at wrong conclusions by assuming what may be truth in other contexts in true in a new context as well. > I'm expecting this array to eventually use over a GB of storage, so Are you talking about memory or number of instances? I am asking because both are not the same and it may be more difficult to asses memory consumption than you think. > efficiency considerations are somewhat significant. gdbm solves some of > these, by not dealing with the entire array at once, but only with the > recently active instances. But it's slower, which I was trying to avoid by > using an Array. However, I don't know just how many items the Array will > need to hold, it could easily be in the millions, and will at least be in > the hundreds of thousands. So how do you know we are talking about "over a GB of storage"? > So pre-allocating isn't very desirable, but > neither is adding instance by instance. (Besides, they won't necessarily > come in order. It could well be that I'll get the 100,000th entry > immediately after getting the 21st. No problem for a hash, but if I'm going > to slow down to hash speed, I should probably just go directly to gdbm.) Again: measure before you judge. It's really much better to base decisions on facts than on speculation. Cheers robert
on 2012-11-13 14:13
Charles Hixson wrote in post #1084133: > Hashes are slow compared to direct indexing. Says who? > Push might be a good > answer, but not if it allocates storage item by item. It doesn't. Look at the MRI C implementation if you really care: the underlying array storage size is increased in chunks. Ditto for Strings. void rb_ary_store(ary, idx, val) ... if (idx >= RARRAY(ary)->aux.capa) { long new_capa = RARRAY(ary)->aux.capa / 2; if (new_capa < ARY_DEFAULT_SIZE) { new_capa = ARY_DEFAULT_SIZE; } if (new_capa >= ARY_MAX_SIZE - idx) { new_capa = (ARY_MAX_SIZE - idx) / 2; } new_capa += idx; REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa); RARRAY(ary)->aux.capa = new_capa; } [from ruby-1.8.7-p352/array.c] > Allocation of memory is another expensive operation Says who? Expensive compared to all the other things going on in your program? Your program will probably be creating objects left-right-and-centre, and having them garbage collected too. I think you are approaching this the wrong way. I suggest you first write some code that works, using the most simple and logical code. *Then* optimise it if required: and optimise by measuring. Even for experienced programmers, usually the hot-spot turns out not to be where they guess it might be. And as you have already implied, you might need to optimise for memory usage over speed. From a speed point of view, I'm pretty confident in saying that the implementation of Hash (in C) versus the implementation of Array (in C) is unlikely to be the bottleneck in your program. An Array may use less memory than a Hash - but since the Array or Hash holds only object references (essentially pointers), both may be small compared to the memory allocated to the objects you are holding within them.
on 2012-11-14 06:31
Robert Klemme wrote: > a new context as well. > >> I'm expecting this array to eventually use over a GB of storage, so > Are you talking about memory or number of instances? I am asking > because both are not the same and it may be more difficult to asses > memory consumption than you think. I'm talking about data size. The number of instances would probably be about 1/20 of the number of items. (Varies because the items themselves contain Arrays that vary in size.) But please note that this is a rough estimate, and could be an underestimate (though I've tried to make a generous estimate). > efficiency considerations are somewhat significant. gdbm solves some of > these, by not dealing with the entire array at once, but only with the > recently active instances. But it's slower, which I was trying to avoid by > using an Array. However, I don't know just how many items the Array will > need to hold, it could easily be in the millions, and will at least be in > the hundreds of thousands. > So how do you know we are talking about "over a GB of storage"? It's my best estimate before I actually build the thing. I hope it's a generous estimate, as smaller amounts of data would be easier to handle, but I don't want to say that I haven't underestimated. > > robert > > -- > remember.guy do |as, often| as.you_can - without end > http://blog.rubybestpractices.com/ I've measured it in several languages, and occasionally implemented the hash algorithm from scratch in C (two or three different ways of handling collisions...but Unicode means that I don't want to use C, though pointers was my original reason). I'll admit I haven't measured it yet in Ruby, but hash cannot be fast, though direct indexing can be slow. It's in the nature of the algorithm. It can be quite fast compared to a tree lookup, but not compared to a direct index. FWIW, the answer I was looking for is that Arrays are automatically extended to the last index inserted into. This probably implies that Arrays use some sort of linked block index, though another possibility is that it just allocates a honking big chunk of memory, and does a copy whenever it needs to resize the array. This seems unlikely. Also silly. So I'll bet on a linked block index. (N.B.: Some languages DO just allocate large blocks of RAM for their arrays, but most of the ones that I know do that are compiler languages, and they've got reasonable grounds...big blocks are more efficient to access, and if you're a compiler the code you compile can be as efficient as you are. So the trade-offs are different. Even then they try to avoid copying.) P.S.: Developing this in Ruby *IS* a measurement test. If I can do it in Ruby, then I've guessed the right size. But picking a bad algorithm to implement wouldn't prove much of anything. If it doesn't work, I'll need to fall back to a faster language. Probably D. That's one reason I want to avoid gdbm. Not only is RAM faster, but it's an easier conversion to another language if it turns out that I need a faster language. Ruby's a lot faster to develop in, so it's not a bad decision even if Ruby turns out to be too slow (or not able to handle the required amount of memory), because then I'll have a pretty much written application that I can translate to a faster language. (One reason for D is that there are lots of places where I can execute loops in parallel easily. That would be harder in Ruby, Python, etc., though in most ways they are easier.)
on 2012-11-14 07:07
On Nov 13, 2012, at 21:30 , Charles Hixson <charleshixsn@earthlink.net>
wrote:
> FWIW, the answer I was looking for is that Arrays are automatically extended to
the last index inserted into. This probably implies that Arrays use some sort of
linked block index, though another possibility is that it just allocates a honking
big chunk of memory, and does a copy whenever it needs to resize the array. This
seems unlikely. Also silly. So I'll bet on a linked block index.
Welcome to silly... tho I'm not sure why you think that.
on 2012-11-14 08:20
Ryan Davis wrote: > On Nov 13, 2012, at 21:30 , Charles Hixson<charleshixsn@earthlink.net> wrote: > >> FWIW, the answer I was looking for is that Arrays are automatically extended to the last index inserted into. This probably implies that Arrays use some sort of linked block index, though another possibility is that it just allocates a honking big chunk of memory, and does a copy whenever it needs to resize the array. This seems unlikely. Also silly. So I'll bet on a linked block index. > Welcome to silly... tho I'm not sure why you think that. That's not what the code I saw showed...admittedly my C skill are very poor. If it DOES do massive copying, I'm going to need a very different approach. And I'm not sure what would work. But why wouldn't Ruby use memory mapping? Copying a GB of memory around doesn't sound like a feasible approach, even though this would be done at full C speed. If that's true, then using a database is the only way I can think of that would work, and there are several reasons why I want to avoid that.
on 2012-11-14 09:00
On Wed, Nov 14, 2012 at 6:30 AM, Charles Hixson <charleshixsn@earthlink.net> wrote: > I've measured it in several languages, and occasionally implemented the hash > algorithm from scratch in C (two or three different ways of handling > collisions...but Unicode means that I don't want to use C, though pointers > was my original reason). I'll admit I haven't measured it yet in Ruby, but > hash cannot be fast, though direct indexing can be slow. It's in the nature > of the algorithm. It can be quite fast compared to a tree lookup, but not > compared to a direct index. That is still speculation. > FWIW, the answer I was looking for is that Arrays are automatically extended > to the last index inserted into. This probably implies that Arrays use some > sort of linked block index, though another possibility is that it just > allocates a honking big chunk of memory, and does a copy whenever it needs > to resize the array. This seems unlikely. Please stop speculating and verify the facts (i.e. check Ruby source code). You'll do yourself a big favor. Note also that you can still implement a type with the same interface as Hash or Array and drop it into your implementation later if you find the std lib types to be too slow for your purposes. Cheers robert
on 2012-11-14 09:46
Charles Hixson wrote in post #1084366: > FWIW, the answer I was looking for is that Arrays are automatically > extended to the last index inserted into. This probably implies that > Arrays use some sort of linked block index, though another possibility > is that it just allocates a honking big chunk of memory, and does a copy > whenever it needs to resize the array. This seems unlikely. Also > silly. So I'll bet on a linked block index. Then you lose the bet. I already posted the code. The code says: when you set an element beyond the end, the underlying array storage capacity (aux.capa) is increased to the new end index plus half the current capacity; new storage is allocated and the old data copied into the new (read the manpage for "realloc"). So the data is copied but the number of copy operations goes down as the array grows. This is of course *way* more efficient for most purposes than an linked list implementation, because the memory is contiguous and all subsequent get/set operations are then a simple pointer offsets rather than walking a list (even a list of chunks). > P.S.: Developing this in Ruby *IS* a measurement test. If I can do it > in Ruby, then I've guessed the right size. But picking a bad algorithm > to implement wouldn't prove much of anything. You're talking about two different things. Picking a bad algorithm is of course a bad idea, and will give you bad performance (relative to a good algorithm) in any language. But what you are discussing is the same algorithm, simply whether c[k] is slower or more memory inefficient if collection c is an Array or a Hash. This is almost certainly irrelevant in the context of your application. Code using the obvious approach: if the keys are all non-negative integers and not too sparse, use an Array. Otherwise use a Hash. If it doesn't all fit into RAM then this is probably due to the space used by the objects themselves, not the collection (I notice you haven't asked how much RAM a String object uses). If it doesn't run fast enough then the set/get operations of Array or Hash are almost certainly not the culprit.
on 2012-11-14 20:07
Brian Candler wrote: > rb_ary_store(ary, idx, val) > new_capa += idx; > > memory than a Hash - but since the Array or Hash holds only object > references (essentially pointers), both may be small compared to the > memory allocated to the objects you are holding within them. > Thank you. I'm going to need to think about the implications of that. It's looking like I need to change something basic in my design. An Array was the simple and most logical code, but if Arrays are as slow as Hashes (or nearly so) something is not appropriate for this project. Perhaps I WILL need to use a database.
on 2012-11-14 21:14
On Nov 14, 2012, at 11:06 , Charles Hixson <charleshixsn@earthlink.net>
wrote:
> I'm going to need to think about the implications of that. It's looking like I
need to change something basic in my design. An Array was the simple and most
logical code, but if Arrays are as slow as Hashes (or nearly so) something is not
appropriate for this project. Perhaps I WILL need to use a database.
Like everyone else on this thread, I'm going to have to say:
would you PLEASE just fucking measure something already???
require 'benchmark'
it's not hard.
on 2012-11-14 21:49
Brian Candler wrote: > Charles Hixson wrote in post #1084133: >> Hashes are slow compared to direct indexing. > Says who? This is a commonly true statement for a wide variety of languages. I am quite likely to be translating this work into another language. In the other language I am certain that it is a true statement. Unless Array access is significantly slower than Hash access (which I cannot believe) this is the way to proceed. > >> Push might be a good >> answer, but not if it allocates storage item by item. > It doesn't. Look at the MRI C implementation if you really care: the > underlying array storage size is increased in chunks. Ditto for Strings. That's ok. push was the wrong answer anyway. The right answer was that Arrays automatically extend to encompass the largest index that you use to store into them. > if (new_capa>= ARY_MAX_SIZE - idx) { > new_capa = (ARY_MAX_SIZE - idx) / 2; > } > new_capa += idx; > REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa); > RARRAY(ary)->aux.capa = new_capa; > } > > [from ruby-1.8.7-p352/array.c] Sorry, my C isn't that good, and there are a lot of undefined terms in that snippet. I can't really say I understand it. I can usually guess what's going on, but undefined macros make me quite unsure...and I try to avoid pointers when writing code, because I want to understand it later. For that matter, I usually avoid macros, too. I realize that this is a common C coding style, but it's one reason I dislike C. I'd prefer D, FORTRAN, or even Ada or Eiffel. (Yaa...the code is what it is. But to me what it is is confusing.) > >> Allocation of memory is another expensive operation > Says who? Expensive compared to all the other things going on in your > program? Your program will probably be creating objects > left-right-and-centre, and having them garbage collected too. Among system calls, allocation of memory is one of the more expensive. OTOH, the comment earlier about Array automatically extending itself answers this as there are steps automatically taken to minimize the number of allocations of memory. I don't think that using push repeatedly is a good idea (and in any case, it doesn't fit with my logic, unless there is something about the system that demands that it be done that way, which there isn't). > > I think you are approaching this the wrong way. I suggest you first > write some code that works, using the most simple and logical code. > *Then* optimise it if required: and optimise by measuring. Even for > experienced programmers, usually the hot-spot turns out not to be where > they guess it might be. And as you have already implied, you might need > to optimise for memory usage over speed. If I design the basic framework of the application around a poor design, I'll have to rewrite the entire thing. This is something to get right at the start. It's not a premature optimization. There are lots of places where I'm taking to "do enough and patch it later" approach. The reason this isn't one, is because this needs to be handled now. > > > From a speed point of view, I'm pretty confident in saying that the > implementation of Hash (in C) versus the implementation of Array (in C) > is unlikely to be the bottleneck in your program. An Array may use less > memory than a Hash - but since the Array or Hash holds only object > references (essentially pointers), both may be small compared to the > memory allocated to the objects you are holding within them. > OK. If it's not the bottleneck, that will be very good. Is there ANY reason that Hash is a better choice? I don't believe that there is, but I could be wrong. Certainly if they are the same speed, and use about the same amount of memory, I would prefer to use an Array. My expectation is that Array will be faster as well as using significantly less RAM, but were they the same, I would still prefer to use an Array, as it will match more nearly the preferred order of I/O (among other reasons). It is true that my reasons for preferring an Array are all minor, and if Array has some real drawback, then I would readily switch to Hash. But I haven't heard of one. I had some questions about how to use it, but those have, I believe, been answered. P.S.: I understand that if I use a Hash, item will be retrieved in the order entered when I iterate through them. This isn't my desired result. I want to retrieve them in index # order. I'd rather not have to sort them. P.P.S: A part of my design is driven by my desire to have a language independent form of serialization...and YAML won't work for my purposes. I may decide to have an isomorphic realization of this in a binary serialization that isn't language independent, but that WOULD be premature optimization.
on 2012-11-15 03:29
On 15 November 2012 06:48, Charles Hixson <charleshixsn@earthlink.net>wrote: > Sorry, my C isn't that good, and there are a lot of undefined terms in > that snippet. I can't really say I understand it. I can usually guess > what's going on, but undefined macros make me quite unsure...and I try to > avoid pointers when writing code, because I want to understand it later. > For that matter, I usually avoid macros, too. I realize that this is a > common C coding style, but it's one reason I dislike C. I'd prefer D, > FORTRAN, or even Ada or Eiffel. (Yaa...the code is what it is. But to me > what it is is confusing.) > This is going to sound elitist and snobby, but I'm pretty sure that's the whole problem right there. If you avoid pointers in C because they're confusing, then do not ever write C. Pointers are a fundamental part of the language and how you get things done in it, and actually quite simple when you're thinking at the C level. It's not a coding style, it's how C works. And yes, without pointers preallocated arrays will often be faster than everything else, because without pointers you're losing the ability to do operations like memory copying and collection restructuring and object comparisons by reference; reading and writing entire objects is always slow. I don't know the Ruby C API at all, but my guess at interpreting the snippet would be: if (idx>= RARRAY(ary)->aux.capa) { /* if the index is beyond the RARRAY (Ruby Array)'s auxiliary capacity (i.e. total allocated space?) */ long new_capa = RARRAY(ary)->aux.capa / 2; /* new_capa = current capa / 2 */ if (new_capa< ARY_DEFAULT_SIZE) { /* clamp to some globally defined minimum */ new_capa = ARY_DEFAULT_SIZE; /* (e.g. don't extend by 3, when 512 would be more sensible?) */ } if (new_capa>= ARY_MAX_SIZE - idx) { /* clamp to some globally understood maximum */ new_capa = (ARY_MAX_SIZE - idx) / 2; /* i.e. half the available maximum space */ } new_capa += idx; REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa); /* some realloc() macro -- looks exactly like realloc() to me */ RARRAY(ary)->aux.capa = new_capa; /* tell the array what its auxiliary capacity has become */ } If I design the basic framework of the application around a poor design, > I'll have to rewrite the entire thing. This is something to get right at > the start. It's not a premature optimization. There are lots of places > where I'm taking to "do enough and patch it later" approach. The reason > this isn't one, is because this needs to be handled now. Indeed. Write the algorithm. Optimise the algorithm. Make sure you're using sensible algorithmic techniques. Which container you're using isn't part of that algorithm. Again: which container you're using isn't part of the algorithm, especially in an OO context like Ruby. In the algorithm you say "add to the container" or "look up thingy in the container." Then when you've implemented it, if the algorithm is gravy but execution takes too long, maybe think about alternative _containers_, based on how your (already good) algorithm is using it. E.g. sparse storage suggests not using an array, random inserts but sorted iteration implies some sort of heap or something, etc. The main point is, the outer algorithm is the big deal; you abstract the container with a little black box called "the container" until you know more about how the whole system works. Ruby is a high enough level language that you can abstract the container's inner workings like this safely. It's also high enough that you can't really predict the behaviour without _trying_ it (or without having written dozens of other similar programs and getting a feel for how various objects behave.) So again, in summary: write the part of the algorithm you can control, and make it as good as can be. Then use empirical testing to find the best objects to fill in those black boxes. -- Matthew Kerwin, B.Sc (CompSci) (Hons) http://matthew.kerwin.net.au/ ABN: 59-013-727-651 "You'll never find a programming language that frees you from the burden of clarifying your ideas." - xkcd
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.