Extending Array instances

I’m trying to figure out a good way to extend an Array, when the items
may not be added in index order, and I don’t know the final highest
index. Does anyone have any suggestions?

What occurs to me is that I could double the length of the array
whenever I want to add more than one element by doing something like:
a = a + Array.new(a.length)
Naturally, it would need to be a bit more complicated, as, e.g., I’d
need to ensure that the resulting array actually would hold the
proposed index. Still, something only a little more complicated should
work, but is that the best approach?
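A sketch of that doubling approach (store_with_growth is a hypothetical helper, not anything from the standard library):

```ruby
# Hypothetical helper: keep doubling the array (minimum growth of 1)
# until it can hold the proposed index, then store the value.
def store_with_growth(a, idx, val)
  a.concat(Array.new([a.length, 1].max)) while a.length <= idx
  a[idx] = val
  a
end

a = [:foo]
store_with_growth(a, 5, :bar)
# a is now [:foo, nil, nil, nil, nil, :bar, nil, nil]
```

As the replies below show, this bookkeeping turns out to be unnecessary in Ruby.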

push?

I’ve used that when I didn’t know (or care) about the length of the
array.

Wayne

Charles H. wrote in post #1084111:

I’m trying to figure out a good way to extend an Array, when the items
may not be added in index order, and I don’t know the final highest
index. Does anyone have any suggestions?

This happens automatically.

a = [:foo, :bar]
=> [:foo, :bar]

a[10] = :baz
=> :baz

a
=> [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.
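A minimal illustration of the sorted-keys iteration described above:

```ruby
# Sparse data in a Hash; sort the keys to iterate in index order.
h = {}
h[10] = :baz
h[0]  = :foo

h.keys.sort.each do |k|
  puts "#{k} => #{h[k]}"   # prints "0 => foo" then "10 => baz"
end
```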

I’d just use a hash.

– Matma R.

Brian C. wrote:

a
=> [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.

Rereading your post, I see I missed a crucial part of it:

a[10] = :baz
=> a == [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

And THAT was the part I really needed to get the first time.

Brian C. wrote:

a
=> [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.

Hashes are slow compared to direct indexing. Push might be a good
answer, but not if it allocates storage item by item; allocation of
memory is another expensive operation. But I am considering using
gdbm instead of an Array. That is slow like a hash, but it eliminates
the need to persist the data at a later time, and it conserves RAM.
OTOH, it would be a lot faster to just loop through the array,
marshaling it out item by item. Or perhaps doing the entire thing in
one blurt.

I’m expecting this array to eventually use over a GB of storage, so
efficiency considerations are somewhat significant. gdbm solves some of
these, by not dealing with the entire array at once, but only with the
recently active instances. But it’s slower, which I was trying to avoid
by using an Array. However, I don’t know just how many items the Array
will need to hold, it could easily be in the millions, and will at least
be in the hundreds of thousands. So pre-allocating isn’t very
desirable, but neither is adding instance by instance. (Besides, they
won’t necessarily come in order. It could well be that I’ll get the
100,000th entry immediately after getting the 21st. No problem for a
hash, but if I’m going to slow down to hash speed, I should probably
just go directly to gdbm.)

On Mon, Nov 12, 2012 at 10:12 PM, Charles H.
[email protected] wrote:

Brian C. wrote:

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.

Hashes are slow compared to direct indexing.

Did you measure it? If not, please do. It’s easy to arrive at wrong
conclusions by assuming that what may be true in other contexts is true
in a new context as well.

I’m expecting this array to eventually use over a GB of storage, so

Are you talking about memory or number of instances? I am asking
because the two are not the same, and it may be more difficult to assess
memory consumption than you think.
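One concrete way to check, on MRI 1.9.2 or later (the 1.8 snippets elsewhere in this thread predate the API), is ObjectSpace.memsize_of; note that it reports only an object's own storage, not the objects it references:

```ruby
require 'objspace'

a = Array.new(10_000) { |i| i.to_s }

# Bytes used by the array's own slot storage (pointers), not the strings.
slots = ObjectSpace.memsize_of(a)

# A rough total that also counts the referenced strings.
total = a.inject(slots) { |sum, s| sum + ObjectSpace.memsize_of(s) }

puts slots
puts total
```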

efficiency considerations are somewhat significant. gdbm solves some of
these, by not dealing with the entire array at once, but only with the
recently active instances. But it’s slower, which I was trying to avoid by
using an Array. However, I don’t know just how many items the Array will
need to hold, it could easily be in the millions, and will at least be in
the hundreds of thousands.

So how do you know we are talking about “over a GB of storage”?

So pre-allocating isn’t very desirable, but
neither is adding instance by instance. (Besides, they won’t necessarily
come in order. It could well be that I’ll get the 100,000th entry
immediately after getting the 21st. No problem for a hash, but if I’m going
to slow down to hash speed, I should probably just go directly to gdbm.)

Again: measure before you judge. It’s really much better to base
decisions on facts than on speculation.

Cheers

robert

Charles H. wrote in post #1084133:

Hashes are slow compared to direct indexing.

Says who?

Push might be a good
answer, but not if it allocates storage item by item.

It doesn’t. Look at the MRI C implementation if you really care: the
underlying array storage size is increased in chunks. Ditto for Strings.

void
rb_ary_store(ary, idx, val)
{
    /* ... */
    if (idx >= RARRAY(ary)->aux.capa) {
        long new_capa = RARRAY(ary)->aux.capa / 2;

        if (new_capa < ARY_DEFAULT_SIZE) {
            new_capa = ARY_DEFAULT_SIZE;
        }
        if (new_capa >= ARY_MAX_SIZE - idx) {
            new_capa = (ARY_MAX_SIZE - idx) / 2;
        }
        new_capa += idx;
        REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
        RARRAY(ary)->aux.capa = new_capa;
    }
    /* ... */
}

[from ruby-1.8.7-p352/array.c]
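For anyone who finds the C hard to read, the capacity-growth rule above transliterates into Ruby roughly as follows (grown_capacity is an illustrative helper; the constant values are stand-ins for the ones defined in array.c):

```ruby
ARY_DEFAULT_SIZE = 16     # stand-in; the real value lives in array.c
ARY_MAX_SIZE     = 2**30  # stand-in

# Mirror of the growth logic in rb_ary_store: grow by half the current
# capacity past the new index, clamped to a minimum and a maximum.
def grown_capacity(current_capa, idx)
  new_capa = current_capa / 2
  new_capa = ARY_DEFAULT_SIZE if new_capa < ARY_DEFAULT_SIZE
  new_capa = (ARY_MAX_SIZE - idx) / 2 if new_capa >= ARY_MAX_SIZE - idx
  new_capa + idx
end

grown_capacity(16, 100)    # => 116 (minimum clamp applies)
grown_capacity(1000, 2000) # => 2500 (half the old capacity, past the index)
```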

Allocation of memory is another expensive operation

Says who? Expensive compared to all the other things going on in your
program? Your program will probably be creating objects
left-right-and-centre, and having them garbage collected too.

I think you are approaching this the wrong way. I suggest you first
write some code that works, using the most simple and logical code.
Then optimise it if required: and optimise by measuring. Even for
experienced programmers, usually the hot-spot turns out not to be where
they guess it might be. And as you have already implied, you might need
to optimise for memory usage over speed.

From a speed point of view, I’m pretty confident in saying that the
implementation of Hash (in C) versus the implementation of Array (in C)
is unlikely to be the bottleneck in your program. An Array may use less
memory than a Hash - but since the Array or Hash holds only object
references (essentially pointers), both may be small compared to the
memory allocated to the objects you are holding within them.

Robert K. wrote:

I’m expecting this array to eventually use over a GB of storage, so

Are you talking about memory or number of instances? I am asking
because the two are not the same and it may be more difficult to assess
memory consumption than you think.

I’m talking about data size. The number of instances would probably be
about 1/20 of the number of items. (Varies because the items themselves
contain Arrays that vary in size.) But please note that this is a rough
estimate, and could be an underestimate (though I’ve tried to make a
generous estimate).

efficiency considerations are somewhat significant. gdbm solves some of
these, by not dealing with the entire array at once, but only with the
recently active instances. But it’s slower, which I was trying to avoid by
using an Array. However, I don’t know just how many items the Array will
need to hold, it could easily be in the millions, and will at least be in
the hundreds of thousands.

So how do you know we are talking about “over a GB of storage”?

It’s my best estimate before I actually build the thing. I hope it’s a
generous estimate, as smaller amounts of data would be easier to handle,
but I don’t want to say that I haven’t underestimated.

robert


remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

I’ve measured it in several languages, and occasionally implemented the
hash algorithm from scratch in C (two or three different ways of
handling collisions…but Unicode means that I don’t want to use C,
though pointers were my original reason). I’ll admit I haven’t measured
it yet in Ruby, but a hash cannot be fast, though direct indexing can be
slow. It’s in the nature of the algorithm. It can be quite fast
compared to a tree lookup, but not compared to a direct index.

FWIW, the answer I was looking for is that Arrays are automatically
extended to the last index inserted into. This probably implies that
Arrays use some sort of linked block index, though another possibility
is that it just allocates a honking big chunk of memory and does a copy
whenever it needs to resize the array. This seems unlikely. Also
silly. So I’ll bet on a linked block index. (N.B.: Some languages DO
just allocate large blocks of RAM for their arrays, but most of the ones
I know of that do are compiled languages, and they’ve got reasonable
grounds…big blocks are more efficient to access, and a compiler can make
the generated code as efficient as the layout allows. So the trade-offs
are different. Even then they try to avoid copying.)

P.S.: Developing this in Ruby IS a measurement test. If I can do it
in Ruby, then I’ve guessed the right size. But picking a bad algorithm
to implement wouldn’t prove much of anything. If it doesn’t work, I’ll
need to fall back to a faster language. Probably D. That’s one reason
I want to avoid gdbm. Not only is RAM faster, but it’s an easier
conversion to another language if it turns out that I need a faster
language.

Ruby’s a lot faster to develop in, so it’s not a bad decision even if
Ruby turns out to be too slow (or not able to handle the required amount
of memory), because then I’ll have a pretty much written application
that I can translate to a faster language. (One reason for D is that
there are lots of places where I can execute loops in parallel easily.
That would be harder in Ruby, Python, etc., though in most ways they are
easier.)

Ryan D. wrote:

On Nov 13, 2012, at 21:30 , Charles H. [email protected] wrote:

FWIW, the answer I was looking for is that Arrays are automatically extended to
the last index inserted into. This probably implies that Arrays use some sort of
linked block index, though another possibility is that it just allocates a honking
big chunk of memory, and does a copy whenever it needs to resize the array. This
seems unlikely. Also silly. So I’ll bet on a linked block index.

Welcome to silly… tho I’m not sure why you think that.

That’s not what the code I saw showed…admittedly my C skills are very
poor. If it DOES do massive copying, I’m going to need a very different
approach. And I’m not sure what would work. But why wouldn’t Ruby use
memory mapping? Copying a GB of memory around doesn’t sound like a
feasible approach, even though this would be done at full C speed.

If that’s true, then using a database is the only way I can think of
that would work, and there are several reasons why I want to avoid that.

On Wed, Nov 14, 2012 at 6:30 AM, Charles H.
[email protected] wrote:

I’ve measured it in several languages, and occasionally implemented the hash
algorithm from scratch in C (two or three different ways of handling
collisions…but Unicode means that I don’t want to use C, though pointers
was my original reason). I’ll admit I haven’t measured it yet in Ruby, but
hash cannot be fast, though direct indexing can be slow. It’s in the nature
of the algorithm. It can be quite fast compared to a tree lookup, but not
compared to a direct index.

That is still speculation.

FWIW, the answer I was looking for is that Arrays are automatically extended
to the last index inserted into. This probably implies that Arrays use some
sort of linked block index, though another possibility is that it just
allocates a honking big chunk of memory, and does a copy whenever it needs
to resize the array. This seems unlikely.

Please stop speculating and verify the facts (i.e. check Ruby source
code). You’ll do yourself a big favor.

Note also that you can still implement a type with the same interface
as Hash or Array and drop it into your implementation later if you
find the std lib types to be too slow for your purposes.

Cheers

robert
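Robert's drop-in idea can be sketched like this (SparseStore is hypothetical): code against a minimal [] / []= interface, and the backing container can be swapped later without touching the algorithm.

```ruby
# Hypothetical wrapper: same interface whether backed by a Hash or an Array.
class SparseStore
  def initialize(backing = {})
    @data = backing
  end

  def [](idx)
    @data[idx]
  end

  def []=(idx, val)
    @data[idx] = val
  end

  # Yield (index, value) pairs in index order, skipping empty slots.
  def each_in_order
    if @data.is_a?(Hash)
      @data.keys.sort.each { |k| yield k, @data[k] }
    else
      @data.each_with_index { |v, i| yield i, v unless v.nil? }
    end
  end
end

s = SparseStore.new
s[100_000] = :late   # entries can arrive in any order
s[21]      = :early
```

Charles’ out-of-order case (the 100,000th entry arriving right after the 21st) works unchanged with either backing.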

On Nov 13, 2012, at 21:30 , Charles H. [email protected]
wrote:

FWIW, the answer I was looking for is that Arrays are automatically extended to
the last index inserted into. This probably implies that Arrays use some sort of
linked block index, though another possibility is that it just allocates a honking
big chunk of memory, and does a copy whenever it needs to resize the array. This
seems unlikely. Also silly. So I’ll bet on a linked block index.

Welcome to silly… tho I’m not sure why you think that.

Charles H. wrote in post #1084366:

FWIW, the answer I was looking for is that Arrays are automatically
extended to the last index inserted into. This probably implies that
Arrays use some sort of linked block index, though another possibility
is that it just allocates a honking big chunk of memory, and does a copy
whenever it needs to resize the array. This seems unlikely. Also
silly. So I’ll bet on a linked block index.

Then you lose the bet. I already posted the code.

The code says: when you set an element beyond the end, the underlying
array storage capacity (aux.capa) is increased to the new end index plus
half the current capacity; new storage is allocated and the old data
copied into the new (read the manpage for “realloc”). So the data is
copied but the number of copy operations goes down as the array grows.

This is of course way more efficient for most purposes than a linked
list implementation, because the memory is contiguous and all subsequent
get/set operations are then simple pointer offsets rather than walking
a list (even a list of chunks).

P.S.: Developing this in Ruby IS a measurement test. If I can do it
in Ruby, then I’ve guessed the right size. But picking a bad algorithm
to implement wouldn’t prove much of anything.

You’re talking about two different things. Picking a bad algorithm is of
course a bad idea, and will give you bad performance (relative to a good
algorithm) in any language.

But what you are discussing is the same algorithm, simply whether c[k]
is slower or more memory inefficient if collection c is an Array or a
Hash. This is almost certainly irrelevant in the context of your
application.

Code it using the obvious approach: if the keys are all non-negative
integers and not too sparse, use an Array. Otherwise use a Hash.

If it doesn’t all fit into RAM then this is probably due to the space
used by the objects themselves, not the collection (I notice you haven’t
asked how much RAM a String object uses). If it doesn’t run fast enough
then the set/get operations of Array or Hash are almost certainly not
the culprit.

On Nov 14, 2012, at 11:06 , Charles H. [email protected]
wrote:

I’m going to need to think about the implications of that. It’s looking like I
need to change something basic in my design. An Array was the simple and most
logical code, but if Arrays are as slow as Hashes (or nearly so) something is not
appropriate for this project. Perhaps I WILL need to use a database.

Like everyone else on this thread, I’m going to have to say:

would you PLEASE just fucking measure something already???

require 'benchmark'

it’s not hard.
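Such a measurement might look like the following sketch; absolute numbers will vary by machine and Ruby version, so only the ratio is interesting:

```ruby
require 'benchmark'

N = 1_000_000
arr = Array.new(N) { |i| i }
hsh = {}
N.times { |i| hsh[i] = i }

# Compare a million indexed reads from each container.
Benchmark.bm(7) do |x|
  x.report("Array:") { N.times { |i| arr[i] } }
  x.report("Hash:")  { N.times { |i| hsh[i] } }
end
```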

Brian C. wrote:


memory than a Hash - but since the Array or Hash holds only object
references (essentially pointers), both may be small compared to the
memory allocated to the objects you are holding within them.

Thank you.

I’m going to need to think about the implications of that. It’s looking
like I need to change something basic in my design. An Array was the
simple and most logical code, but if Arrays are as slow as Hashes (or
nearly so) something is not appropriate for this project. Perhaps I
WILL need to use a database.

Brian C. wrote:

Charles H. wrote in post #1084133:

Hashes are slow compared to direct indexing.

Says who?

This is a commonly true statement for a wide variety of languages. I am
quite likely to be translating this work into another language. In the
other language I am certain that it is a true statement. Unless Array
access is significantly slower than Hash access (which I cannot believe)
this is the way to proceed.

Push might be a good
answer, but not if it allocates storage item by item.

It doesn’t. Look at the MRI C implementation if you really care: the
underlying array storage size is increased in chunks. Ditto for Strings.

That’s ok. push was the wrong answer anyway. The right answer was that
Arrays automatically extend to encompass the largest index that you use
to store into them.

    if (new_capa >= ARY_MAX_SIZE - idx) {
        new_capa = (ARY_MAX_SIZE - idx) / 2;
    }
    new_capa += idx;
    REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
    RARRAY(ary)->aux.capa = new_capa;
}

[from ruby-1.8.7-p352/array.c]

Sorry, my C isn’t that good, and there are a lot of undefined terms in
that snippet. I can’t really say I understand it. I can usually guess
what’s going on, but undefined macros make me quite unsure…and I try
to avoid pointers when writing code, because I want to understand it
later. For that matter, I usually avoid macros, too. I realize that
this is a common C coding style, but it’s one reason I dislike C. I’d
prefer D, FORTRAN, or even Ada or Eiffel. (Yaa…the code is what it
is. But to me what it is is confusing.)

Allocation of memory is another expensive operation

Says who? Expensive compared to all the other things going on in your
program? Your program will probably be creating objects
left-right-and-centre, and having them garbage collected too.

Among system calls, allocation of memory is one of the more expensive.
OTOH, the earlier comment about Array automatically extending itself
answers this, as steps are automatically taken to minimize the number
of memory allocations. I don’t think that using push repeatedly is a
good idea (and in any case, it doesn’t fit with my logic, unless there
is something about the system that demands that it be done that way,
which there isn’t).

I think you are approaching this the wrong way. I suggest you first
write some code that works, using the most simple and logical code.
Then optimise it if required: and optimise by measuring. Even for
experienced programmers, usually the hot-spot turns out not to be where
they guess it might be. And as you have already implied, you might need
to optimise for memory usage over speed.

If I design the basic framework of the application around a poor design,
I’ll have to rewrite the entire thing. This is something to get right
at the start. It’s not a premature optimization. There are lots of
places where I’m taking the “do enough and patch it later” approach.
The reason this isn’t one is that this needs to be handled now.

From a speed point of view, I’m pretty confident in saying that the
implementation of Hash (in C) versus the implementation of Array (in C)
is unlikely to be the bottleneck in your program. An Array may use less
memory than a Hash - but since the Array or Hash holds only object
references (essentially pointers), both may be small compared to the
memory allocated to the objects you are holding within them.

OK. If it’s not the bottleneck, that will be very good. Is there ANY
reason that Hash is a better choice? I don’t believe that there is, but
I could be wrong. Certainly if they are the same speed, and use about
the same amount of memory, I would prefer to use an Array. My
expectation is that Array will be faster as well as using significantly
less RAM, but were they the same, I would still prefer to use an Array,
as it will match more nearly the preferred order of I/O (among other
reasons). It is true that my reasons for preferring an Array are all
minor, and if Array has some real drawback, then I would readily switch
to Hash. But I haven’t heard of one. I had some questions about how to
use it, but those have, I believe, been answered.

P.S.: I understand that if I use a Hash, items will be retrieved in the
order entered when I iterate through them. This isn’t my desired
result. I want to retrieve them in index # order. I’d rather not have
to sort them.

P.P.S: A part of my design is driven by my desire to have a language
independent form of serialization…and YAML won’t work for my
purposes. I may decide to have an isomorphic realization of this in a
binary serialization that isn’t language independent, but that WOULD be
premature optimization.

On 15 November 2012 06:48, Charles H.
[email protected] wrote:

Sorry, my C isn’t that good, and there are a lot of undefined terms in
that snippet. I can’t really say I understand it. I can usually guess
what’s going on, but undefined macros make me quite unsure…and I try to
avoid pointers when writing code, because I want to understand it later.
For that matter, I usually avoid macros, too. I realize that this is a
common C coding style, but it’s one reason I dislike C. I’d prefer D,
FORTRAN, or even Ada or Eiffel. (Yaa…the code is what it is. But to me
what it is is confusing.)

This is going to sound elitist and snobby, but I’m pretty sure that’s the
whole problem right there. If you avoid pointers in C because they’re
confusing, then do not ever write C. Pointers are a fundamental part of
the language and how you get things done in it, and actually quite simple
when you’re thinking at the C level. It’s not a coding style, it’s how C
works. And yes, without pointers preallocated arrays will often be faster
than everything else, because without pointers you’re losing the ability
to do operations like memory copying and collection restructuring and
object comparisons by reference; reading and writing entire objects is
always slow.

I don’t know the Ruby C API at all, but my guess at interpreting the
snippet would be:

    if (idx >= RARRAY(ary)->aux.capa) {   /* if the index is beyond the
                                             RARRAY (Ruby Array)'s auxiliary
                                             capacity (i.e. total allocated
                                             space?) */
        long new_capa = RARRAY(ary)->aux.capa / 2;  /* new_capa = current
                                                       capa / 2 */

        if (new_capa < ARY_DEFAULT_SIZE) {   /* clamp to some globally
                                                defined minimum (e.g. don't
                                                extend by 3, when 512 would
                                                be more sensible?) */
            new_capa = ARY_DEFAULT_SIZE;
        }
        if (new_capa >= ARY_MAX_SIZE - idx) {   /* clamp to some globally
                                                   understood maximum, i.e.
                                                   half the available
                                                   maximum space */
            new_capa = (ARY_MAX_SIZE - idx) / 2;
        }
        new_capa += idx;
        REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);  /* some realloc()
                                                          macro -- looks
                                                          exactly like
                                                          realloc() to me */
        RARRAY(ary)->aux.capa = new_capa;   /* tell the array what its
                                               auxiliary capacity has
                                               become */
    }

If I design the basic framework of the application around a poor design,
I’ll have to rewrite the entire thing. This is something to get right at
the start. It’s not a premature optimization. There are lots of places
where I’m taking the “do enough and patch it later” approach. The reason
this isn’t one is that this needs to be handled now.

Indeed. Write the algorithm. Optimise the algorithm. Make sure you’re
using sensible algorithmic techniques. Which container you’re using
isn’t part of that algorithm.

Again: which container you’re using isn’t part of the algorithm,
especially in an OO context like Ruby. In the algorithm you say “add to
the container” or “look up thingy in the container.” Then when you’ve
implemented it, if the algorithm is gravy but execution takes too long,
maybe think about alternative containers, based on how your (already
good) algorithm is using it. E.g. sparse storage suggests not using an
array, random inserts but sorted iteration implies some sort of heap or
something, etc. The main point is, the outer algorithm is the big deal;
you abstract the container with a little black box called “the
container” until you know more about how the whole system works.

Ruby is a high enough level language that you can abstract the
container’s inner workings like this safely. It’s also high enough that
you can’t really predict the behaviour without trying it (or without
having written dozens of other similar programs and getting a feel for
how various objects behave.)

So again, in summary: write the part of the algorithm you can control,
and make it as good as can be. Then use empirical testing to find the
best objects to fill in those black boxes.


Matthew K., B.Sc (CompSci) (Hons)
http://matthew.kerwin.net.au/
ABN: 59-013-727-651

“You’ll never find a programming language that frees
you from the burden of clarifying your ideas.” - xkcd