Forum: Ruby Extending Array instances

Posted by Charles Hixson (Guest)
on 2012-11-12 18:17
(Received via mailing list)
I'm trying to figure out a good way to extend an Array, when the items
may not be added in index order, and I don't know the final highest
index.  Does anyone have any suggestions?

What occurs to me is that I could double the length of the array
whenever I want to add more than one element by doing something like:
a    =    a    + Array.new(a.length)
Naturally, it would need to be a bit more complicated, as, e.g., I'd
need to ensure that the resulting array actually *would* hold the
proposed index.  Still, something only a little more complicated should
_work_, but is that the best approach?
Posted by Wayne Brisette (Guest)
on 2012-11-12 18:44
(Received via mailing list)
push?

I've used that when I didn't know (or care) about the length of the 
array.

Wayne
Posted by Bartosz DziewoƄski (matmarex)
on 2012-11-12 18:52
(Received via mailing list)
I'd just use a hash.

-- Matma Rex
Posted by Brian Candler (candlerb)
on 2012-11-12 19:22
Charles Hixson wrote in post #1084111:
> I'm trying to figure out a good way to extend an Array, when the items
> may not be added in index order, and I don't know the final highest
> index.  Does anyone have any suggestions?

This happens automatically.

>> a = [:foo, :bar]
=> [:foo, :bar]
>> a[10] = :baz
=> :baz
>> a
=> [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

As others have said: using a Hash may be more efficient, if there are 
large gaps - but then you may have to sort the keys if you want to 
iterate over it in order. It depends on your use case.
Posted by Charles Hixson (Guest)
on 2012-11-12 22:12
(Received via mailing list)
Brian Candler wrote:
>>> a
> =>  [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]
>
> As others have said: using a Hash may be more efficient, if there are
> large gaps - but then you may have to sort the keys if you want to
> iterate over it in order. It depends on your use case.
>
Hashes are slow compared to direct indexing.  Push might be a good
answer, but not if it allocates storage item by item.  Allocation of
memory is another expensive operation.  But I *am* considering using
gdbm instead of an Array.  That is slow like a hash, but it eliminates
needing to persist at a later time, and it conserves RAM usage.  OTOH,
it would be a lot faster to just loop through the array marshaling out
item by item.  Or perhaps just doing the entire thing in one blurt.

I'm expecting this array to eventually use over a GB of storage, so
efficiency considerations are somewhat significant.  gdbm solves some of
these, by not dealing with the entire array at once, but only with the
recently active instances.  But it's slower, which I was trying to avoid
by using an Array.  However, I don't know just how many items the Array
will need to hold, it could easily be in the millions, and will at least
be in the hundreds of thousands.  So pre-allocating isn't very
desirable, but neither is adding instance by instance.  (Besides, they
won't necessarily come in order.  It could well be that I'll get the
100,000th entry immediately after getting the 21st.  No problem for a
hash, but if I'm going to slow down to hash speed, I should probably
just go directly to gdbm.)
Posted by Charles Hixson (Guest)
on 2012-11-13 02:03
(Received via mailing list)
Brian Candler wrote:
>>> a
> =>  [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]
>
> As others have said: using a Hash may be more efficient, if there are
> large gaps - but then you may have to sort the keys if you want to
> iterate over it in order. It depends on your use case.
>
Rereading your post, I see I missed a crucial part of it:

a[10] = :baz
=>  a == [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

And THAT was the part I really needed to get the first time.
Posted by Robert Klemme (robert_k78)
on 2012-11-13 11:45
(Received via mailing list)
On Mon, Nov 12, 2012 at 10:12 PM, Charles Hixson
<charleshixsn@earthlink.net> wrote:
> Brian Candler wrote:

>> As others have said: using a Hash may be more efficient, if there are
>> large gaps - but then you may have to sort the keys if you want to
>> iterate over it in order. It depends on your use case.
>>
> Hashes are slow compared to direct indexing.

Did you measure it?  If not, please do.  It's easy to arrive at wrong
conclusions by assuming what may be truth in other contexts in true in
a new context as well.

> I'm expecting this array to eventually use over a GB of storage, so

Are you talking about memory or number of instances?  I am asking
because both are not the same and it may be more difficult to asses
memory consumption than you think.

> efficiency considerations are somewhat significant.  gdbm solves some of
> these, by not dealing with the entire array at once, but only with the
> recently active instances.  But it's slower, which I was trying to avoid by
> using an Array.  However, I don't know just how many items the Array will
> need to hold, it could easily be in the millions, and will at least be in
> the hundreds of thousands.

So how do you know we are talking about "over a GB of storage"?

>  So pre-allocating isn't very desirable, but
> neither is adding instance by instance.  (Besides, they won't necessarily
> come in order.  It could well be that I'll get the 100,000th entry
> immediately after getting the 21st.  No problem for a hash, but if I'm going
> to slow down to hash speed, I should probably just go directly to gdbm.)

Again: measure before you judge.  It's really much better to base
decisions on facts than on speculation.

Cheers

robert
Posted by Brian Candler (candlerb)
on 2012-11-13 14:13
Charles Hixson wrote in post #1084133:
> Hashes are slow compared to direct indexing.

Says who?

> Push might be a good
> answer, but not if it allocates storage item by item.

It doesn't. Look at the MRI C implementation if you really care: the 
underlying array storage size is increased in chunks. Ditto for Strings.

void
rb_ary_store(ary, idx, val)
...
    if (idx >= RARRAY(ary)->aux.capa) {
        long new_capa = RARRAY(ary)->aux.capa / 2;

        if (new_capa < ARY_DEFAULT_SIZE) {
            new_capa = ARY_DEFAULT_SIZE;
        }
        if (new_capa >= ARY_MAX_SIZE - idx) {
            new_capa = (ARY_MAX_SIZE - idx) / 2;
        }
        new_capa += idx;
        REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
        RARRAY(ary)->aux.capa = new_capa;
    }

[from ruby-1.8.7-p352/array.c]

> Allocation of memory is another expensive operation

Says who? Expensive compared to all the other things going on in your 
program? Your program will probably be creating objects 
left-right-and-centre, and having them garbage collected too.

I think you are approaching this the wrong way. I suggest you first 
write some code that works, using the most simple and logical code. 
*Then* optimise it if required: and optimise by measuring. Even for 
experienced programmers, usually the hot-spot turns out not to be where 
they guess it might be. And as you have already implied, you might need 
to optimise for memory usage over speed.

From a speed point of view, I'm pretty confident in saying that the 
implementation of Hash (in C) versus the implementation of Array (in C) 
is unlikely to be the bottleneck in your program. An Array may use less 
memory than a Hash - but since the Array or Hash holds only object 
references (essentially pointers), both may be small compared to the 
memory allocated to the objects you are holding within them.
Posted by Charles Hixson (Guest)
on 2012-11-14 06:31
(Received via mailing list)
Robert Klemme wrote:
> a new context as well.
>
>> I'm expecting this array to eventually use over a GB of storage, so
> Are you talking about memory or number of instances?  I am asking
> because both are not the same and it may be more difficult to asses
> memory consumption than you think.
I'm talking about data size.  The number of instances would probably be
about 1/20 of the number of items.  (Varies because the items themselves
contain Arrays that vary in size.)  But please note that this is a rough
estimate, and could be an underestimate (though I've tried to make a
generous estimate).
> efficiency considerations are somewhat significant.  gdbm solves some of
> these, by not dealing with the entire array at once, but only with the
> recently active instances.  But it's slower, which I was trying to avoid by
> using an Array.  However, I don't know just how many items the Array will
> need to hold, it could easily be in the millions, and will at least be in
> the hundreds of thousands.
> So how do you know we are talking about "over a GB of storage"?
It's my best estimate before I actually build the thing.  I hope it's a
generous estimate, as smaller amounts of data would be easier to handle,
but I don't want to say that I haven't underestimated.
>
> robert
>
> --
> remember.guy do |as, often| as.you_can - without end
> http://blog.rubybestpractices.com/
I've measured it in several languages, and occasionally implemented the
hash algorithm from scratch in C (two or three different ways of
handling collisions...but Unicode means that I don't want to use C,
though pointers was my original reason).  I'll admit I haven't measured
it yet in Ruby, but hash cannot be fast, though direct indexing can be
slow.  It's in the nature of the algorithm.  It can be quite fast
compared to a tree lookup, but not compared to a direct index.
FWIW, the answer I was looking for is that Arrays are automatically
extended to the last index inserted into.  This probably implies that
Arrays use some sort of linked block index, though another possibility
is that it just allocates a honking big chunk of memory, and does a copy
whenever it needs to resize the array.  This seems unlikely.  Also
silly.  So I'll bet on a linked block index.  (N.B.:  Some languages DO
just allocate large blocks of RAM for their arrays, but most of the ones
that I know do that are compiler languages, and they've got reasonable
grounds...big blocks are more efficient to access, and if you're a
compiler the code you compile can be as efficient as you are.  So the
trade-offs are different.  Even then they try to avoid copying.)

P.S.:  Developing this in Ruby *IS* a measurement test.  If I can do it
in Ruby, then I've guessed the right size.  But picking a bad algorithm
to implement wouldn't prove much of anything.  If it doesn't work, I'll
need to fall back to a faster language.  Probably D.  That's one reason
I want to avoid gdbm.  Not only is RAM faster, but it's an easier
conversion to another language if it turns out that I need a faster
language.
Ruby's a lot faster to develop in, so it's not a bad decision even if
Ruby turns out to be too slow (or not able to handle the required amount
of memory), because then I'll have a pretty much written application
that I can translate to a faster language.  (One reason for D is that
there are lots of places where I can execute loops in parallel easily.
That would be harder in Ruby, Python, etc., though in most ways they are
easier.)
Posted by Ryan Davis (Guest)
on 2012-11-14 07:07
(Received via mailing list)
On Nov 13, 2012, at 21:30 , Charles Hixson <charleshixsn@earthlink.net> 
wrote:

> FWIW, the answer I was looking for is that Arrays are automatically extended to 
the last index inserted into.  This probably implies that Arrays use some sort of 
linked block index, though another possibility is that it just allocates a honking 
big chunk of memory, and does a copy whenever it needs to resize the array.  This 
seems unlikely.  Also silly. So I'll bet on a linked block index.

Welcome to silly... tho I'm not sure why you think that.
Posted by Charles Hixson (Guest)
on 2012-11-14 08:20
(Received via mailing list)
Ryan Davis wrote:
> On Nov 13, 2012, at 21:30 , Charles Hixson<charleshixsn@earthlink.net>  wrote:
>
>> FWIW, the answer I was looking for is that Arrays are automatically extended to 
the last index inserted into.  This probably implies that Arrays use some sort of 
linked block index, though another possibility is that it just allocates a honking 
big chunk of memory, and does a copy whenever it needs to resize the array.  This 
seems unlikely.  Also silly. So I'll bet on a linked block index.
> Welcome to silly... tho I'm not sure why you think that.
That's not what the code I saw showed...admittedly my C skill are very
poor.  If it DOES do massive copying, I'm going to need a very different
approach.  And I'm not sure what would work.  But why wouldn't Ruby use
memory mapping?  Copying a GB of memory around doesn't sound like a
feasible approach, even though this would be done at full C speed.

If that's true, then using a database is the only way I can think of
that would work, and there are several reasons why I want to avoid that.
Posted by Robert Klemme (robert_k78)
on 2012-11-14 09:00
(Received via mailing list)
On Wed, Nov 14, 2012 at 6:30 AM, Charles Hixson
<charleshixsn@earthlink.net> wrote:
> I've measured it in several languages, and occasionally implemented the hash
> algorithm from scratch in C (two or three different ways of handling
> collisions...but Unicode means that I don't want to use C, though pointers
> was my original reason).  I'll admit I haven't measured it yet in Ruby, but
> hash cannot be fast, though direct indexing can be slow.  It's in the nature
> of the algorithm.  It can be quite fast compared to a tree lookup, but not
> compared to a direct index.

That is still speculation.

> FWIW, the answer I was looking for is that Arrays are automatically extended
> to the last index inserted into.  This probably implies that Arrays use some
> sort of linked block index, though another possibility is that it just
> allocates a honking big chunk of memory, and does a copy whenever it needs
> to resize the array.  This seems unlikely.

Please stop speculating and verify the facts (i.e. check Ruby source
code).  You'll do yourself a big favor.

Note also that you can still implement a type with the same interface
as Hash or Array and drop it into your implementation later if you
find the std lib types to be too slow for your purposes.

Cheers

robert
Posted by Brian Candler (candlerb)
on 2012-11-14 09:46
Charles Hixson wrote in post #1084366:
> FWIW, the answer I was looking for is that Arrays are automatically
> extended to the last index inserted into.  This probably implies that
> Arrays use some sort of linked block index, though another possibility
> is that it just allocates a honking big chunk of memory, and does a copy
> whenever it needs to resize the array.  This seems unlikely.  Also
> silly.  So I'll bet on a linked block index.

Then you lose the bet. I already posted the code.

The code says: when you set an element beyond the end, the underlying 
array storage capacity (aux.capa) is increased to the new end index plus 
half the current capacity; new storage is allocated and the old data 
copied into the new (read the manpage for "realloc"). So the data is 
copied but the number of copy operations goes down as the array grows.

This is of course *way* more efficient for most purposes than an linked 
list implementation, because the memory is contiguous and all subsequent 
get/set operations are then a simple pointer offsets rather than walking 
a list (even a list of chunks).

> P.S.:  Developing this in Ruby *IS* a measurement test.  If I can do it
> in Ruby, then I've guessed the right size.  But picking a bad algorithm
> to implement wouldn't prove much of anything.

You're talking about two different things. Picking a bad algorithm is of 
course a bad idea, and will give you bad performance (relative to a good 
algorithm) in any language.

But what you are discussing is the same algorithm, simply whether c[k] 
is slower or more memory inefficient if collection c is an Array or a 
Hash. This is almost certainly irrelevant in the context of your 
application.

Code using the obvious approach: if the keys are all non-negative 
integers and not too sparse, use an Array. Otherwise use a Hash.

If it doesn't all fit into RAM then this is probably due to the space 
used by the objects themselves, not the collection (I notice you haven't 
asked how much RAM a String object uses). If it doesn't run fast enough 
then the set/get operations of Array or Hash are almost certainly not 
the culprit.
Posted by Charles Hixson (Guest)
on 2012-11-14 20:07
(Received via mailing list)
Brian Candler wrote:
> rb_ary_store(ary, idx, val)
>          new_capa += idx;
>
> memory than a Hash - but since the Array or Hash holds only object
> references (essentially pointers), both may be small compared to the
> memory allocated to the objects you are holding within them.
>
Thank you.

I'm going to need to think about the implications of that.  It's looking
like I need to change something basic in my design.  An Array was the
simple and most logical code,  but if Arrays are as slow as Hashes (or
nearly so) something is not appropriate for this project.  Perhaps I
WILL need to use a database.
Posted by Ryan Davis (Guest)
on 2012-11-14 21:14
(Received via mailing list)
On Nov 14, 2012, at 11:06 , Charles Hixson <charleshixsn@earthlink.net> 
wrote:

> I'm going to need to think about the implications of that.  It's looking like I 
need to change something basic in my design.  An Array was the simple and most 
logical code,  but if Arrays are as slow as Hashes (or nearly so) something is not 
appropriate for this project.  Perhaps I WILL need to use a database.

Like everyone else on this thread, I'm going to have to say:

would you PLEASE just fucking measure something already???

  require 'benchmark'

it's not hard.
Posted by Charles Hixson (Guest)
on 2012-11-14 21:49
(Received via mailing list)
Brian Candler wrote:
> Charles Hixson wrote in post #1084133:
>> Hashes are slow compared to direct indexing.
> Says who?
This is a commonly true statement for a wide variety of languages.  I am
quite likely to be translating this work into another language.  In the
other language I am certain that it is a true statement.  Unless Array
access is significantly slower than Hash access (which I cannot believe)
this is the way to proceed.
>
>> Push might be a good
>> answer, but not if it allocates storage item by item.
> It doesn't. Look at the MRI C implementation if you really care: the
> underlying array storage size is increased in chunks. Ditto for Strings.
That's ok.  push was the wrong answer anyway.  The right answer was that
Arrays automatically extend to encompass the largest index that you use
to store into them.
>          if (new_capa>= ARY_MAX_SIZE - idx) {
>              new_capa = (ARY_MAX_SIZE - idx) / 2;
>          }
>          new_capa += idx;
>          REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
>          RARRAY(ary)->aux.capa = new_capa;
>      }
>
> [from ruby-1.8.7-p352/array.c]
Sorry, my C isn't that good, and there are a lot of undefined terms in
that snippet.  I can't really say I understand it.  I can usually guess
what's going on, but undefined macros make me quite unsure...and I try
to avoid pointers when writing code, because I want to understand it
later.  For that matter, I usually avoid macros, too.  I realize that
this is a common C coding style, but it's one reason I dislike C.  I'd
prefer D, FORTRAN, or even Ada or Eiffel.  (Yaa...the code is what it
is.  But to me what it is is confusing.)
>
>> Allocation of memory is another expensive operation
> Says who? Expensive compared to all the other things going on in your
> program? Your program will probably be creating objects
> left-right-and-centre, and having them garbage collected too.
Among system calls, allocation of memory is one of the more expensive.
OTOH, the comment earlier about Array automatically extending itself
answers this as there are
steps automatically taken to minimize the number of allocations of
memory.  I don't think that using push repeatedly is a good idea (and in
any case, it doesn't fit with my logic, unless there is something about
the system that demands that it be done that way, which there isn't).
>
> I think you are approaching this the wrong way. I suggest you first
> write some code that works, using the most simple and logical code.
> *Then* optimise it if required: and optimise by measuring. Even for
> experienced programmers, usually the hot-spot turns out not to be where
> they guess it might be. And as you have already implied, you might need
> to optimise for memory usage over speed.
If I design the basic framework of the application around a poor design,
I'll have to rewrite the entire thing.  This is something to get right
at the start.  It's not a premature optimization.  There are lots of
places where I'm taking to "do enough and patch it later" approach.  The
reason this isn't one, is because this needs to be handled now.
>
> > From a speed point of view, I'm pretty confident in saying that the
> implementation of Hash (in C) versus the implementation of Array (in C)
> is unlikely to be the bottleneck in your program. An Array may use less
> memory than a Hash - but since the Array or Hash holds only object
> references (essentially pointers), both may be small compared to the
> memory allocated to the objects you are holding within them.
>
OK.  If it's not the bottleneck, that will be very good.  Is there ANY
reason that Hash is a better choice?  I don't believe that there is, but
I could be wrong.  Certainly if they are the same speed, and use about
the same amount of memory, I would prefer to use an Array.  My
expectation is that Array will be faster as well as using significantly
less RAM, but were they the same, I would still prefer to use an Array,
as it will match more nearly the preferred order of I/O (among other
reasons).  It is true that my reasons for preferring an Array are all
minor, and if Array has some real drawback, then I would readily switch
to Hash.  But I haven't heard of one.  I had some questions about how to
use it, but those have, I believe, been answered.
P.S.:  I understand that if I use a Hash, item will be retrieved in the
order entered when I iterate through them.  This isn't my desired
result.  I want to retrieve them in index # order.  I'd rather not have
to sort them.

P.P.S:  A part of my design is driven by my desire to have a language
independent form of serialization...and YAML won't work for my
purposes.  I may decide to have an isomorphic realization of this in a
binary serialization that isn't language independent, but that WOULD be
premature optimization.
Posted by Matthew Kerwin (mattyk)
on 2012-11-15 03:29
(Received via mailing list)
On 15 November 2012 06:48, Charles Hixson 
<charleshixsn@earthlink.net>wrote:

> Sorry, my C isn't that good, and there are a lot of undefined terms in
> that snippet.  I can't really say I understand it.  I can usually guess
> what's going on, but undefined macros make me quite unsure...and I try to
> avoid pointers when writing code, because I want to understand it later.
>  For that matter, I usually avoid macros, too.  I realize that this is a
> common C coding style, but it's one reason I dislike C.  I'd prefer D,
> FORTRAN, or even Ada or Eiffel.  (Yaa...the code is what it is.  But to me
> what it is is confusing.)
>

This is going to sound elitist and snobby, but I'm pretty sure that's 
the
whole problem right there. If you avoid pointers in C because they're
confusing, then do not ever write C. Pointers are a fundamental part of 
the
language and how you get things done in it, and actually quite simple 
when
you're thinking at the C level.  It's not a coding style, it's how C 
works.
 And yes, without pointers preallocated arrays will often be faster than
everything else, because without pointers you're losing the ability to 
do
operations like memory copying and collection restructuring and object
comparisons by reference; reading and writing entire objects is always 
slow.

I don't know the Ruby C API at all, but my guess at interpreting the
snippet would be:

    if (idx>= RARRAY(ary)->aux.capa) {  /* if the index is beyond the
RARRAY (Ruby Array)'s auxiliary capacity (i.e. total allocated space?) 
*/
        long new_capa = RARRAY(ary)->aux.capa / 2; /* new_capa = current
capa / 2 */

         if (new_capa<  ARY_DEFAULT_SIZE) { /* clamp to some globally
defined minimum */
             new_capa = ARY_DEFAULT_SIZE;  /* (e.g. don't extend by 3, 
when
512 would be more sensible?) */
         }
         if (new_capa>= ARY_MAX_SIZE - idx) { /* clamp to some globally
understood maximum */
             new_capa = (ARY_MAX_SIZE - idx) / 2; /* i.e. half the
available maximum space */
         }
         new_capa += idx;
         REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa); /* some realloc()
macro -- looks exactly like realloc() to me */
         RARRAY(ary)->aux.capa = new_capa; /* tell the array what its
auxiliary capacity has become */
     }


If I design the basic framework of the application around a poor design,
> I'll have to rewrite the entire thing.  This is something to get right at
> the start.  It's not a premature optimization.  There are lots of places
> where I'm taking to "do enough and patch it later" approach.  The reason
> this isn't one, is because this needs to be handled now.


Indeed. Write the algorithm. Optimise the algorithm. Make sure you're 
using
sensible algorithmic techniques. Which container you're using isn't part 
of
that algorithm.

Again: which container you're using isn't part of the algorithm, 
especially
in an OO context like Ruby.  In the algorithm you say "add to the
container" or "look up thingy in the container."  Then when you've
implemented it, if the algorithm is gravy but execution takes too long,
maybe think about alternative _containers_, based on how your (already
good) algorithm is using it.  E.g. sparse storage suggests not using an
array, random inserts but sorted iteration implies some sort of heap or
something, etc.  The main point is, the outer algorithm is the big deal;
you abstract the container with a little black box called "the 
container"
until you know more about how the whole system works.

Ruby is a high enough level language that you can abstract the 
container's
inner workings like this safely.  It's also high enough that you can't
really predict the behaviour without _trying_ it (or without having 
written
dozens of other similar programs and getting a feel for how various 
objects
behave.)

So again, in summary: write the part of the algorithm you can control, 
and
make it as good as can be. Then use empirical testing to find the best
objects to fill in those black boxes.

--
  Matthew Kerwin, B.Sc (CompSci) (Hons)
  http://matthew.kerwin.net.au/
  ABN: 59-013-727-651

  "You'll never find a programming language that frees
  you from the burden of clarifying your ideas." - xkcd
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.