Why Does Hash Apparently Reorder Its Internal Representation

casper_the_ghost · August 22, 2006, 2:08am

Hashes are typically implemented with a number of buckets pre-allocated,
usually in the form of an array. Hash values are calculated on the keys
to
be inserted, and a bucket is chosen mod bucket_count. So it’s pretty
much
anyone’s guess where a given hash will put a key. The apparent
reordering is
actually not quite correct; what was seen is apparent ordering. The
fact
that the insertion order was retained was only a lucky coincidence. Try
inserting a, b, and c in a different order.

To make matters worse, a hash will frequently have collisions when
choosing
a bucket, so you generally also have a linked list in each bucket to
hold
values that collide. Then if there are n elements with the same hash mod
bucket_count, you’ll have to iterate over them looking for the correct
element (in Ruby’s case, I believe the one that matches eq?). To avoid
too
many collisions impacting hash performance, it may be necessary to
expand
the number of buckets and rehash everything…at a cost of n, where n is
the
number of elements inserted up to that point. After rehashing, all
apparent
ordering will likely have disappeared.

The general associative structure is called a map, not a hash, because
it
maps keys to values. A hash is one way to implement a map that gives
insertion and lookup performance guarantees; specifically, values can be
inserted, removed, and queries in constant time (as opposed to some
polynomial of n elements). The trade-off is that order is not
guaranteed,
since order ends up being determined by the hash values.

You can implement a map that preserves or maintains order in multiple
ways;
you can use a tree–or specifically a “red/black tree”, which is memory
optimal but requires at most log(n) time a given value, and something
like
log(n) to rebalance the tree periodically. The characteristic used to
maintain order (and balance) in the tree is not important. You can also
maintain a hash and an ordered list internally, using the list for
iteration
and the hash for all other operations. Hash lookups remain constant time
but
modifications take some polynomial of n (because you now have to update
the
ordered list).

It’s all about tradeoffs. Array is a minimum implementation of an
ordered
list…in that it maintains the order you provide for it. It is not a
sorted
list, which maintains an order based on some external characteristic
(like a
sort algorithm or <=>). Hash is a minimal implementation of an unordered
map, which provides fast access to named keys but does not preserve
order or
perform any sorting.

Those calling for an ordered hash are really looking for an ordered map,
and
it seems in many cases they want a sorted map (or an ordered or sorted
“associative array”, as others have mentioned…tomato/tomato). Hash
should
not be modified to preserve order, since it would add overhead to many,
many
cases where order is unnecessary additional overhead. I wouldn’t be
opposed
to adding a general-purpose Tree, to allow those who want to implement
an
ordered map a fast, native data structure to do so; but I would oppose
adding uncommon variants of hash to the standard libraries, specifically
the
OHash (and IHash?!?) that have come up a few times.

casper_the_ghost · August 22, 2006, 3:56pm

Hal F. wrote:

Phrogz wrote:

And, for what it’s worth, JavaScript’s Object primitive is also an
associative array that retains insertion order for iterating keys,
while exhibiting performance characteristics of a hash.

I find that interesting, since so many appear convinced that
a so-called ordered hash would be “too slow.”

Do you know anything about the internal implementation?

I don’t. I wouldn’t call most JavaScript engines ‘fast’, but I suspect
that has to do with aspects other than the native Object type.

Without having tried it, I would think that preserving insertion order
would be a (small) memory size addition to each hash (as you’ve
mentioned before), a tiny slowdown in the speed to store a new entry,
and no performance change in accessing an entry.

Anyhow, because I’m too lazy to do so, you (or someone else) can get
source code for the SpiderMonkey[1] and Rhino[2] and check out how
insertion order is preserved on Object.

[1] http://www.mozilla.org/js/spidermonkey/ - the C-based JS engine
[2] http://www.mozilla.org/rhino/ - the Java-based JS engine

casper_the_ghost · August 22, 2006, 5:50pm

Phrogz wrote:

And, for what it’s worth, JavaScript’s Object primitive is also an
associative array that retains insertion order for iterating keys,
while exhibiting performance characteristics of a hash.

Clarification: apparently the ECMAScript spec[1] don’t actually require
that the insertion order be preserved; it’s just that the major JS
engine implementations happen to do so. On the order of properties, the
spec actually says:

“Step 5: Get name of the next property […] that doesn’t have the
DontEnum attribute.
[…]
The mechanics of enumerating the properties (step 5 in the first
algorithm, step 6 in the second) is implementation dependent. The order
of enumeration is defined by the object. Properties of the object being
enumerated may be deleted during enumeration. If a property that has
not yet been visited during enumeration is deleted, then it will not be
visited. If new properties are added to the object being enumerated
during enumeration, the newly added properties are not guaranteed to be
visited in the active enumeration.”

[1]
http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf

casper_the_ghost · August 22, 2006, 4:51am

[email protected] wrote in message
news:2d4684b07b9f965e5078fdb46754fac3@shiny…

purposes of symmetry. I’m not even sure there is any need for either,
such that I may be trying to achieve the unnecessary?..

How would this even work?
For instance:

hash = { 0 => ‘a’, 1 => ‘b’, ‘c’ => ‘d’ }

for me, this prints ‘a’, ‘d’, ‘b’, in that order…

hash.each { |k,e| puts e }

puts hash[1] # what will this print?

There's no way to tell whether the last line of this code is trying

to
access the second element of the hash or the one that maps to the Fixnum
1…

casper_the_ghost · August 23, 2006, 2:14am

Phrogz wrote:

source code for the SpiderMonkey[1] and Rhino[2] and check out how
insertion order is preserved on Object.

[1] http://www.mozilla.org/js/spidermonkey/ - the C-based JS engine
[2] http://www.mozilla.org/rhino/ - the Java-based JS engine

Haha… I’ve put that in my notes. That’s all I guarantee right now.

Don’t ever get in a laziness contest with a master.

Hal

casper_the_ghost · August 23, 2006, 2:50am

[email protected] wrote:

Hi –

Is there a reason why you don’t simply inheret from Hash, or is this
just another one of those There’s More Than One Way To Do It scenarios?

You mean TMTOTMTOWTDIS?

Groan.

And while we’re sliding horribly Hoff topic, is it just me, or is there
an uncanny amount of ‘The Newbie that cried “Wolf!”’ (or rather “Bug!”)
threads cropping out lately? Not like I’m complaining, they sidetrack
into relatively interesting, if not too related exchanges very soon, but
still.

David V.

casper_the_ghost · August 22, 2006, 7:37pm

Hal F. wrote:

an order – first element, second element, and so on. We can “sort”
a regular hash, but we get an array back.)

You’re right. Ordered merely means that elements have a position, I had
it confused with ‘sorted’.

I once proposed the name “Map” for such a class – there was some reason
this wasn’t considered good, but I can’t recall why.

Yup, associated arrays are maps, not hashes.

After reading the nutter’s on google groups (hate the broken
nntp<->mailing list concept), I realize that calling them hashes isn’t
appropriate. Once you change their caracteristics they aren’t really
hashes any more.

SortedMap (or TreeMap) and (Insertion)OrderedMap (or perhaps
LinkedHashMap; doubly linked list + hash) are probably better names…?

Isak

casper_the_ghost · August 23, 2006, 3:36am

I don’t think ‘ordered’ is a good name for something sorted by
insertion order.

The fact that something can be “sorted” at all is only because it
has an order, i.e., is sequential.

Uh, no.

Whether it can be sorted depends on whether the contents are
comparable.

Both are partly true. The sorting depends on comparable, but if the
data structure is not sequencable (for lack of a better term) the
results of sorting will need to be returned in something that is
(such as an Array) rather than updated in place. In other words, you
could implement #sort, but not #sort!.

Matthew

casper_the_ghost · August 23, 2006, 3:30am

On 8/21/06, Hal F. [email protected] wrote:

Isak wrote:

I don’t think ‘ordered’ is a good name for something sorted by
insertion order.

The fact that something can be “sorted” at all is only because it
has an order, i.e., is sequential.

Uh, no.

Whether it can be sorted depends on whether the contents are comparable.

irb(main):004:0> class Foo
irb(main):005:1> end
=> nil
irb(main):006:0> [Foo.new, Foo.new].sort
NoMethodError: undefined method <=>' for #<Foo:0xb7dfe658> from (irb):6:in sort’
from (irb):6
from :0

–
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

casper_the_ghost · August 23, 2006, 4:18am

Rick DeNatale wrote:

Uh, no.

Whether it can be sorted depends on whether the contents are comparable.

I wasn’t speaking in a Ruby sense. And even if I were, there
are other ways of sorting besides calling the sort method.

Hal