On Wed, Jul 26, 2006 at 05:03:50AM +0900, Charles O Nutter wrote:

> And even in the TreeMap case, you sacrifice performance for features.

Very true. While a peek at the source code will reveal that the
Red-Black tree it uses is well-implemented, nothing comes for free.
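To make that tradeoff concrete, here's a minimal sketch (class and
method names are mine, purely illustrative):

```java
import java.util.TreeMap;

public class MapDemo {
    // Insert the keys into a TreeMap and report the iteration order.
    // A TreeMap (a Red-Black tree underneath) always yields its keys
    // sorted; a plain HashMap promises no particular order at all.
    static String sortedKeys(String[] keys) {
        TreeMap<String, Integer> tree = new TreeMap<String, Integer>();
        for (String k : keys) {
            tree.put(k, k.length());
        }
        return tree.keySet().toString();
    }

    public static void main(String[] args) {
        // Keys inserted out of order come back sorted.
        System.out.println(sortedKeys(new String[] {"pear", "apple", "mango"}));
        // prints [apple, mango, pear]
    }
}
```

That sorted iteration is exactly the feature you pay the O(log n) for.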

> An unordered hash will be the fastest for lookup and insertion.

That depends, though it *is* generally true. However, your data
structures should suit the problem domain and the constraints you're
working under.

> Adding any additional features will slow it down out of the necessity
> of imposing an order on either keys or values. With TreeMap, I think
> it's something like O(log n) lookup times rather than the theoretical
> O(1) with a straight hash, and that's not counting any tree
> rebalancing needed after future insertions.

Theoretically. However, while looking at speed through the lens of time
complexity can be a good rule of thumb, it can also be misleading.

For instance, say you had a dataset that you had to do lookups upon,
and you had two choices as to the method to use. The first is a linear
scan of the unordered data, and the other is some keyed lookup. The
former would give you O(n) performance, while the latter would give
O(1).

If you didn't have any more information, you'd be forgiven for choosing
the latter method. But if scanning each element takes only 0.01 ms,
whereas the key calculation required in the latter method takes 2 s,
it's a no-brainer to pick the O(n) method over the O(1) method if your
dataset is of the right size.
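Spelling out that arithmetic (the per-element figures are the made-up
ones above, not measurements):

```java
public class CostSketch {
    // Mean cost of finding one element by scanning: on average we
    // inspect half of the n elements before hitting the right one.
    static double linearScanMs(int n, double perElementMs) {
        return (n / 2.0) * perElementMs;
    }

    // Cost of the keyed lookup: one (expensive) key computation,
    // after which the probe itself is effectively free.
    static double keyedLookupMs(double keyCalcMs) {
        return keyCalcMs;
    }

    public static void main(String[] args) {
        double scan = linearScanMs(10000, 0.01);  // ~50 ms for 10,000 elements
        double keyed = keyedLookupMs(2000.0);     // 2 s per lookup
        // The "slower" O(n) method wins by a factor of 40 at this size.
        System.out.println(scan < keyed);         // prints true
    }
}
```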

In the case of hashes, you only get O(1) if (a) the hash can be
calculated in constant time and (b) you're using a perfect hash
function. Strictly speaking, just as Quicksort's worst-case performance
is O(n^2), insertion and lookup for hashtables can degrade to O(n + m),
where the hash comes to resemble a bunch of keyed linked lists, *n* is
the mean time to convert the hash key to an index, and *m* is the mean
time required to do a linear scan of the appropriate linked list. Such
cases are rare, but they're inherent in the way hashes are implemented.
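That degenerate case is easy to provoke deliberately: give every key
the same hash code and the whole table collapses into one bucket, so
lookups go linear. A sketch (class names mine; note that some modern
JDKs blunt the worst case by converting long chains into balanced
trees):

```java
import java.util.HashMap;

public class BadHashDemo {
    // A key type whose hashCode is constant: every entry lands in the
    // same bucket, so the table degenerates into one long chain.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }  // pathological on purpose
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    // Fill a map with n colliding keys, then look one up. The answer
    // is still correct; only the lookup time degrades toward a scan.
    static int lookup(int n, int target) {
        HashMap<BadKey, Integer> m = new HashMap<BadKey, Integer>();
        for (int i = 0; i < n; i++) {
            m.put(new BadKey(i), i);
        }
        return m.get(new BadKey(target));
    }

    public static void main(String[] args) {
        System.out.println(lookup(1000, 999)); // prints 999
    }
}
```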

Big-O notation is only meant as a guide to algorithm choice and is no
substitute for an understanding of the tradeoffs involved with each
potential algorithm and how they actually perform when used with your
datasets.

Trees are good when you (a) want to guarantee how long it will take to
add, look up, or remove an element or (b) need the keys to be ordered
by some property. Most of the time, we don't need these properties.
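When you do need the ordering, though, the tree pays its way. For
example, with Java's TreeMap (sample data and names mine):

```java
import java.util.TreeMap;

public class RangeDemo {
    // A small sample map; a TreeMap keeps its String keys sorted.
    static TreeMap<String, Integer> sample() {
        TreeMap<String, Integer> m = new TreeMap<String, Integer>();
        m.put("apple", 1);
        m.put("mango", 2);
        m.put("pear", 3);
        return m;
    }

    // Smallest key greater than or equal to the bound: an O(log n)
    // range query that an unordered hash simply cannot answer.
    static String firstAtOrAfter(TreeMap<String, Integer> m, String bound) {
        return m.ceilingKey(bound);
    }

    public static void main(String[] args) {
        System.out.println(firstAtOrAfter(sample(), "b")); // prints mango
    }
}
```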

> At minimum, you'll double up memory storing an array and a hash
> together, or you'll sacrifice performance maintaining and traversing
> a balanced tree.
>
> …
>
> Leave hash alone; custom data structures are fine when the peculiar
> beast known as an "ordered hash" is actually useful.

Agreed. There's no sense in adding complexity to the language to cope
with something that most of us use rarely, if ever. That is, after all,
what libraries are for.

K.