How to order a hash based on its keys?

SSISSaki_Baz_C.SS_S · June 22, 2011, 4:07pm

On Wed, Jun 22, 2011 at 3:47 PM, Iaki Baz C. [email protected]
wrote:

end

priority.
course, but taking into account the weight of each SRV record).

I’ve it already working, but I will take a more deepth look to your
above code to integrate it.

Hm… Now I am not sure any more whether you need a associative
container at all. Now we have three steps. From what I understand
you do:

items are collected
items are ordered by priority (what you call second container above)
items are ordered by some randomization algorithm

I’m obviously missing other access operations out here. So far I
don’t see any need for a Hash or similar structure. What else is it
that you need to do with these items? What properties do they have?

Kind regards

robert

SSISSaki_Baz_C.SS_S · June 22, 2011, 3:24pm

On Wed, Jun 22, 2011 at 11:50 AM, Iaki Baz C. [email protected]
wrote:

a trade off here. What are your requirements anyway?

Ok. I get some data and convert it into a Hash whose key is a priority
value (integer, 0 is the best priority). The value of each entry is of
course the data value associated with such priority. But such hash has
not been ordered, this is, probably entry 1 has priority 2 while entre
2 has priority 1 (or whatever).

Well, in that case an Array would be much more efficient. You just
need a way to deal with holes, e.g.

prio_data = []

def prio_data.each_pair
each_with_index do |x,i|
yield i, x if x
end
end

prio_data[1] = “foo”
prio_data[4] = “bar”

prio_data.each_pair {|prio,v| printf “key=%p val=%p\n”, prio, v}

Then I just want to get a new hash in which entries are ordered by
priority.

Why do you need a second container object? Is this just for sorting
purposes or do you actually need to maintain the state from a given
point in time?

Kind regards

robert

PS: In similar situations a priority queue can be helpful.

SSISSaki_Baz_C.SS_S · June 22, 2011, 4:20pm

2011/6/22 Robert K. [email protected]:

that you need to do with these items? What properties do they have?
Step 1)

I get data (DNS SRV records), which is an array with entries like this:

#<EventMachine::Udns::RR_SRV:0x00000001ea4970 @domain=“domain.org”,
@priority=1, @weight=50, @port=5062>

Each entry is a SRV record. It has a priority (lowest value is the
best) and weight (higher value means more probability to choose it).

The client getting such array of SRV records wants to connect to a
server based on SRV priorities/weight, which works as follows:

SRV records with best priority must be used first.
If there are various SRV records with same priority, then the
probability of choosing each one depends on its weight value (random
with weight).
If the chosen SRV record fails to connect (i.e. server down) next
one (based on same priority/weight rules) must be tryed.

SSISSaki_Baz_C.SS_S · June 22, 2011, 5:08pm

2011/6/22 Piotr S. [email protected]:

srvs.sort { |a, b| (a.priority <=> b.priority).nonzero? or (b.weight <=>
a.weight).nonzero? or rand }

The problem is that a SRV record with priority 1 and weight 20 must
not be always preferred over another record with priority 1 and
weight 10 (the first record must be chosen ~2 times as its weight is
double high).

Anyhow I’ve already coded a method for such weight algorithm

Thanks.

SSISSaki_Baz_C.SS_S · June 22, 2011, 5:02pm

Iñaki Baz C.:

I get data (DNS SRV records), which is an array with entries like this:

#<EventMachine::Udns::RR_SRV:0x00000001ea4970 @domain=“domain.org”, @priority=1,
@weight=50, @port=5062>

Each entry is a SRV record. It has a priority (lowest value is the
best) and weight (higher value means more probability to choose it).

The client getting such array of SRV records wants to connect to a
server based on SRV priorities/weight, which works as follows:

SRV records with best priority must be used first.

If there are various SRV records with same priority, then the
probability of choosing each one depends on its weight value (random
with weight).

If the chosen SRV record fails to connect (i.e. server down) next
one (based on same priority/weight rules) must be tryed.

Something to the tune of

srvs.sort { |a, b| (a.priority <=> b.priority).nonzero? or (b.weight <=>
a.weight).nonzero? or rand }

should work, then.

— Piotr S.

SSISSaki_Baz_C.SS_S · June 22, 2011, 7:38pm

On Wed, Jun 22, 2011 at 4:19 PM, Iaki Baz C. [email protected]
wrote:

don’t see any need for a Hash or similar structure. What else is it
Each entry is a SRV record. It has a priority (lowest value is the
best) and weight (higher value means more probability to choose it).

The client getting such array of SRV records wants to connect to a
server based on SRV priorities/weight, which works as follows:

SRV records with best priority must be used first.

If there are various SRV records with same priority, then the
probability of choosing each one depends on its weight value (random
with weight).

If the chosen SRV record fails to connect (i.e. server down) next
one (based on same priority/weight rules) must be tryed.

Aha! This is how I’d probably do it.

gist.github.com

https://gist.github.com/rklemme/1040631

output.txt

#
# statistics
#

#<WorkQueue:0x10039da4
 @p=#<Proc:0x10032734>,
 @q=
  [#<Picker:0x10039d7c
    @items=[#<struct SRV domain="server-1", port=1234, priority=1, weight=1>],
    @total=1,

This file has been truncated. show original

srv.rb

# Picks any of given items based on weight
class Picker
  def initialize(items = nil, &weight)
    @total = 0
    @items = []
    @weight = weight
    items.each {|it| self << it} if items
  end

  def empty?

This file has been truncated. show original

Kind regards

robert

SSISSaki_Baz_C.SS_S · June 23, 2011, 12:11am

2011/6/22 Robert K. [email protected]:

Aha! This is how I’d probably do it.
Prioritized work queue with weighted items · GitHub

It’s really cool

I’ve done it using a different approach, but I think the main
algorithm is similar:

pick = rand(@total)
@items.each_with_index do |it, idx|
  sum += @weight[it]
  return idx if sum > pick
end

I do something similar:

https://gist.github.com/1041339

Note that my code does not use real SRV records for now, but similar
objects (hash/arrays in fact) and I provide it in a specific format (a
hash with SRV priorities as keys).

I’m mostly interested in efficience and I’m satisfied with the results
of my code.

Thanks a lot.

SSISSaki_Baz_C.SS_S · June 23, 2011, 5:13pm

On Thu, Jun 23, 2011 at 12:11 AM, Iaki Baz C. [email protected]
wrote:

2011/6/22 Robert K. [email protected]:

Aha! This is how I’d probably do it.
Prioritized work queue with weighted items · GitHub

It’s really cool

Oh, thank you!

https://gist.github.com/1041339

Note that my code does not use real SRV records for now, but similar
objects (hash/arrays in fact) and I provide it in a specific format (a
hash with SRV priorities as keys).

I rather use specific types than Array and Hash since those make the
code more readable. Also, this encourages more OOish programs where
you distribute functionality across classes.

I’d also not use recursion in srv_entries_randomize() - a loop is
usually more efficient. And btw. you calculate the total weight every
time the method is invoked as sum of all entries while I maintain the
@total and adjust it only for every insertion and removal.

I notice you have a require ‘benchmark’ in there but I don’t see any
Benchmark methods used… You also seem to have the habit of placing
assignments in method argument lists or control flow statements. This
makes code harder to read and is really only needed in case of loops,
e.g.

while (str = io.gets)
printf “We have read: %p\n”, str
end

There is one thing I don’t understand in your code: you have two
randomizations in there: in line 8 there is rand() similar to what I
have done and in line 34 there is shuffle. Why do you do that? Is
there a requirement that hasn’t been mentioned yet?

I’m mostly interested in efficience and I’m satisfied with the results
of my code.

Well, then that’s good.

Thanks a lot.

You’re welcome!

Kind regards

robert

SSISSaki_Baz_C.SS_S · June 23, 2011, 6:42pm

2011/6/23 Robert K. [email protected]:

I’d also not use recursion in srv_entries_randomize() - a loop is
usually more efficient.

I was not able to do it with a loop, neither writing the algorithm in
a paper, I always got bad statistical results. Are you sure sure that
your code generates statistical results? for example, with my example
data:

priorities = {
1 => [[0, :“server-1”]],
2 => [[16, :“server-2-A”], [4, :“server-2-B”], [8, :“server-2-C”]],
4 => [[50, :“server-3”]]
}

I get there correct results (columns mean position from 1 to 5):

Iterating 50000 times…

Results:

server-1: 50000 0 0 0 0
server-2-A: 0 28488 16302 5210 0
server-2-B: 0 7226 12245 30529 0
server-2-C: 0 14286 21453 14261 0
server-3: 0 0 0 0 50000

And btw. you calculate the total weight every
time the method is invoked as sum of all entries while I maintain the
@total and adjust it only for every insertion and removal.

Right. I’m trying to improve that. However take into account that my
code does not need to create an instance. Instead it will be a class
method (or a module method like DNS::srv_randomize), so I cannot use
attributes (or I should not).

But you are right, I must get removing the recursion. If you can prove
me that your code gets same results for 10000 iterations with same
input data I will adapt my code

I notice you have a require ‘benchmark’ in there but I don’t see any
Benchmark methods used…

Initially I used it. Later I just do a “start_ime = Time.now” and so.

You also seem to have the habit of placing
assignments in method argument lists or control flow statements. This
makes code harder to read and is really only needed in case of loops,
e.g.

while (str = io.gets)
printf “We have read: %p\n”, str
end

Right, in fact I did it due to performance reasons, to avoid double
access to the same element of a hash, but I’ve realized that depending
on the case, it’s just more efficient to perform double access rather
than generating a new variable.

There is one thing I don’t understand in your code: you have two
randomizations in there: in line 8 there is rand() similar to what I
have done and in line 34 there is shuffle. Why do you do that? Is
there a requirement that hasn’t been mentioned yet?

You are right, sorry. If a SRV record has weight 0, there should be no
chance it to be the chosen first (before other records with same
priority and weight greater than 0). So what I do is remove SRV
records with same priority and weight 0 and then make a simple shuffle
with them, adding the results in the last position. For example:

priority 1, weight 10, domain “server1”, port 5060
priority 1, weight 0, domain “server2”, port 5060
priority 1, weight 0, domain “server3”, port 5060

In this case, records 2 and 3 should always be chosen after record 1
(which has weight > 0). The order of records with weight 0 must be
random.

Really thanks a lot for your interest. It’s very helpful.

SSISSaki_Baz_C.SS_S · June 24, 2011, 9:47am

On Thu, Jun 23, 2011 at 7:12 PM, Iaki Baz C. [email protected]
wrote:

2011/6/23 Robert K. [email protected]:

I rather use specific types than Array and Hash since those make the
code more readable. Also, this encourages more OOish programs where
you distribute functionality across classes.

I know that my code is not very “ellegant” and does not follow Ruby’s fashion
But this code will not be exposed to users of a library. Instead this
will be used internally by my application/library, which will expose
decent API methods.

Still the readability argument applies. Imagine you have to maintain
the code after a year not looking at it - or someone else has to
maintain it…

Kind regards

robert

SSISSaki_Baz_C.SS_S · June 23, 2011, 7:13pm

2011/6/23 Robert K. [email protected]:

I rather use specific types than Array and Hash since those make the
code more readable. Also, this encourages more OOish programs where
you distribute functionality across classes.

I know that my code is not very “ellegant” and does not follow Ruby’s
fashion
But this code will not be exposed to users of a library. Instead this
will be used internally by my application/library, which will expose
decent API methods.

Regards.

SSISSaki_Baz_C.SS_S · June 24, 2011, 10:27am

On Thu, Jun 23, 2011 at 6:42 PM, Iaki Baz C. [email protected]
wrote:

2011/6/23 Robert K. [email protected]:

I’d also not use recursion in srv_entries_randomize() - a loop is
usually more efficient.

I was not able to do it with a loop, neither writing the algorithm in
a paper, I always got bad statistical results.

That was certainly not an effect of the recursion. You probably
accidentally added another issue.

I get there correct results (columns mean position from 1 to 5):

Would that be sufficient statistical enough for you?

gist.github.com

https://gist.github.com/rklemme/1040631

output.txt

#
# statistics
#

#<WorkQueue:0x10039da4
 @p=#<Proc:0x10032734>,
 @q=
  [#<Picker:0x10039d7c
    @items=[#<struct SRV domain="server-1", port=1234, priority=1, weight=1>],
    @total=1,

This file has been truncated. show original

srv.rb

# Picks any of given items based on weight
class Picker
  def initialize(items = nil, &weight)
    @total = 0
    @items = []
    @weight = weight
    items.each {|it| self << it} if items
  end

  def empty?

This file has been truncated. show original

And btw. you calculate the total weight every
time the method is invoked as sum of all entries while I maintain the
@total and adjust it only for every insertion and removal.

Right. I’m trying to improve that. However take into account that my
code does not need to create an instance. Instead it will be a class
method (or a module method like DNS::srv_randomize), so I cannot use
attributes (or I should not).

Well, that’s not exactly true: your code creates an Array (stored in
ordered_targets) so you could as well create another object.

But you are right, I must get removing the recursion. If you can prove
me that your code gets same results for 10000 iterations with same
input data I will adapt my code

see above

access to the same element of a hash, but I’ve realized that depending
on the case, it’s just more efficient to perform double access rather
than generating a new variable.

My remark has nothing to do whatsoever with avoiding double access to
a Hash. I was specifically talking about these:

11: if rnd < prio = entry[0]
106: printf “INFO: Time elapsed : %.4f seconds\n”, time_elapsed =
time_end - time_start

which are better written as

prio = entry[0]
if rnd < prio

time_elapsed = time_end - time_start
printf “INFO: Time elapsed : %.4f seconds\n”, time_elapsed

Much more readable and no performance difference other than maybe one
more access to a local variable which is negligible IMHO.

priority 1, weight 10, domain “server1”, port 5060

priority 1, weight 0, domain “server2”, port 5060

priority 1, weight 0, domain “server3”, port 5060

In this case, records 2 and 3 should always be chosen after record 1
(which has weight > 0). The order of records with weight 0 must be
random.

Weight 0 generally means “do not use”. Basically you change the
algorithm to also include items whose weight is 0. I am not sure I
would add that complexity. If someone uses weight 0 then he may
actually do that on purpose. If not then it’s a bug and should be
flagged accordingly (e.g. by raising an exception). Also: this will
also change weight of all other elements, because the probabilities
are skewed. I’d rather provide proper weights as inputs and avoid
illegal weights.

Really thanks a lot for your interest. It’s very helpful.

Good! These kinds of discussions are the ones I like being around.
Everybody learns something along the way.

Kind regards

robert

SSISSaki_Baz_C.SS_S · June 28, 2011, 2:21pm

Hi,

unfortunately I don’t have time to follow up in detail just now.

On Sat, Jun 25, 2011 at 7:31 PM, Iaki Baz C. [email protected]
wrote:

2011/6/24 Robert K. [email protected]:

I was not able to do it with a loop, neither writing the algorithm in
a paper, I always got bad statistical results.

That was certainly not an effect of the recursion. You probably
accidentally added another issue.

Yes, and still I’m looking for the introduced bug

Prioritized work queue with weighted items · GitHub
Yours:

INFO: Time elapsed : 2.5718 seconds
INFO: Resolution speed : 38883.8211 resolutions/second

Of course my code is much more ugly.

Can you post the exact testing code in a gist?

Right. I’m trying to improve that. However take into account that my
code does not need to create an instance. Instead it will be a class
method (or a module method like DNS::srv_randomize), so I cannot use
attributes (or I should not).

Well, that’s not exactly true: your code creates an Array (stored in
ordered_targets) so you could as well create another object.

Yes, but it’s just an Array allocation (using [] rather than
Array.new, which is much faster).

Probably true.

mean “do not use”, but “rarely select it at the first choice”:
Even then I would leave special treatment of this out of class
WorkQueue (in my code) because that is something specific to the use
case at hand. I’d probably adjust weights before handing instances
over to the WorkQueue (e.g. multiply with 100 and set to 1 if still
0).

http://tools.ietf.org/html/rfc2782:

Note also that the selection mechanism is detalied in the last paragraph.

I’ll leave that for later.

Really thanks a lot for your interest. It’s very helpful.

Good! These kinds of discussions are the ones I like being around.
Everybody learns something along the way.

Sure

Thanks a lot again.

You’re welcome!

Kind regards

robert

SSISSaki_Baz_C.SS_S · June 25, 2011, 7:32pm

2011/6/24 Robert K. [email protected]:

I was not able to do it with a loop, neither writing the algorithm in
a paper, I always got bad statistical results.

That was certainly not an effect of the recursion. You probably
accidentally added another issue.

Yes, and still I’m looking for the introduced bug

Prioritized work queue with weighted items · GitHub
Yes

However, I’ve compared efficience (with 100.000 iterations):

Mine:

INFO: Time elapsed : 0.9837 seconds
INFO: Resolution speed : 101656.6751 resolutions/second

Yours:

INFO: Time elapsed : 2.5718 seconds
INFO: Resolution speed : 38883.8211 resolutions/second

Of course my code is much more ugly.

Right. I’m trying to improve that. However take into account that my
code does not need to create an instance. Instead it will be a class
method (or a module method like DNS::srv_randomize), so I cannot use
attributes (or I should not).

Well, that’s not exactly true: your code creates an Array (stored in
ordered_targets) so you could as well create another object.

Yes, but it’s just an Array allocation (using [] rather than
Array.new, which is much faster).

if rnd < prio

time_elapsed = time_end - time_start
printf “INFO: Time elapsed : %.4f seconds\n”, time_elapsed

Much more readable and no performance difference other than maybe one
more access to a local variable which is negligible IMHO.

Right.

Weight 0 generally means “do not use”. Basically you change the
algorithm to also include items whose weight is 0. I am not sure I
would add that complexity. If someone uses weight 0 then he may
actually do that on purpose. If not then it’s a bug and should be
flagged accordingly (e.g. by raising an exception). Also: this will
also change weight of all other elements, because the probabilities
are skewed. I’d rather provide proper weights as inputs and avoid
illegal weights.

DNS SRV are explained in RFC 2782, and entries with weight 0 don’t
mean “do not use”, but “rarely select it at the first choice”:

http://tools.ietf.org/html/rfc2782:

Priority
The priority of this target host. A client MUST attempt to
contact the target host with the lowest-numbered priority it can
reach; target hosts with the same priority SHOULD be tried in an
order defined by the weight field. The range is 0-65535. This
is a 16 bit unsigned integer in network byte order.

Weight
A server selection mechanism. The weight field specifies a
relative weight for entries with the same priority. Larger
weights SHOULD be given a proportionately higher probability of
being selected. The range of this number is 0-65535. This is a
16 bit unsigned integer in network byte order. Domain
administrators SHOULD use Weight 0 when there isn’t any server
selection to do, to make the RR easier to read for humans (less
noisy). In the presence of records containing weights greater
than 0, records with weight 0 should have a very small chance of
being selected.

    In the absence of a protocol whose specification calls for the
    use of other weighting information, a client arranges the SRV
    RRs of the same Priority in the order in which target hosts,
    specified by the SRV RRs, will be contacted. The following
    algorithm SHOULD be used to order the SRV RRs of the same
    priority:

    To select a target to be contacted next, arrange all SRV RRs
    (that have not been ordered yet) in any order, except that all
    those with weight 0 are placed at the beginning of the list.

    Compute the sum of the weights of those RRs, and with each RR
    associate the running sum in the selected order. Then choose a
    uniform random number between 0 and the sum computed
    (inclusive), and select the RR whose running sum value is the
    first in the selected order which is greater than or equal to
    the random number selected. The target host specified in the
    selected SRV RR is the next one to be contacted by the client.
    Remove this SRV RR from the set of the unordered SRV RRs and
    apply the described algorithm to the unordered SRV RRs to select
    the next target host.  Continue the ordering process until there
    are no unordered SRV RRs.  This process is repeated for each
    Priority.

Note also that the selection mechanism is detalied in the last
paragraph.

Really thanks a lot for your interest. It’s very helpful.

Good! These kinds of discussions are the ones I like being around.
Everybody learns something along the way.

Sure

Thanks a lot again.

SSISSaki_Baz_C.SS_S · July 1, 2011, 12:47pm

2011/6/28 Robert K. [email protected]:

Of course my code is much more ugly.

Can you post the exact testing code in a gist?

Hi Robert, please let me some time to update it and will post it.

Thanks a lot.

How to order a hash based on its keys?

Results:

server-1: 50000 0 0 0 0 server-2-A: 0 28488 16302 5210 0 server-2-B: 0 7226 12245 30529 0 server-2-C: 0 14286 21453 14261 0 server-3: 0 0 0 0 50000

I get there correct results (columns mean position from 1 to 5):

Mine:

Yours:

server-1: 50000 0 0 0 0
server-2-A: 0 28488 16302 5210 0
server-2-B: 0 7226 12245 30529 0
server-2-C: 0 14286 21453 14261 0
server-3: 0 0 0 0 50000