Database-like Algorithmic Problem

hi all,

ok here’s the scoop. i have a collection of hashes all with an arbitrary
number of symbolic keys pointing to arbitrary objects. i have a template
hash also containing symbol keys pointing to ‘matching objects’, i.e.
objects to be matched with the objects in the aforementioned collection,
and i want to pull out the first match. an example:

please don’t psychoanalyze the below :slight_smile:

collection = [
{ :foo => ‘python’, :bar => ‘php’, :x => 69 },
{ :foo => ‘scheme’, :mama => ‘ruby’, :papa => ‘C’ },
{ :x => 386 },
{ :mama => ‘ruby’, :papa => ‘simula’, :lil_bro => ‘fortran’ }
]

template1 = { :mama => ‘ruby’, :papa => ‘C’ }
template2 = { :x => Numeric }

match_first(template1, collection)

should return: { :foo => ‘scheme’, :mama => ‘ruby’, :papa => ‘C’ }

match_all(template2, collection)

should return: [{:x => 386 }, {:foo => ‘python’, :bar => ‘php’, :x =>

69}]

so obviously === is being used internally as is evidenced by Numeric in
template2, but what kind of datastructures/algorithms should i be
looking at for the most efficient (time-wise) implementation, keeping in
mind, there may be many hashes in the collection and those hashes are
generally not very large, i.e. contain few key/value pairs. i’m not too
savy on the taxonomy of algorithms so any pointers to what i should be
looking at to read up on would be appreciated.

my potentially naive solution is something like the following (in full
color Parked at Loopia):

require ‘set’

@collection = {}

def add hash
hash.each do |key,value|
(@collection[key] ||= Set.new) << hash
end
end

def candidates template
sets = []
template.keys.each do |key|
sets << @collection[key] if @collection.include? key
end
if sets.empty?
return sets
else
return sets.inject(sets.pop) do |intersection, set|
intersection & set
end
end
end

def match template, hash
template.each do |key, match|
return false unless match === hash[key]
end
return true
end

def match_first template
candidates(template).each do |hash|
return hash if match(template, hash)
end
return nil
end

def match_all template
return candidates(template).select do |hash|
match(template, hash)
end
end

[
{ :foo => ‘python’, :bar => ‘php’, :x => 69 },
{ :foo => ‘scheme’, :mama => ‘ruby’, :papa => ‘C’ },
{ :x => 386, :papa => ‘clause’ },
{ :mama => ‘ruby’, :papa => ‘simula’, :lil_bro => ‘fortran’ },
{ :mama => 1946, :boo => ‘far’, :x => ‘x’}
].each do |hash|
add hash
end

puts match_first({ :mama => ‘ruby’, :papa => ‘C’ }).inspect
puts match_all({:foo => String}).inspect

thanks for reading this far!
_c

2007/12/18, Christophe M. [email protected]:

ok here’s the scoop. i have a collection of hashes all with an arbitrary
number of symbolic keys pointing to arbitrary objects. i have a template
hash also containing symbol keys pointing to ‘matching objects’, i.e.
objects to be matched with the objects in the aforementioned collection,
and i want to pull out the first match. an example:

please don’t psychoanalyze the below :slight_smile:

You are crazy. Please contact a doctor ASAP. Oh, sorry. :slight_smile:

match_first(template1, collection)
savy on the taxonomy of algorithms so any pointers to what i should be
looking at to read up on would be appreciated.

my potentially naive solution is something like the following (in full
color Parked at Loopia):

thanks for reading this far!

I think you are on a pretty good way. Here’s what I would do
differently:

  1. Do no recreate all those candidate collections for every
    match_first call. Instead, stuff everything into a class and provide a
    method that will create an /index/ of the data (see 2).

  2. I would use Hashes to store your indexes, i.e. instead of creating
    candidates all over again, create them once (see 1) and keep them
    there.

Kind regards

robert

You are crazy. Please contact a doctor ASAP. Oh, sorry. :slight_smile:

:slight_smile:

I think you are on a pretty good way. Here’s what I would do
differently:

thanks robert. yeah i’d like to do some caching, but as hashes are
constantly
added and removed from the collection, i’ll have to weigh up the
overhead of managing a cache, maybe even implement it and do some
profiling w/ and w/o. guess i’ll have to read up on cache algs.

regards,
_c

On Tue, 18 Dec 2007 10:54:54 -0500, Christophe M. wrote:

please don’t psychoanalyze the below :slight_smile:

Does it please you to believe that I please don’t psychoanalyze the
below?

cruiserdan wrote:

class DB

end

thanks for that, i benchmarked it and it is many times faster. now i
have to see how i’m going to do caching etc…

_c

Interesting problem. Aside from seeing a doctor, my suggestions are in
the comments in the revised variant below:


require ‘set’

encapsulate

class DB

use hash’s default syntax

def initialize
@c = Hash.new{ | hash, key | hash[key] = Set.new }
end

was: add, use << to follow ruby idiom for array, set

store val with hash for quicker comparison

def <<(hash)
hash.each do |key,val|
@c[key] << [ val, hash ]
end
end

need a delete operation …

def delete(hash)
@c.delete(hash)
end

was: match_all, use select to follow ruby enumerable idiom

determine candidates on the fly, avoid intersection code using Set

def select(template)
good = Set.new; bad = Set.new
template.each do |key,val1|
(@c[key]).each do |val2,hash|
( val1 === val2 ? good : bad ) << hash unless bad.include?
hash
end
end
return good.to_a
end

was: match_first, use select to follow ruby enumerable idiom

determine candidates on the fly, avoid intersection code using Set

def find(template)
good = Set.new; bad = Set.new
template.each do |key,val1|
(@c[key]).each do |val2,hash|
( val1 === val2 ? good : bad ) << hash unless bad.include?
hash
end
return good.to_a.first unless good.empty?
end
return nil
end
end


HTH!

Dan Y.
http://dev.zeraweb.com/

there’s a bug in yours. not sure where yet. try

d = DB.new
d << {:a => 2, :b => 1, :c => ‘b’}
puts d.find({ :a => Numeric, :b => ‘a’ }).inspect

returns {:a => 2, :b => 1, :c => ‘b’}

Christophe M. wrote:

cruiserdan wrote:

class DB

end

thanks for that, i benchmarked it and it is many times faster. now i
have to see how i’m going to do caching etc…

_c

oops, i read my benchmarks wrong, it is actually not that much faster

yours: 9.020000 0.940000 9.960000 ( 11.283543)
mine: 9.220000 1.380000 10.600000 ( 11.127394)

but is an improvement, so thanks. i’ll do some more comprehensive
benchmarks
and post them here in a bit.

Okay, here’s a fixed version.

There were 2 problems. The first was that I forgot to remove the bad
(unmatched) hashes before returning a result. The second was that you
have to go through every key in the template to find a match (I was
returning after the first match). Thus find is no more efficient than
select due to the way the data is stored.

I also took out the ‘unless bad.include? hash’ clause from the compare
because I think it actually is slower that way, depending on what you
are comparing. I thought it would be faster than the original version
posted because I don’t have to generate a set of candidates first, so
the initial benchmarks were disappointing. But maybe this version will
do even better. I look forward to seeing your results.


new version of select

def select(template)
good = Set.new; bad = Set.new
template.each do |key,val1|
(@c[key]).each do |val2,hash|
( val1 === val2 ? good : bad ) << hash # unless bad.include?
hash
end
end
return (good - bad).to_a
end

new version of find

def find(template)
r = select template
r.first unless r.empty?
end


Regards,
Dan