Persisting complex (and possibly large) objects

Dobai-Pataky_BSSSSl · December 2, 2010, 7:28am

This question may be so obvious that even Marnen won’t answer it…

I need to store a set of relatively complex objects in the db – in my
case, these are discrete cumulative distribution functions: they take a
long time to construct, but I want the lookup to be fast. Since the
lookup is essentially a binary search, a B-Tree or some variant would be
a sensible data structure.

I have choices: (a) I store entire YAML’d B-Trees, (b) I store simple
data structure (e.g. arrays of numeric pairs) and spend time
constructing a B-Tree from it, or © choose another approach that
someone on this list points out.

Can this forum offer any advice on size / speed tradeoffs for YAML’d
objects? If unpacking a YAML’d object is fast, then (a), storing the
entire B-Tree is probably the best approach. If slow, then perhaps I’m
better off with (b), storing a minimal data structure and reconstructing
the B-Tree when I need it.

Thoughts?

ff

fearless_fool · December 2, 2010, 4:17pm

Fearless F. wrote in post #965610:

This question may be so obvious that even Marnen won’t answer it…

I can’t resist a challenge like that.

I need to store a set of relatively complex objects in the db – in my
case, these are discrete cumulative distribution functions: they take a
long time to construct, but I want the lookup to be fast. Since the
lookup is essentially a binary search, a B-Tree or some variant would be
a sensible data structure.

I have choices: (a) I store entire YAML’d B-Trees, (b) I store simple
data structure (e.g. arrays of numeric pairs) and spend time
constructing a B-Tree from it, or (c) choose another approach that
someone on this list points out.

How about C? Nested sets are generally a great way of storing arbitrary
trees in SQL databases for easy retrieval. The awesome_nested_set
plugin makes this easy to do in Rails.

If you’re unfamiliar with nested sets, I recommend
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html as a
good overview.

Or do I misunderstand? What’s the nature of the tree you’d like to
store?

Can this forum offer any advice on size / speed tradeoffs for YAML’d
objects? If unpacking a YAML’d object is fast, then (a), storing the
entire B-Tree is probably the best approach. If slow, then perhaps I’m
better off with (b), storing a minimal data structure and reconstructing
the B-Tree when I need it.

Speed may not be your immediate concern with YAML: the big problem with
that approach is that it bypasses the structure of the database and
makes the data very difficult to query. It’s a decent last resort, but
it is a last resort.

Thoughts?

ff

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

fearless_fool · December 2, 2010, 4:25pm

On 2 December 2010 06:28, Fearless F. [email protected] wrote:

I need to store a set of relatively complex objects in the db – in my
case, these are discrete cumulative distribution functions: they take a
long time to construct, but I want the lookup to be fast. Since the
lookup is essentially a binary search, a B-Tree or some variant would be
a sensible data structure.

Thoughts?

I’ve an inkling this is out of my comfort zone, but my initial
thoughts (feel free to totally disregard…) are:

a) serialize? or is that what you meant by storing a “YAML’d B-Tree”?
b) if these are complex objects that need to be stored in the DB, can
you not use a DB designed to store complex objects (like Mongo)?
c) Marnen’s nested set sounds like a potential good solution too
(given what I think I understand from your description…)

fearless_fool · December 2, 2010, 5:48pm

Fearless F. wrote in post #965610:

This question may be so obvious that even Marnen won’t answer it…

I need to store a set of relatively complex objects in the db – in my
case, these are discrete cumulative distribution functions: they take a
long time to construct, but I want the lookup to be fast. Since the
lookup is essentially a binary search, a B-Tree or some variant would be
a sensible data structure.

I have choices: (a) I store entire YAML’d B-Trees, (b) I store simple
data structure (e.g. arrays of numeric pairs) and spend time
constructing a B-Tree from it, or © choose another approach that
someone on this list points out.

ff

Tough to call… What’s your intended usage scenario?

Depending on the resolution required from the CDF, could you just store
built distributions in a table and query against it?

With normalized inputs and the right indices, the query should be fast,
and trading off record count versus accuracy, you could always do a
linear interpolation between any two stored points. Need higher
interpolated accuracy? Then increase the resolution of stored points to
minimize the interpolated gap.

Ugh… I gave up the practice of sadistics years ago, and now you’re
making me remember it… Curse you Red Baron!

fearless_fool · December 3, 2010, 12:38am

@marnen, @railsdog: I owe you a big thank you. Once I realized that
looking up the CDF was really just piecewise linear approximation, then
casting it into pure SQL was easy. In fact, it gave me a chance to test
out the new AREL-based query interface.

If anyone is interested, it’s surprisingly simple: Assume you store x0,
y0, and slope for each line segment like this:

ActiveRecord::Schema.define do
  create_table(:lin_segs, :force => true) do |t|
    t.integer :tbl_id
    t.float :x0
    t.float :y0
    t.float :m
  end
end

The AREL-style query (replete with linear interpolation) is simply:

def lookup(x, table_id)
seg = LineSegment.where(:table_id => table_id).order(‘x0
DESC’).where(“x0 < #{x}”).limit(1).first
seg.y0 + (x - seg.x0) * seg.m
end

If you insert “bumpers” at each end of your table (an entry with x0 =
LARGEST_NEGATIVE_FLOAT and an entry with x0 = max X), then you don’t
even need additional logic to handle out-of-bounds x values.

Thanks for the nudge. I’m getting more comfortable thinking in SQL.

ff

fearless_fool · December 2, 2010, 7:58pm

This may surprise the forum, but Marnen is right – it makes
sense to simply structure the CDFs as SQL queries.

I know how to write CDF functions in a procedural language but haven’t
yet attempted them using SQL, but @railsdog’s post reminded me that it
won’t be too hard (thank you).

CDF functions are easily implemented as tables of linear segments – I
can store the vertices and the slopes, so linear interpolation between
the breakpoints will be fast.

And – doh! – a google search for “cumulative distribution function
sql” gives me a real jump start. Why didn’t I think of that sooner?

Thanks, y’all.

ff