This question may be so obvious that even Marnen won’t answer it…

I need to store a set of relatively complex objects in the db – in my

case, these are discrete cumulative distribution functions: they take a

long time to construct, but I want the lookup to be fast. Since the

lookup is essentially a binary search, a B-Tree or some variant would be

a sensible data structure.
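Concretely, the lookup I have in mind is just a binary search over sorted pairs. In plain Ruby terms (a sketch with made-up data; `Range#bsearch` does the O(log n) search):

```ruby
# Sketch: a discrete CDF as sorted [x, cumulative probability] pairs.
CDF = [[0.0, 0.1], [1.0, 0.4], [2.0, 0.8], [3.0, 1.0]]

def cdf_at(cdf, x)
  # bsearch (find-minimum mode) gives the index of the first point with
  # px > x; the point just before it carries the CDF value F(x).
  i = (0...cdf.size).bsearch { |j| cdf[j][0] > x }
  return cdf.last[1] if i.nil?   # x is past the last breakpoint
  i.zero? ? 0.0 : cdf[i - 1][1]  # x is before the first breakpoint
end
```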

I have choices: (a) I store entire YAML’d B-Trees, (b) I store simple

data structure (e.g. arrays of numeric pairs) and spend time

constructing a B-Tree from it, or (c) choose another approach that

someone on this list points out.

Can this forum offer any advice on size / speed tradeoffs for YAML’d

objects? If unpacking a YAML’d object is fast, then (a), storing the

entire B-Tree is probably the best approach. If slow, then perhaps I’m

better off with (b), storing a minimal data structure and reconstructing

the B-Tree when I need it.

Thoughts?

Fearless F. wrote in post #965610:

This question may be so obvious that even Marnen won’t answer it…

I can’t resist a challenge like that.

I need to store a set of relatively complex objects in the db – in my

case, these are discrete cumulative distribution functions: they take a

long time to construct, but I want the lookup to be fast. Since the

lookup is essentially a binary search, a B-Tree or some variant would be

a sensible data structure.

I have choices: (a) I store entire YAML’d B-Trees, (b) I store simple

data structure (e.g. arrays of numeric pairs) and spend time

constructing a B-Tree from it, or (c) choose another approach that

someone on this list points out.

How about (c)? Nested sets are generally a great way of storing arbitrary

trees in SQL databases for easy retrieval. The awesome_nested_set

plugin makes this easy to do in Rails.

If you’re unfamiliar with nested sets, I recommend

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html as a

good overview.
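The gist of the technique from that article, as a plain-Ruby sketch (names and numbers are mine; in Rails, awesome_nested_set maintains the lft/rgt columns for you):

```ruby
# Nested-set sketch: every node stores lft/rgt bounds assigned by a
# depth-first walk, so an entire subtree is one range check --
# in SQL: SELECT * FROM nodes WHERE lft > ? AND rgt < ?
NODES = [
  { name: "root", lft: 1, rgt: 8 },
  { name: "a",    lft: 2, rgt: 5 },
  { name: "a1",   lft: 3, rgt: 4 },
  { name: "b",    lft: 6, rgt: 7 },
]

def descendants(parent)
  NODES.select { |n| n[:lft] > parent[:lft] && n[:rgt] < parent[:rgt] }
end
```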

Or do I misunderstand? What’s the nature of the tree you’d like to

store?

Can this forum offer any advice on size / speed tradeoffs for YAML’d

objects? If unpacking a YAML’d object is fast, then (a), storing the

entire B-Tree is probably the best approach. If slow, then perhaps I’m

better off with (b), storing a minimal data structure and reconstructing

the B-Tree when I need it.

Speed may not be your immediate concern with YAML: the big problem with

that approach is that it bypasses the structure of the database and

makes the data very difficult to query. It’s a decent last resort, but

it *is* a last resort.
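That said, if you want to measure the YAML cost for yourself, a quick timing sketch is easy (the pair count and values here are made up; run it against data shaped like yours):

```ruby
require 'yaml'
require 'benchmark'

# Round-trip 10,000 numeric pairs through YAML and time the load,
# which is the cost you'd pay on every lookup under option (a).
pairs = Array.new(10_000) { |i| [i.to_f, i * 0.0001] }
dumped = YAML.dump(pairs)
load_time = Benchmark.realtime { YAML.load(dumped) }
loaded = YAML.load(dumped)
```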

Thoughts?

Best,

Marnen Laibow-Koser

http://www.marnen.org

[email protected]

On 2 December 2010 06:28, Fearless F. [email protected] wrote:

I need to store a set of relatively complex objects in the db – in my

case, these are discrete cumulative distribution functions: they take a

long time to construct, but I want the lookup to be fast. Since the

lookup is essentially a binary search, a B-Tree or some variant would be

a sensible data structure.

Thoughts?

I’ve an inkling this is out of my comfort zone, but my initial

thoughts (feel free to totally disregard…) are:

a) serialize? or is that what you meant by storing a “YAML’d B-Tree”?

b) if these are complex objects that need to be stored in the DB, can

you not use a DB designed to store complex objects (like Mongo)?

c) Marnen’s nested set sounds like a potential good solution too

(given what I think I understand from your description…)

Fearless F. wrote in post #965610:

This question may be so obvious that even Marnen won’t answer it…

I need to store a set of relatively complex objects in the db – in my

case, these are discrete cumulative distribution functions: they take a

long time to construct, but I want the lookup to be fast. Since the

lookup is essentially a binary search, a B-Tree or some variant would be

a sensible data structure.

I have choices: (a) I store entire YAML’d B-Trees, (b) I store simple

data structure (e.g. arrays of numeric pairs) and spend time

constructing a B-Tree from it, or (c) choose another approach that

someone on this list points out.

Tough to call… What’s your intended usage scenario?

Depending on the resolution required from the CDF, could you just store

built distributions in a table and query against it?

With normalized inputs and the right indices, the query should be fast,

and trading off record count versus accuracy, you could always do a

linear interpolation between any two stored points. Need higher

interpolated accuracy? Then increase the resolution of stored points to

minimize the interpolated gap.
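In plain Ruby terms, the interpolation step between two stored points is just (a sketch; the numbers below are made up):

```ruby
# Linear interpolation between stored CDF points (x0, y0) and (x1, y1).
def interp(x, x0, y0, x1, y1)
  y0 + (x - x0) * (y1 - y0) / (x1 - x0)
end
```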

Ugh… I gave up the practice of sadistics years ago, and now you’re

making me remember it… Curse you Red Baron!

@marnen, @railsdog: I owe you a big thank you. Once I realized that

looking up the CDF was really just piecewise linear approximation, then

casting it into pure SQL was easy. In fact, it gave me a chance to test

out the new AREL-based query interface.

If anyone is interested, it’s surprisingly simple: Assume you store x0,

y0, and slope for each line segment like this:

```
ActiveRecord::Schema.define do
  create_table(:line_segments, :force => true) do |t|
    t.integer :table_id  # which CDF this segment belongs to
    t.float :x0          # x-coordinate where the segment starts
    t.float :y0          # CDF value at x0
    t.float :m           # slope of the segment
  end
end
```

The AREL-style query (replete with linear interpolation) is simply:

```
def lookup(x, table_id)
  seg = LineSegment.where(:table_id => table_id).
    where("x0 < ?", x).order("x0 DESC").limit(1).first
  seg.y0 + (x - seg.x0) * seg.m
end
```

If you insert “bumpers” at each end of your table (an entry with x0 =

LARGEST_NEGATIVE_FLOAT and an entry with x0 = max X), then you don’t

even need additional logic to handle out-of-bounds x values.
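In plain Ruby, the bumper trick looks like this (a sketch with made-up segments; `-Float::MAX` is Ruby's largest negative float, so the first row always matches):

```ruby
# Segments as [x0, y0, m] rows, sorted by x0, with bumpers at both ends.
SEGS = [
  [-Float::MAX, 0.0, 0.0],  # lower bumper: F(x) = 0 below the data
  [0.0, 0.0, 0.5],
  [2.0, 1.0, 0.0],          # upper bumper: F(x) stays 1 above max x
].freeze

def lookup(x)
  # find the last segment starting at or before x -- always succeeds
  x0, y0, m = SEGS.reverse_each.find { |sx0, _, _| sx0 <= x }
  y0 + (x - x0) * m
end
```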

Thanks for the nudge. I’m getting more comfortable thinking in SQL.

This may surprise the forum, but Marnen is right – it makes

sense to simply structure the CDFs as SQL queries.

I know how to write CDF functions in a procedural language but haven’t

yet attempted them using SQL, but @railsdog’s post reminded me that it

won’t be too hard (thank you).

CDF functions are easily implemented as tables of linear segments – I

can store the vertices and the slopes, so linear interpolation between

the breakpoints will be fast.
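Building the segment rows from sorted breakpoints is a one-liner (a sketch; the sample points are made up):

```ruby
# Turn sorted CDF breakpoints [[x, F(x)], ...] into [x0, y0, slope] rows,
# one row per adjacent pair -- ready to insert into the segments table.
def to_segments(points)
  points.each_cons(2).map do |(x0, y0), (x1, y1)|
    [x0, y0, (y1 - y0) / (x1 - x0)]
  end
end
```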

And – doh! – a Google search for “cumulative distribution function

sql” gives me a real jump start. Why didn’t I think of that sooner?

Thanks, y’all.