Beyond YAML? (scaling)


#1

Hi,

I’ve been using YAML files to store hashes of numbers, e.g.,

{ [“O_Kc_01”] => [ 0.01232, 0.01212, 0.03222, … ], … }

This has worked wonderfully for portability and visibility
into the system as I’ve been creating it.

Recently, however, I’ve increased my problem size by orders
of magnitude in both the number of variables and the number
of associated values. The resulting YAML files are prohibitive:
10s of MBs big and requiring 10s of minutes to dump/load.

Where should I go from here?

Thanks,


#2

On May 3, 2007, at 8:50 AM, Bil K. wrote:

of magnitude in both the number of variables and the number
of associated values. The resulting YAML files are prohibitive:
10s of MBs big and requiring 10s of minutes to dump/load.

Where should I go from here?

Some random thoughts:

  • If they are just super straightforward lists of numbers like this a
    trivial flat file scheme, say with one number per line, might get the
    job done.
  • XML can be pretty darn easy to output manually and if you use
    REXML’s stream parser (not slurping everything into a DOM) you should
    be able to read it reasonably quick.
  • If you are willing to sacrifice a little visibility, you can always
    take the step up to a real database, even if it’s just sqlite. These
    have varying degrees of portability as well.
  • You might want to look at KirbyBase. (It has a younger brother
    Mongoose, but that uses binary output.)

Hope something in there helps.

James Edward G. II


#3

Bil K. wrote:

of magnitude in both the number of variables and the number
of associated values. The resulting YAML files are prohibitive:
10s of MBs big and requiring 10s of minutes to dump/load.

Where should I go from here?

Hey, Bil. If you don’t mind a couple of shameless plugs, you might want
to try KirbyBase or Mongoose.

KirbyBase should be faster than YAML and it still stores the data in
plain text files, if that is important to you.

Mongoose is faster than KirbyBase, at the expense of the data not being
stored as plain text.

I don’t know if either will be fast enough for you.

HTH,

Jamey C.


#4

On Thu, 3 May 2007, Bil K. wrote:

I’ve been using YAML files to store hashes of numbers, e.g.,

{ [“O_Kc_01”] => [ 0.01232, 0.01212, 0.03222, … ], … }

of magnitude in both the number of variables and the number
of associated values. The resulting YAML files are prohibitive:
10s of MBs big and requiring 10s of minutes to dump/load.

Where should I go from here?

I guess that depends on whether you need the files to be easily readable
or not. If you don’t, Marshal will be faster than YAML.

Kirk H.


#5

Bil K. wrote:

removed_email_address@domain.invalid wrote:

I guess that depends on whether you need the files to be easily
readable or not. If you don’t, Marshal will be faster than YAML.

At this point, I’m looking for an easy out that will
reduce size and increase speed, and I’m willing to
go binary if necessary.

What about mmap and pack/unpack, as ara does in

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/175944

?


#6

William J. wrote:

Would making a copy of the hash use too much
memory or time?

I don’t know, but that’s surely another way out of
the Marshal-hash-proc trap…

Thanks,


#7

Bil K. wrote:

migrating to Marshal seems to be the Simplest Thing
That Could Possibly Work.

Well, maybe not so simple…

`dump’: can’t dump hash with default proc (TypeError)

which seems to be due to the trick I learned from zenspider
and drbrain to quickly setup a hash of arrays:,

Hash.new{ |hash,key| hash[key]=[] }

Later,


#8

On Thu, May 03, 2007 at 10:50:06PM +0900, Bil K. wrote:

10s of MBs big and requiring 10s of minutes to dump/load.

Where should I go from here?

Use a SQL database?

It all depends what sort of processing you’re doing. If you’re adding to
a
dataset (rather than starting with an entirely fresh data set each
time),
having a database makes sense. If you’re doing searches across the data,
and/or if the data is larger than the available amount of RAM, then a
database makes sense. If you’re only touching small subsets of the data
at
any one time, then a database makes sense.

Put it another way, does your processing really require you to read the
entire collection of objects into RAM before you can perform any
processing?

If it does, and your serialisation needs are as simple as it appears
above,
then maybe something like CSV would be better.

O_Kc_01,0.01232,0.01212,0.03222,…

If the source of the data is another Ruby program, then Marshal will be
much
faster than YAML (but unfortunately binary).

You could consider using something like Madeleine:
http://madeleine.rubyforge.org/
This snapshots your object tree to disk (using Marshal by default I
think,
but can also use YAML). You can then make incremental changes and
occasionally rewrite the snapshot.

B.


#9

On Fri, May 04, 2007 at 01:25:05AM +0900, Bil K. wrote:

Bil K. wrote:

Hash.new{ |hash,key| hash[key]=[] }

Is there a better way than,

samples[tag] = [] unless samples.has_key? tag
samples[tag] << sample

Not exactly identical but usually good enough:

samples[tag] ||= []
samples[tag] << sample

And you can probably combine:

(samples[tag] ||= []) << sample


#10

Matt L. wrote:

You’re building an Orion? Please tell me it’s not true!

No; I should have been more specific…“our Orion vehicle /design/”. :slight_smile:

Later,


#11

On Thu, May 03, 2007 at 11:45:05PM +0900, Bil K. wrote:

removed_email_address@domain.invalid wrote:

I guess that depends on whether you need the files to be easily readable
or not. If you don’t, Marshal will be faster than YAML.

At this point, I’m looking for an easy out that will
reduce size and increase speed, and I’m willing to
go binary if necessary.

If you want to describe your data needs a bit, and what operations you
need to operate on it, I’ll be happy to play around with an ruby/sqlite3
program and see what pops out.

Since there’s no Ruby Q. this weekend, we all need something to work
on :-).

enjoy,

-jeremy


#12

On May 4, 2007, at 12:35 AM, Bil K. wrote:

Brian C. wrote:

Use a SQL database?

I always suspect that I should be doing that more often,
but as my experience with databases is rather limited
and infrequent, I always shy away from those as James
already knows. Regardless, I should probably overcome
my aggressive incompetence one day!

Don’t be afraid of the database solution. In the long term, it is
much more scalable and will pay dividends immediately.
MySQL and PostgreSQL are both pretty fast and scalable, but if you
have a large data set, you certainly do need to plan a schema
carefully, but it should be somewhat similar to your existing data
structures anyway.
the database APIs in Ruby are pretty simple.


#13

Robert D. wrote:

      user     system      total        real

yaml 0.590000 0.000000 0.590000 ( 0.754097)
json 0.020000 0.000000 0.020000 ( 0.018363)

Looks promising, n’est-ce pas?

Neat. Thanks for the data. For now though, Marshal
is an adequate alternative.

Regards,


#14

Okay, so I didn’t get to it this weekend, but it is an interesting
project, but can you explain a bit more on the data requirments? I have
some questions inlined.

On Sat, May 05, 2007 at 11:20:05AM +0900, Bil K. wrote:

sensitivity analysis[2] on some of the simulation codes used for our

  1. Prepare a “sufficiently large” number of cases, each with random
    variations of the input parameters per the tolerance DSL markup.
    Save all these input variables and all their samples for step 5.

So the DSL generates a ‘large’ number of cases listing the input
parameters and their associated values. The list of input parameters
static across all cases, or a set of cases?

That is, for a given “experiment” you have the same set of parameters,
just with a “large” numver of different values to put in those
parameters.

  1. Run all the cases.
  2. Collect all the samples of all the outputs of interest.

I’m also assuming that the output(s) for a given “experiment” would be
consistent in their parameters?

  1. Compute running history of the output statistics to see
    if they have have converged, i.e., the “sufficiently large”
    guess was correct – typically a wasteful number of around 3,000.
    If not, start at step 1 again with a bigger number of cases.

So right now you are doing say for a given experiment f:

inputs  : i1,i2,i3,...,in
outputs : o1,o2,o3,...,0m

run f(i1,i2,i3,…,in) -> [o1,…,on] where the values for i1,…,in are
“jiggled” And you have around 3,000 diferent sets of inputs.

5 and 6.
[…]

[5] The current system consists of 5 Ruby codes at ~40 lines each
plus some equally tiny library routines.

Would you be willing to share these? I’m not sure if what I’m assuming
about your problem is correct or not, but it intrigues me and I’d like
to fiddle with the general problem :-). I’d be happy to talk offlist
too.

enjoy,

-jeremy


#15

Jamey C. wrote:

Hey, Bil.

Hi.

KirbyBase should be faster than YAML and it still stores the data in
plain text files, if that is important to you.

Plain text will be too big – I’ve got an n^2 problem.

Mongoose is faster than KirbyBase, at the expense of the data not being
stored as plain text.

Sounds intriguing, but where can I find some docs? So far, I’m
coming up empty…

Regards,


#16

On Sat, 5 May 2007, Bil K. wrote:

I’ve created a small tolerance DSL, and coupled with the Monte Carlo
Method[1] and the Pearson Correlation Coefficient[2], I’m performing
sensitivity analysis[2] on some of the simulation codes used for our
Orion vehicle[3]. In other words, jiggle the inputs, and see how
sensitive the outputs are and which inputs are the most influential.

You’re building an Orion? Please tell me it’s not true!

– Matt
It’s not what I know that counts.
It’s what I can remember in time to use.


#17

On Mon, May 07, 2007 at 08:15:05PM +0900, Bil K. wrote:

Here, the 2nd realization of the output, ‘stag_pr’, 108.02 corresponds
to the 2nd case and is associated with the 2nd entries in the ‘F_x’
and ‘q_r’ arrays, 1.12 and 3.89e+8, respectively.

Yeah, that’s what explains it best for me.

I’m not sure if what I’m assuming
about your problem is correct or not, but it intrigues me and I’d like
to fiddle with the general problem :-).

The more I explain it, the more I learn about it; so thanks
for the interest.

I haven’t forgotten about this, I have a couple of ways to manage it
with SQLite but how you want to deal with the data after running all the
cases could influence the direction.

You said you want to test for convergence after cases are run?
Basically you want to do some calculations using the inputs and outputs
after each case ( or N cases ) and save those calculations off to the
side until they reach some error/limit etc? For this do you want to
just record “After case N my running calculations(f,g,h) over the cases
run so
far are x,y,z”

That is something like:

Case    Running calc f, Running calc g, Running calc h
   1        1.0         2.0             3.0
   10       2.0         4.0             9.0
   ....

Also, do you do any comparison between experiments? That is, for one
scenario that you have a few thousand cases for with your inputs that
are jiggled; would you do anything with those results in relation to
some other scenario? Or are all scenario/experiments isolated?

enjoy,

-jeremy


#18

Jeremy H. wrote:

So the DSL generates a ‘large’ number of cases listing the input
parameters and their associated values. The list of input parameters
static across all cases, or a set of cases?

The input parameter names are static across all cases, but
for each case, the parameter value will vary randomly according
to the tolerance DSL, e.g., 1.05+/-0.2. Currently, I have all
these as a hash of arrays, e.g.,

{ ‘F_x’ => [ 1.23, 1.12, 0.92, 1.01, … ],
‘q_r’ => [ 1.34e+9, 3.89e+8, 8.98e+8, 5.23e+9, … ], … }

where 1.23 is the sample for input parameter ‘F_x’ for the
first case, 1.12 is the sample for the second case, etc.,
and 1.34e+9 is the sample for input parameter ‘q_r’ for
the first case, and so forth.

That is, for a given “experiment” you have the same set of parameters,
just with a “large” numver of different values to put in those
parameters.

Yes, if I understand you correctly.

  1. Run all the cases.
  2. Collect all the samples of all the outputs of interest.

I’m also assuming that the output(s) for a given “experiment” would be
consistent in their parameters?

Yes, the output parameter hash has the same structure as the input
hash, although it typically has fewer parameters. The number
and sequence of values (realizations) for each output parameter,
however, corresponds exactly to the array of samples for each input
parameter. For example, the outputs hash may look like,

{ ‘heating’ => [ 75.23, 76.54, … ],
‘stag_pr’ => [ 102.13, 108.02, … ], … }

Here, the 2nd realization of the output, ‘stag_pr’, 108.02 corresponds
to the 2nd case and is associated with the 2nd entries in the ‘F_x’
and ‘q_r’ arrays, 1.12 and 3.89e+8, respectively.

So right now you are doing say for a given experiment f:

inputs  : i1,i2,i3,...,in 
outputs : o1,o2,o3,...,0m

run f(i1,i2,i3,…,in) -> [o1,…,on] where the values for i1,…,in are
“jiggled” And you have around 3,000 diferent sets of inputs.

Yes, where ‘inputs’ and ‘outputs’ are vectors m and k long,
respectively; so you have a matrix of values, e.g.,

input1 : i1_1, i1_2, i1_3, …, i1_n
input2 : i2_1, i2_2, i2_3, …, i2_n
. . . . . .
. [ m x n matrix ] .
. . . . . .
inputj : im_1, im_2, im_3, …, im_n

output1 : o1_1, o1_2, o1_3, …, o1_n
output1 : o2_1, o2_2, o2_3, …, o2_n
. . . . . .
. [ k x n matrix ] .
. . . . . .
output1 : ok_1, ok_2, ok_3, …, ok_n

Would you be willing to share these?

/I/ am willing, but unfortunately I’m also mired in red tape.

I’m not sure if what I’m assuming
about your problem is correct or not, but it intrigues me and I’d like
to fiddle with the general problem :-).

The more I explain it, the more I learn about it; so thanks
for the interest.

Regards,


#19

On 5/4/07, Bil K. removed_email_address@domain.invalid wrote:


Bil K.
http://fun3d.larc.nasa.gov

I dunno how safe it is to rely on this behavior, but calling
Hash#default=
strips the proc and makes the hash marshallable:

h = Hash.new { |h,k| h[k] = [] }
=> {}

h[“a”]
=> []

h
=> {“a”=>[]}

h[“b”] = 7
=> 7

h
=> {“a”=>[], “b”=>7}

h.default = nil
=> nil

h
=> {“a”=>[], “b”=>7}

Marshal.dump h
=> “\004\b{\a”\006a[\000"\006bi\f"

of course this means you won’t get your default proc back when you load
the
hash and mutate it. However:
db = Marshal.load( from_disk )
delta = Hash.new { |h, k| h[k] = if db.has_key? k then db[k] else [] end
}
delta[“foo”] = “add a new value”
delta[“bar”] << “update something in the original”
… more of the same …
to_disk = Marshal.dump( db.merge!( delta ) )


#20

On May 3, 10:11 am, Bil K. removed_email_address@domain.invalid wrote:

Of the answers I’ve seen so far (thanks everyone!),
Hash.new{ |hash,key| hash[key]=[] }

Later,

Bil K.http://fun3d.larc.nasa.gov

Would making a copy of the hash use too much
memory or time?

h=Hash.new{ |hash,key| hash[key]=[] }

h[‘foo’] << 44
h[‘foo’] << 88

h_copy = {}
h.each{|k,v| h_copy[k] = v}
p h_copy