Beyond YAML? (scaling)

Bil K. wrote:

Mongoose is faster than KirbyBase, at the expense of the data not
being stored as plain text.

Sounds intriguing, but where can I find some docs? So far, I’m
coming up empty…

Docs are light compared to KirbyBase. If you download the
distribution, there is the README file, some pretty good examples in the
aptly named “examples” directory, and unit tests in the “tests”
directory.

HTH,

Jamey

Brian C. wrote:

Use a SQL database?

I always suspect that I should be doing that more often,
but as my experience with databases is rather limited
and infrequent, I always shy away from those as James
already knows. Regardless, I should probably overcome
my aggressive incompetence one day!

It all depends what sort of processing you’re doing. If you’re adding to a
dataset (rather than starting with an entirely fresh data set each time),
having a database makes sense.

At this point, I’m generating an entirely fresh data set
each time, but I can foresee a point where that will change
to an incremental model…

To put it another way, does your processing really require you to read the
entire collection of objects into RAM before you can perform any processing?

Yes, AFAIK, but I suppose there are algorithms that could
compute statistical correlations incrementally.
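
For what it’s worth, a running Pearson correlation only needs a handful
of sums per input/output pair, so it can be updated one sample at a time
without holding the whole data set in RAM. A minimal sketch (the sample
values at the bottom are made up):

# Incrementally updated Pearson correlation: keep six running sums
# and combine them on demand.
class RunningCorrelation
  def initialize
    @n = 0
    @sx = @sy = @sxx = @syy = @sxy = 0.0
  end

  def add(x, y)
    @n   += 1
    @sx  += x;     @sy  += y
    @sxx += x * x; @syy += y * y
    @sxy += x * y
  end

  def r
    num = @n * @sxy - @sx * @sy
    den = Math.sqrt((@n * @sxx - @sx**2) * (@n * @syy - @sy**2))
    den.zero? ? 0.0 : num / den
  end
end

corr = RunningCorrelation.new
corr.add(1.0, 2.1)
corr.add(2.0, 3.9)
corr.add(3.0, 6.2)
puts corr.r   # close to 1.0 for nearly linear data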

You could consider using something like Madeleine:
http://madeleine.rubyforge.org/
This snapshots your object tree to disk (using Marshal by default, I think,
but it can also use YAML). You can then make incremental changes and
occasionally rewrite the snapshot.
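
If it helps, from what I remember of Madeleine’s README the usage is
roughly like this (the command class, tag, and storage directory are
just illustrative):

require 'madeleine'

# Commands are logged to disk and replayed on restart.
class AddSample
  def initialize(tag, value)
    @tag, @value = tag, value
  end
  def execute(system)
    (system[@tag] ||= []) << @value
  end
end

madeleine = SnapshotMadeleine.new("samples_storage") { Hash.new }
madeleine.execute_command(AddSample.new("velocity", 42.0))
madeleine.take_snapshot   # occasionally rewrite the snapshot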

Probably not a good fit as I won’t change existing data,
only add new…

Thanks,

Jamey C. wrote:

Docs are light compared to KirbyBase. If you download the
distribution, there is the README file, some pretty good examples in the
aptly named “examples” directory, and unit tests in the “tests” directory.

Roger, I was afraid you’d say that. :slight_smile:

Could you please throw those up on your RubyForge webpage at some point?

Later,

On 5/7/07, Bil K. [email protected] wrote:
Bill, maybe you want to have a look at JSON
http://json.rubyforge.org/

I do not have time right now to benchmark the reading, but the writing
gives some spectacular results; look at this:

517/17 > cat test-out.rb && ruby test-out.rb

# vim: sts=2 sw=2 expandtab nu tw=0:

require 'yaml'
require 'rubygems'
require 'json'
require 'benchmark'

@hash = Hash[*(1..100).map{|l| "k_%03d" % l}.zip([*1..100]).flatten]

Benchmark.bmbm do |bench|
  bench.report( "yaml" ) { 50.times{ @hash.to_yaml } }
  bench.report( "json" ) { 50.times{ @hash.to_json } }
end

Rehearsal ----------------------------------------
yaml   0.630000   0.030000   0.660000 (  0.748123)
json   0.020000   0.000000   0.020000 (  0.079732)
------------------------------- total: 0.680000sec

           user     system      total        real
yaml   0.590000   0.000000   0.590000 (  0.754097)
json   0.020000   0.000000   0.020000 (  0.018363)

Looks promising, n’est-ce pas?

Maybe you want to investigate that a little bit more; JSON is of
course very readable, look, e.g., at this:

irb(main):002:0> require 'rubygems'
=> true
irb(main):003:0> require 'json'
=> true
irb(main):004:0> {:a => [*42..84]}.to_json
=>
"{\"a\":[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]}"

HTH
Robert

Hi,

Jeremy H. wrote:

If you want to describe your data needs a bit, and what operations you
need to perform on them, I’ll be happy to play around with a ruby/sqlite3
program and see what pops out.

I’ve created a small tolerance DSL, and coupled with the Monte Carlo
Method[1] and the Pearson Correlation Coefficient[2], I’m performing
sensitivity analysis[3] on some of the simulation codes used for our
Orion vehicle[4]. In other words, jiggle the inputs and see how
sensitive the outputs are and which inputs are the most influential.

The current system[5] works, and after the YAML->Marshal migration,
it scales well enough for now. The trouble is that the entire architecture
is wrong if I want to monitor the Monte Carlo statistics to see
if I can stop sampling, i.e., whether the statistics have converged.

The current system consists of the following steps:

  1. Prepare a “sufficiently large” number of cases, each with random
    variations of the input parameters per the tolerance DSL markup.
    Save all these input variables and all their samples for step 5.
  2. Run all the cases.
  3. Collect all the samples of all the outputs of interest.
  4. Compute a running history of the output statistics to see
    if they have converged, i.e., whether the “sufficiently large”
    guess was correct – typically a wasteful number of around 3,000.
    If not, start again at step 1 with a bigger number of cases.
  5. Compute normalized Pearson correlation coefficients for the
    outputs and see which inputs they are most sensitive to by
    using the data collected in steps 1 and 3.
  6. Lobby for experiments to nail down these “tall pole” uncertainties.

This system is plagued by the question of what counts as “sufficiently large”.
The next generation system would do steps 1 through 3 in small
batches, and at the end of each batch, check for the statistical
convergence of step 4. If convergence has been reached, shutdown
the Monte Carlo process, declare victory, and proceed with steps
5 and 6.
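
Roughly, the batch loop would look something like this sketch (the
stand-in random “case run” and the tolerance are made up; the real
version would run the simulation codes and watch all the outputs):

BATCH_SIZE = 50
TOLERANCE  = 0.001   # relative change in the running mean

samples = []
means   = []

loop do
  # steps 1-3 in miniature: run a small batch and collect one output
  BATCH_SIZE.times { samples << 10.0 + rand }   # stand-in for a case run
  means << samples.inject(0.0) { |sum, x| sum + x } / samples.size

  # step 4: stop once the running mean has settled down
  if means.size > 1
    change = (means[-1] - means[-2]).abs / means[-1].abs
    break if change < TOLERANCE
  end
end

puts "converged after #{samples.size} samples, mean = #{means.last}"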

I’m thinking this more incremental approach, combined with my lack of
database experience, would make a perfect match for Mongoose[6]…

Since there’s no Ruby Q. this weekend, we all need something to work
on :-).

:-)

Regards,

Bil K.

[1] Monte Carlo method - Wikipedia
[2] Pearson correlation coefficient - Wikipedia
[3] Sensitivity analysis - Wikipedia
[4] Crew Exploration Vehicle - Wikipedia
[5] The current system consists of 5 Ruby codes at ~40 lines each
plus some equally tiny library routines.
[6] http://mongoose.rubyforge.org/

Bil K. wrote:

Hash.new{ |hash,key| hash[key]=[] }

Is there a better way than,

samples[tag] = [] unless samples.has_key? tag
samples[tag] << sample

?
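
With the block form, the empty array gets created on first access, so
the has_key? check disappears (tag names below are made up):

samples = Hash.new { |hash, key| hash[key] = [] }
samples["alpha"] << 1.0   # no has_key? check needed
samples["alpha"] << 2.0
samples["beta"]  << 3.0
p samples   # {"alpha"=>[1.0, 2.0], "beta"=>[3.0]}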

Anyway, apart from Marshal not having a convenient
#load_file method like YAML, the conversion was
very painless: it dropped file sizes considerably
and brought run times down into the minutes category
instead of hours.
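
Marshal does read straight from an IO, though, so a load_file stand-in
is only a couple of lines (the helper names are just suggestions):

def marshal_load_file(path)
  File.open(path, "rb") { |f| Marshal.load(f) }
end

def marshal_dump_file(obj, path)
  File.open(path, "wb") { |f| Marshal.dump(obj, f) }
end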

Thanks,

[email protected] wrote:

I guess that depends on whether you need the files to be easily readable
or not. If you don’t, Marshal will be faster than YAML.

At this point, I’m looking for an easy out that will
reduce size and increase speed, and I’m willing to
go binary if necessary.

Of the answers I’ve seen so far (thanks everyone!),
migrating to Marshal seems to be the Simplest Thing
That Could Possibly Work.

Thanks,