Q: what database would you suggest?

Dobai-Pataky_BSSSSl · December 5, 2010, 7:16pm

Hi,
I am doing some scientific work in Ruby. So far the data is stored
into a massive file (just Marshalled or YAMLed) and then once the run
is over it is loaded back into memory for some data analysis.

This (not very elegant method) has worked so far but as I will need to
raise the scale to several GBs the time lost in loading and slicing
the data in memory is getting painful. Hence I thought I could take
advantage of the many DB that exist for Ruby. Alas… they are so
many, and I’ve got no database experience, so I am not sure what would
be best.

The data can be thought of as hierarchical tables: i.e. tables in the
SQL sense with some entries corresponding to entire new tables, and
the others either strings or numbers (integers and float). I have only
access to one computer, though with multiple processors, so I don’t
have any need for distributing the data. Finally I’d like speed, but
given the database size I can’t keep all the info in memory.

The kind of operations I need to do involve getting slices of tables
(all the rows but a subset of columns), often in all the “subtables”
I’ve got. I also need to add a few rows to each table when the
simulation is running (but not for analysis) however the rate is
rather low so I don’t expect this to be much of a constraint.

Could people with more DB experience give a few suggestions? I’ve read
of MongoDB, CouchDB, MySQL and PostgreSQL… and to be honest I am
pretty lost, so I am sending this in the hope some fellow Rubyist can
help a poor DB newbie.

Thank you to all in advance.
Diego V.

diego_v · December 5, 2010, 7:38pm

On Sun, Dec 5, 2010 at 7:15 PM, Diego V.
[email protected] wrote:

I’ve got. I also need to add a few rows to each table when the
simulation is running (but not for analysis) however the rate is
rather low so I don’t expect this to be much of a constraint.

Could people with more DB experience give a few suggestions? I’ve read
of MongoDB, CouchDB, MySQL and PostgreSQL… and to be honest I am
pretty lost, so I am sending this in the hope some fellow Rubyist can
help a poor DB newbie.

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

The key difference of SQLite to things like MySQL, PostgreSQL, and
others is that SQLite is not a server, in that it doesn’t offer remote
access, which you say you don’t need, so that is not an issue.

Additionally, you have no overhead of administering (or at least
setting up) a database server and its users, and SQLite is lighter on
your resources.

If you have data that doesn’t lend itself well to storing them in the
rather rigid confines of SQL, take a look at the NoSQL DBs like
CouchDB or MongoDB. All of those (and the SQL engines) have good to
great Ruby bindings, so you can use them with relative ease.

[0] http://sqlite.org/

–
Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

diego_v · December 5, 2010, 7:52pm

SQLite sounds like a good start as Phillip G. says. It’s widely
used, simple and straightforward. I doubt you’ll have a problem with
size - see this:
http://www.sqlite.org/limits.html

diego_v · December 5, 2010, 9:11pm

On 05.12.2010 19:52, Mike S. wrote:

SQLite sounds like a good start as Phillip G. says. It’s widely
used, simple and straightforward. I doubt you’ll have a problem with
size - see this:
Implementation Limits For SQLite

I don’t have benchmarks handy but for really large volumes I’d turn to
PostgreSQL. Especially if the ability is needed to manually manipulate
the data and do backup, parallel accesses and the like.

Kind regards

robert

diego_v · December 5, 2010, 11:34pm

On Sun, Dec 5, 2010 at 10:30 PM, Diego V.
[email protected] wrote:

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

Thanks, I’ll start with this then.

Any suggestion on the ruby wrapper? It seems that DataMapper is pretty
good and “hides” some of the complexities of SQL. But I’ve heard of
Sequel as well, and some closer to the SQL code such as ruby-sqlite.

Which one would you suggest to start?

Without knowing your situation: I’d use sqlite3-ruby, or DBI (still
allows you to access different DB engines without performing too much
magic behind the scenes). From what you wrote, it doesn’t look like an
ORM (like DataMapper, or ActiveRecord) would be a good fit. And an ORM
might make things you want to do (like analysis of your data) more
difficult that it is necessary.

If you are really unsure: Grab what you find interesting, and do a
little prototyping with a reduced, somewhat simplified dataset that
you’ll be working with. Once you have something that you are happy
with, stick to that, instead of having to discard half your work
because you start hating the way your chosen tool behaves.

–
Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

diego_v · December 5, 2010, 10:36pm

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

Thanks, I’ll start with this then.

Any suggestion on the ruby wrapper? It seems that DataMapper is pretty
good and “hides” some of the complexities of SQL. But I’ve heard of
Sequel as well, and some closer to the SQL code such as ruby-sqlite.

Which one would you suggest to start?

Thank you again
Diego

diego_v · December 6, 2010, 12:49am

On Sunday, December 05, 2010 03:30:18 pm Diego V. wrote:

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

Thanks, I’ll start with this then.

Any suggestion on the ruby wrapper? It seems that DataMapper is pretty
good and “hides” some of the complexities of SQL. But I’ve heard of
Sequel as well, and some closer to the SQL code such as ruby-sqlite.

Which one would you suggest to start?

Whichever you like – the important thing is to start.

Personally, I’d suggest DataMapper, mostly because it’s likely to make
your
application least bound to SQL at all (let alone SQLite) if you decide
to
change in the future.

One thing to be aware of with SQLite is that it essentially locks the
entire
database for any operations. This may affect you if your code is
sufficiently
multithreaded (and you use a Ruby which supports this, like JRuby) or if
you
intend to launch multiple worker processes. In either case, you may want
to
swap SQLite out for something else eventually.

But especially if you’re building on something like DM, you can easily
swap
SQLite out for something else in the future.

diego_v · December 5, 2010, 11:58pm

On Sun, Dec 5, 2010 at 4:30 PM, Diego V.
[email protected] wrote:

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

Thanks, I’ll start with this then.

Any suggestion on the ruby wrapper? It seems that DataMapper is pretty
good and “hides” some of the complexities of SQL. But I’ve heard of
Sequel as well, and some closer to the SQL code such as ruby-sqlite.

Which one would you suggest to start?

IMHO you should definitely have a look at the latest Rails 3 version
of active record which uses arel GitHub - rails/arel: A Relational Algebra

Active record can be used separately from Rails, it always could but
Rails 3 makes it easier to pick particular parts of Rails then ever.

–
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Github: rubyredrick (Rick DeNatale) · GitHub
Twitter: @RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

diego_v · December 6, 2010, 1:46am

On Dec 5, 2010, at 5:48 PM, David M. wrote:

One thing to be aware of with SQLite is that it essentially locks the entire
database for any operations.

That’s not very accurate. SQLite has a very sophisticated system of
staged locks and it’s definitely possible to have multiple clients
reading data at once.

James Edward G. II

diego_v · December 6, 2010, 1:32am

On 12/5/2010 1:15 PM, Diego V. wrote:

Could people with more DB experience give a few suggestions? I’ve read
of MongoDB, CouchDB, MySQL and PostgreSQL… and to be honest I am
pretty lost, so I am sending this in the hope some fellow Rubyist can
help a poor DB newbie.
my recommendation would be postgresql. If you don’t want to deal with
postgresql initially, then sqlite3.
Migrating from sqlite3 to postgresql should be pretty straightforward if
you desire/need to do so later. If
you go with an ORM, make sure that it supports the ability for YOU to
write sql queries if you want to –
ORM’s may not build the most appropriate SQL query for YOUR database
schema, this may be the difference
between an acceptable/unacceptable query response time.

diego_v · December 6, 2010, 8:30pm

On Mon, Dec 6, 2010 at 1:48 AM, David M. [email protected]
wrote:

One thing to be aware of with SQLite is that it essentially locks the entire
database for any operations. This may affect you if your code is sufficiently
multithreaded (and you use a Ruby which supports this, like JRuby) or if you
intend to launch multiple worker processes. In either case, you may want to
swap SQLite out for something else eventually.

I learned that the hard way. I innocently tried to fork a process that
used sqlite and the whole thing fell apart in not so obvious ways.
Many hours were wasted recovering from that one. So yes, safe sqlite
equals single process.

Regards,
Ammar

diego_v · December 6, 2010, 8:49pm

On Dec 6, 2010, at 12:48 AM, David M. wrote:

least bound to SQL at all (let alone SQLite) if you decide to
change in the future.

“Prepare Now For Possible Future Head Transplant”

One thing to be aware of with SQLite is that it essentially locks the entire
database for any operations.

Hmmm?!? Not quiet.

http://www.sqlite.org/lockingv3.html
http://www.sqlite.org/wal.html

diego_v · December 7, 2010, 12:04am

Berkley DB, for sure.

diego_v · December 7, 2010, 12:24am

On Dec 7, 2010, at 12:04 AM, Eugeni A. wrote:

Berkley DB, for sure.

Interestingly enough, the latest, greatest Berkeley DB seems to offer
“SQLite-compatible SQL API”:

http://www.oracle.com/us/corporate/press/063695

diego_v · December 6, 2010, 11:53pm

On Monday, December 06, 2010 01:48:18 pm Petite A. wrote:

On Dec 6, 2010, at 12:48 AM, David M. wrote:

least bound to SQL at all (let alone SQLite) if you decide to
change in the future.

“Prepare Now For Possible Future Head Transplant”
http://database-programmer.blogspot.com/2010/11/prepare-now-for-possible-fu
ture-head.html

I don’t have time to do a point-by-point rebuttal, so for now:

“So what about that weird title involving head transplants? Obviously a
head
transplant is impossible, making it also very unlikely, besides being
silly
and non-sensical. It came to mind as a kind of aggregrate of all of the
bizarre and unrealistic ideas about abstracting data designs that I have
heard
over the years.”

Huh?

Having something that’s at least abstract enough to deal with multiple
databases is something I have actually used. That is, as a matter of
convenience, I’ve actually developed applications using SQLite and
deployed
them to MySQL.

And it’s not always just SQL.

I’ve also migrated models from one application to another – that is, I
took a
model which used to be internal to both applications A and B, along with
code
that made assumptions about that model, and made it live in application
A,
with application B accessing it via REST.

So, if you’re keeping track, that’s the same model, used for both REST
and
SQL. Far from impossible, silly, or unlikely, the fact that we were
using
abstractions which allowed for this possibility increased the likelihood
that
we’d actually do it.

It’s also possible the article misses one of the main points of an ORM.
Even
if I were writing a SQL-only application, I’d have to be insane to use
raw SQL
to deal with it when I can have 90% of the work done for me by the ORM
– that
is, 90% of what any application does is trivial CRUD, and there’s no
reason I
should have to rewrite that every time.

One thing to be aware of with SQLite is that it essentially locks the
entire database for any operations.

Hmmm?!? Not quiet.

File Locking And Concurrency In SQLite Version 3
Write-Ahead Logging

Yeah, reading the 3.0 summary confirms all writes are an exclusive lock
over
the entire database. My bad, that is different than locking for all
writes,
but it’s still something you want to be aware of during the design
phase,
especially if someone follows your advice and doesn’t abstract
sufficiently
that they can easily migrate away later.

diego_v · December 7, 2010, 11:07am

On Dec 7, 7:39pm, Eugeni A. [email protected] wrote:

oh, i didn’t knew that they implemented SQL API.

–
Posted viahttp://www.ruby-forum.com/.

For my two cents I am fairly database illiterate the only thing I have
read a decent amount about is normalisation of data.

I am using sequel and Mysql, I chose Mysql because it was popular wide
range of documnetation and community support but mostly it has the
MySQL workbench which will allow me to see graphically what i am doing
which being a beginner and only having used Access 2000 & 2007 prior
was a plus.You can turn networking off in MySQL setup to setup a
single developer PC easily.

There is also a very helpful sequel group if your having any issues
http://groups.google.com/group/sequel-talk and again enough docs for a
beginner here http://sequel.rubyforge.org/

I am a database no nothing and sequel and mysql work for me. And
because its fairly easy I am learning more…

diego_v · December 8, 2010, 6:06am

Diego V. [email protected] writes:

be best.
I’ve got. I also need to add a few rows to each table when the
simulation is running (but not for analysis) however the rate is
rather low so I don’t expect this to be much of a constraint.

Could people with more DB experience give a few suggestions? I’ve read
of MongoDB, CouchDB, MySQL and PostgreSQL… and to be honest I am
pretty lost, so I am sending this in the hope some fellow Rubyist can
help a poor DB newbie.

Thank you to all in advance.
Diego V.

From what you’ve described, I recommend you spend some time looking into
MongoDB. I’ve been using relational databases for decades, and I’ve
recently developed a couple applications using MongoDB. It’s fast,
handles hierarchical data well, has a powerful query language, good
indexing, etc.

One of my projects has over 40 million documents, and MongoDB is
handling it much better than postgres or mysql.

It does seem to take extra disk space than postgres (my typical rdbms),
but I accept that penalty for the increase in speed and flexibility in
programming.

diego_v · December 7, 2010, 9:34am

oh, i didn’t knew that they implemented SQL API.