Should I use a database or a flat file?

I need to store some information with my Ruby program and I am not sure
what would be the best method. I’m mostly concerned about what would
be the most efficient use of CPU resources.

Basically, I will have a list of names each belonging to one of 5
categories. Sort of like this:

Cat1
-name1
-name2
-name3
-etc…

Cat2
-name4
-name5
-name6
-etc…

Cat3
-name7
-name8
-name9
-etc…

There will be hundreds of names, evenly divided between the categories.
But each name will go in only one category; there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an SQLite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.
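For illustration, the structure described above can be modeled as a plain hash and round-tripped through YAML (all category and name values here are placeholders, not the OP’s real data):

```ruby
require 'yaml'

# Placeholder data shaped like the OP's: category name => list of names.
data = {
  "Cat1" => ["name1", "name2", "name3"],
  "Cat2" => ["name4", "name5", "name6"],
}

yaml_text = YAML.dump(data)       # human-readable serialization
restored  = YAML.load(yaml_text)  # parse it back into the same hash

puts restored["Cat1"].inspect     # => ["name1", "name2", "name3"]
```

The daily rewrite then becomes a single `YAML.dump` to a file, and each read a single `YAML.load`.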

But ultimately it depends on what you want to do with the data.

yeah, it’s kinda hard to describe without just posting my entire script,
which I doubt people will want to read.

The data will be accessed by one Ruby script, running on one computer.
The data will be read in, then the file closed and done for a couple
hours. So no concurrent access, no relations, no keeping the connection
open for extended periods of time, which is why I thought a database
would probably be overkill and just add overhead.

But I didn’t know if maybe reading a file into memory would take more
effort than reading entries from a database. Also, I was a little off
on the numbers; I meant to say that there are hundreds of names per
category, so the total could be over a thousand names. That size will
likely never change beyond +/- 100 at the most.

Thanks for the info. I’m really a newb at this, so any thoughts on
storing data using any of these methods is helpful.

James.

2008/4/1, James D. [email protected]:

-name3
-name8
-name9
-etc…

There will be hundreds of names, evenly divided between the categories.

That’s not much. I’d probably use XML - but that also depends on what
generates the data and what needs to be able to read it. You can
efficiently generate it and read it (using a stream parser for
example, but that seems unnecessary for hundreds of names only).

But ultimately it depends on what you want to do with the data. In
some cases a DB might be a better choice. Also, if your volume is
going to increase dramatically etc.

But each name will go in only one category, there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an SQLite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.

YAML is another nice alternative because it is human-readable. And
you can use Marshal if the producer and consumer of the data are both
Ruby programs.

Kind regards

robert

Seems like the type of problem that YAML is perfect for.

James D. wrote:

-name3
-name8
flat text file of my own design, a YAML file, or an XML file.
IMHO, databases are best when you have concurrent access to data being
modified regularly and want to enforce constraints during concurrent
write accesses.

In your case, the data is mostly static and constraints are easily
handled outside the storage layer (you overwrite all data with another
consistent version in one pass). I’d advise using the simplest storage
method, which is probably a YAML dump of an object holding all this
data.

Marshal.dump/load is an option too. It may be faster than YAML, if this
matters to you (I’ve not benchmarked it, so you’d better do it yourself
if you need fast read/write). It’s not human-readable, though, which can
be a drawback when debugging.
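The benchmark suggested above might look like this (data sizes are invented to roughly match the OP’s ~1000 names in 5 categories; only the relative times matter):

```ruby
require 'benchmark'
require 'yaml'

# Build placeholder data shaped like the OP's: category => list of names.
data = {}
(1..5).each { |c| data["cat#{c}"] = (1..200).map { |n| "name#{n}" } }

# Compare full dump+load round trips for both serializers.
Benchmark.bm(8) do |b|
  b.report("yaml")    { 10.times { YAML.load(YAML.dump(data)) } }
  b.report("marshal") { 10.times { Marshal.load(Marshal.dump(data)) } }
end
```

On data this small, either is likely fast enough; Marshal usually wins on raw speed, YAML on debuggability.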

That was the code/integration complexity side of your problem.

For the performance side of the problem:

If you dump your data to a temporary file and then rename it to
overwrite the final destination, you can use a neat hack for long-running
processes needing fresh data: design a little cache that
checks the mtime of the backing store (the final destination) on read
accesses and reloads it when it changes.
mtime checks are cheap and simple to code, and if the need arises for
really high throughput you can minimize them by adding TTL logic.
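A minimal sketch of that scheme (class, method, and file names are made up for illustration): the writer dumps to a temp file and renames it into place, which is atomic on POSIX filesystems; the reader stats the file at most once per TTL and reloads only when the mtime changed.

```ruby
require 'yaml'

# Reader side: cache that re-checks the backing store's mtime at most
# once every `ttl` seconds, and reloads only when the file changed.
class FreshCache
  def initialize(path, ttl = 5)
    @path, @ttl = path, ttl
    @mtime = nil
    @checked_at = Time.at(0)
    @data = nil
  end

  def data
    now = Time.now
    if now - @checked_at >= @ttl        # rate-limit the stat() calls
      @checked_at = now
      mtime = File.mtime(@path)
      if mtime != @mtime                # backing store changed: reload
        @mtime = mtime
        @data = YAML.load_file(@path)
      end
    end
    @data
  end
end

# Writer side: dump to a temp file, then rename over the destination,
# so readers never see a half-written file.
def atomic_write(path, obj)
  tmp = path + ".tmp"
  File.open(tmp, "w") { |io| YAML.dump(obj, io) }
  File.rename(tmp, path)
end
```

Hypothetical usage: call `atomic_write("names.yml", data)` in the daily rewrite, and read through `FreshCache.new("names.yml").data` everywhere else.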

Lionel

2008/4/1, James D. [email protected]:

But ultimately it depends on what you want to do with the data.

yeah, it’s kinda hard to describe without just posting my entire script,
which I doubt people will want to read.

I found that plain English works best for anything that is longer than
a few lines. :-)

The data will be accessed by one ruby script, running on one computer.
The data will be read in, then the file closed and done for a couple
hours. So no concurrent access, no relations, no keeping the connection
open for extended periods of time, which is why I thought a database
would probably be overkill and just add overhead.

Yep.

But I didn’t know if maybe reading a file into memory would take more
effort than reading entries from a database. Also, I was a little off
on the numbers, I meant to say that there are hundreds of names per
category,

You did say that.

so total names could be over a thousand. That size will
likely never ever change beyond +/- 100 at the most.

1000 is a really modest number. I did a quick test, also for
illustration (attached).

17:51:23 /c/Temp
$ ./yam.rb
0.010 create
0.261 write
0.025 load
17:52:20 /c/Temp

Times in seconds

Thanks for the info. I’m really a newb at this, so any thoughts on
storing data using any of these methods is helpful.

You’re welcome.

Kind regards

robert

Just a quick change of the script to make the volume more realistic.

robert

On Tue, Apr 1, 2008 at 10:32 AM, James D. [email protected] wrote:

would probably be overkill and just add overhead.
James.
I’m going to slightly disagree with Lionel – and also Robert – on
this one. First of all, a database is not necessarily just for
concurrency. It’s for data integrity, and it lets you build
reports on that data that you can trust because of the strict nature
of the underlying data store (I’m talking about RDBMSs, but I’ve kept
my eyes open about OO databases as well; stay away from Pick,
though!!).

Here’s the problem with relational databases (RDBMSs), though: it’s
hard to model a hierarchy (which you can pull off somewhat clumsily
in XML).

If you are not going to do serious queries and inserts on the DB, and
your data isn’t complex, then a flat-file approach might work. It
works, after all, for software builds. I strongly recommend against
it in higher-level languages, though, even for small apps. And, no, I am
not a database vendor.

I always tell people they should learn SQL, but nowadays I’m getting a
cold shoulder, especially from OO people :-)

The other important thing that I’ve noticed about data and storage is:
what do you want to do with it and how often? Store it, query it (and
how), add to it, move it around, archive it, etc. These are important
factors to consider.

Todd

Oh wait, Lionel already suggested that.

Don’t forget: you could put the data into a hash and Marshal it to
disk. Not a DB, but better than a flat file!

Todd B. wrote:

I’m going to slightly disagree with Lionel – and also Robert – on
this one. First of all, a database is not necessarily just for
concurrency. It’s for data integrity

Yes, I agree (as explained below, concurrency is what I consider the main
problem to solve to enforce data integrity). That said, if you write your
data in one pass as the OP does, you don’t need data integrity in the storage
layer… rename is atomic: you either renamed the temp file to its
final position before a crash, or not.

The problems are partial updates, where you need to maintain consistency.
And off the top of my head, the only problems with partial updates are:

  • concurrent accesses (most common, counting both concurrent read and
    write accesses),
  • crashes (fortunately less common, and they can even be addressed by
    backups in many cases).

These are why I disagree with people wanting to push all the consistency
logic into the application layer on database-backed applications with
concurrent access (as often advocated for Rails). It’s simply not
doable without recoding the whole concurrent access manager and
log-based/MVCC/… crash resistance of the database in the application
layer (good luck with that).

Lionel.

On Tue, Apr 1, 2008 at 12:13 PM, Todd B. [email protected]
wrote:

-name3
-name2
see the potential pitfalls.

There certainly is a time and place for this, but I’ve found it
generally not that beneficial.

Todd

Sorry Lionel; missed the OP’s “But each name will go in only one
category”. I do still think it wouldn’t be that bad to use a DB.

Todd

On Tue, Apr 1, 2008 at 11:55 AM, Lionel B.
[email protected] wrote:

final position before a crash or not.
concurrent access (like often advocated for Rails). It’s simply not
doable without recoding the whole concurrent access manager and
log-based/MVCC/… crash resistance of the database in the application
layer (good luck with that).

Lionel.

Maybe we are talking about different things. By data integrity, I
mean you can be certain not just that the data was entered correctly,
but also that it coincides with the relationships present. In a
modified version of the OP’s model, for example…

Cat1
-name1
-name2
-name3
-etc…

Cat2
-name4
-name5
-name6
-etc…

Cat3
-name1
-name2
-name3
etc…

Note the same names, but now appearing in more than one category.

Now, surely, you can say, “Well, the application logic will take care
of that ambiguity.” But I say we should continue to separate
application logic from data logic.
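For what it’s worth, that application-level check is only a few lines (data and names here are invented to reproduce the ambiguity above): scan the structure for any name that appears under more than one category before writing it out.

```ruby
# Placeholder data with a deliberate cross-category duplicate.
data = {
  "Cat1" => ["name1", "name2"],
  "Cat3" => ["name1"],          # "name1" also appears in Cat1
}

seen  = {}   # name => first category it was seen in
dupes = []   # names found under more than one category
data.each do |cat, names|
  names.each do |n|
    dupes << n if seen[n] && seen[n] != cat
    seen[n] ||= cat
  end
end

puts dupes.inspect  # => ["name1"]
```

A database would enforce this with a uniqueness constraint instead; here the check runs once per daily rewrite.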

I’m no CS guy, so I don’t know the correct terms for this, but I do
see the potential pitfalls.

There certainly is a time and place for this, but I’ve found it
generally not that beneficial.

Todd

I was thinking that maybe the OP could use something like KirbyBase.
I’ve used it before, and it allows the code to stay very portable
because KirbyBase is just Ruby code.

You can locate it here:
http://rubyforge.org/projects/kirbybase

/Shawn

On Tue, Apr 1, 2008 at 11:11 AM, Joel VanderWerf
[email protected] wrote:

James D. wrote:

I need to store some information with my ruby program and I am not sure
on what would be the best method. I’m mostly concerned about what would
be the most efficient use of cpu resources.

One option is FSDB[1] (file-system database), with one file per
“category”, and each file stored as YAML. This scales as well as your
file system scales, is always human-readable, and should be fairly
efficient. (It’s thread and process safe too, not that it matters for
your app.)

For example:

require 'fsdb'
require 'yaml'

db = FSDB::Database.new "~/tmp/my_data"
db.formats = [FSDB::YAML_FORMAT] + db.formats

3.times do |i|
  db["Cat#{i}.yml"] = %w{
    name1
    name2
    name3
  }
end

path = "Cat1.yml"

puts "Here's the object:"
puts "=================="
p db[path]
puts "=================="
puts

puts "Here's the file:"
puts "=================="
puts File.read(File.join(db.dir, path))
puts "=================="
puts

and this is the output:

Here's the object:
==================
["name1", "name2", "name3"]
==================

Here's the file:
==================
---
- name1
- name2
- name3
==================
The dir structure looks like this:

[~/tmp] ls my_data
Cat0.yml Cat1.yml Cat2.yml

[1] http://redshift.sourceforge.net/fsdb

Wow, this has been a very good discussion. Feel free to keep
discussing, but, being as I’m the OP, I just thought I would let you
know that I think I will go with YAML for this case.

2008/4/1, Todd B. [email protected]:

data in one pass as the OP, you don’t need data integrity in the storage
layer… rename is atomic: you either renamed the temp file to its
final position before a crash or not.

Exactly. With regard to all that we’ve learned about the issue at
hand, a DB seems overkill here. KISS.

doable without recoding the whole concurrent access manager and
log-based/MVCC/… crash resistance of the database in the application
layer (good luck with that).

Totally agree - but this is another story.

Maybe we are talking about different things. By data integrity, I
mean you can be certain not just that the data was entered correctly,
but also that it coincides with the relationships present. In a
modified version of the OP’s model, for example…

Now, surely, you can say, “Well, the application logic will take care
of that ambiguity.” But I say we should continue to separate
application logic from data logic.

But the consistency needs to be /somewhere/ and if no database is
needed then enforcing it in app logic is certainly ok.

I’m no CS guy, so I don’t know the correct terms for this, but I do
see the potential pratfalls.

There certainly is a time and place for this, but I’ve found it
generally not that beneficial.

What is “this” in this paragraph?

Generally I do not think we’re far apart - if at all. Given the scale
of the problem and the apparent lack of future growth in size,
complexity and concurrency, a simple solution suffices IMHO.
Of course it’s good to know the options - that’s why we discuss here.

Kind regards

robert

A modified version of the script, since the other posting did not seem
to make it onto Usenet. This one has the consistency check as originally
required:

#!/usr/bin/env ruby

require 'set'
require 'yaml'

class CatNames
  def self.load(file_name)
    File.open(file_name) { |io| YAML.load(io) }
  end

  def save(file_name)
    File.open(file_name, "w") { |io| YAML.dump(self, io) }
  end

  def initialize
    @cat = {}  # category name => Set of names
    @all = {}  # name => the Set it belongs to
  end

  def add(cat, name)
    raise "Consistency Error" if @all[name]
    s = (@cat[cat] ||= Set.new)
    s << name
    @all[name] = s
  end

  def remove(cat, name)
    c = @cat[cat] and c.delete name
    @all.delete name
  end

  def clear
    @cat.clear
    @all.clear
  end

  def size
    @cat.inject(0) { |sum, (name, set)| sum + set.size }
  end
end

t = Time.now

d = CatNames.new

1000.times do |i|
  d.add("cat#{i % 10}", "name#{i}")
end

puts d.size

tt = Time.now
printf "%6.3f %s\n", tt - t, "create"
t = tt

d.save "test.yaml"

tt = Time.now
printf "%6.3f %s\n", tt - t, "write"
t = tt

d2 = CatNames.load "test.yaml"

tt = Time.now
printf "%6.3f %s\n", tt - t, "load"
t = tt

begin
  d2.add "foo", "name0"   # duplicate name: should raise
rescue Exception => e
  puts e
end


Todd B. wrote:
| On Tue, Apr 1, 2008 at 2:35 PM, Robert K.
| [email protected] wrote:
|
|> Exactly. With regard to all that we’ve learned about the issue at
|> hand a DB seems overkill here. KISS
|
| I admit, I tend to like using a sledgehammer to turn a machine screw,
| but in that respect, I’m usually thinking of scalability and data
| integrity.

Why use a sledgehammer when you can use the surgeon’s knife, SQLite?

That’s the RDBMS I’d use, if I were using an SQL DB in this situation.

It doesn’t always have to be PostgreSQL or Oracle. :-P

-- Phillip G.

On Tue, Apr 1, 2008 at 2:35 PM, Robert K.
[email protected] wrote:

Exactly. With regard to all that we’ve learned about the issue at
hand, a DB seems overkill here. KISS.

I admit, I tend to like using a sledgehammer to turn a machine screw,
but in that respect, I’m usually thinking of scalability and data
integrity.

When I said “there’s a time and place for this”, “this” was referring
to the various forms of flat file storage.

With this particular situation, I would probably go with YAML, and
migrate to a database if need be (which shouldn’t be that hard,
depending on how deeply nested the data is).

Todd


Todd B. wrote:

|
| Well, “sledgehammer” was for humor.

Duly recognized, just ignored to stretch the metaphor to its breaking
point. ;-)

I didn’t mean to imply that the analogy was devoid of humor.

| A better analogy of my approach would be this darn overly large
| swiss army knife that doesn’t always fit comfortably in my pocket,
| but I wear it any way just in case.

Well, a Leatherman would be my cultural weapon of choice. </discworld reference>

| My only problem with SQLite is lack of foreign key constraints.

Which, last I heard, is in the works.

However, the zero-config approach suits very well for rapid development,
and, at least, prototyping.

And with ORM tools like Sequel, or Og, details like the specific DB
become less of a concern, too.

-- Phillip G.
