On Sunday 25 April 2010 06:23:50 pm Moya Pilot wrote:
To explain my needs a little better, I am reading in thousands of lines a
day via a TCP connection. Most lines will be one-attribute updates to an
object, but the lines I am reading are like so:
NAME, jim joe, 123-456-7890, 123 elm st, zip
PREFS, firefox, flash 10, java, windows, jim joe
FAVORITES, facebook, jim joe, ebay, google
In other words, in a naive implementation, you’d be reading in the entire
document, parsing it, making the change, and writing it all back out. A
slightly more optimized form would be buffering a bit: read the entire
document in, cache something useful in RAM, then write it back out once a
minute or so, at the risk of losing a minute of updates if you crash.
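In Ruby, that naive cycle might look something like this. This is only a
sketch: it uses the standard CSV library (FasterCSV’s descendant in Ruby
1.9), and update_record and its block interface are made up for
illustration:

require 'csv'

def update_record(path, username)
  rows = CSV.read(path)                          # read the ENTIRE file into memory
  row  = rows.find { |r| r.include?(username) }  # linear scan for the user
  yield row if row                               # apply the one change
  CSV.open(path, 'w') do |csv|                   # then write EVERYTHING back out
    rows.each { |r| csv << r }
  end
end

Every one-line update costs a full read and a full rewrite of the file.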
In short, I am reading them in via fastercsv and working on using an if
statement to see if the username already exists in the array; if so,
update the object found with the needed data. If not, add the new data as
a new object in the array.
That is quite literally the textbook example of when a hash or a set makes
sense. Let’s take a simplified example, maybe an address book. Let’s
pretend I magically have an array like this:
addresses = [
  ['Joe', '[email protected]'],
  ['John S.', '[email protected]'],
  ['Alice', '[email protected]'],
  ['Bob', '[email protected]'],
  ['Eve', '[email protected]']
]
That’s reasonably small, so it’s reasonably quick to do. Basically, you
could do something like this:
def put(username, email)
  found = false
  addresses.each do |record|
    if record[0] == username
      record[1] = email   # update the existing record in place
      found = true        # note that we found a match
      break
    end
  end
  addresses << [username, email] unless found
end
Basically, you want to update the user’s email address in the database, or
create a new record if the user isn’t there already. Your code is probably
a bit more elegant, but Array’s “find” method is going to be doing the
same kind of thing under the hood anyway.
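For instance, a version written with find reads more cleanly, but it still
walks the array one record at a time (a sketch against the same imaginary
addresses array):

def put(username, email)
  record = addresses.find { |r| r[0] == username }  # still a linear scan
  if record
    record[1] = email               # found: update in place
  else
    addresses << [username, email]  # not found: append a new record
  end
end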
Now think about this. How long is that going to take? It’s proportionate
to the number of users in the database, and how long their names are. The
more users you have, the longer it will take.
Think about it: if you’re adding a new record, that means you just read
through the ENTIRE DATABASE just to find out they’re not there.
put('charlie', '[email protected]')    # needs to loop through 5 items
put('superman', '[email protected]')  # needs to loop through 6 items
…
The more data you have, the slower it gets, and it gets slower
proportionately. That means that adding a new record to an array of a
million records will take at least a thousand times as long as it takes to
add a new record to an array of a thousand records, and maybe longer,
because of all those string comparisons.
Say you want to look something up. It’s no better:
def get(username)
  addresses.each do |record|
    return record[1] if record[0] == username
  end
  nil   # not found, after scanning everything
end
Again, if the user isn’t there, you have to look at the ENTIRE DATABASE to
find out they aren’t.
This is about the least efficient way you could possibly ever do this, so
it’s important that you understand why this is so inefficient.
Now, here’s what it looks like as a hash:
addresses = {
  'Joe' => '[email protected]',
  'John S.' => '[email protected]',
  'Alice' => '[email protected]',
  'Bob' => '[email protected]',
  'Eve' => '[email protected]'
}
Now let’s write those methods again:
def put(username, email)
  addresses[username] = email
end

def get(username)
  addresses[username]
end
It’s not immediately obvious that these are faster. Maybe Ruby is doing
some magic behind the scenes that makes this just as slow?
But that’s not the case. If you’re curious, look up hash tables, but in
the meantime, take my word for it: this runs, on average, in constant
time.
That means it will take exactly the same amount of time to put an address
into a hash with a million entries as it would to put that address into a
hash with ten entries.
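If you don’t want to take my word for it, here’s a quick sketch using
Ruby’s standard Benchmark module. The sizes, usernames, and repeat count
are all made up for illustration; the point is that the array timings grow
with n while the hash timings stay flat:

require 'benchmark'

[1_000, 100_000].each do |n|
  pairs = (0...n).map { |i| ["user#{i}", "user#{i}@example.com"] }
  hash  = Hash[pairs]

  Benchmark.bm(16) do |b|
    # Look up a username that is never there, forcing a full scan each time.
    b.report("array, n=#{n}") { 1_000.times { pairs.find { |r| r[0] == 'missing' } } }
    b.report("hash,  n=#{n}") { 1_000.times { hash['missing'] } }
  end
end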
Plus, it’s less code anyway, once you understand it.
So in short, it’s much, MUCH more than simply just giving a name to
something like array[0]
But even if that’s the only reason, it’s still worthwhile. Readable code
is a Good Thing. Using bare integers as array offsets is something I
almost never do; it’s not always bad, but it’s almost always ugly.
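To illustrate, compare pulling fields out by position with pulling them
out by name (the field names here are invented):

# A record as a bare array: you have to remember what each index means.
record = ['Joe', '[email protected]', '123-456-7890']
email = record[1]
phone = record[2]

# The same record as a hash: the structure documents itself.
record = { 'name' => 'Joe', 'email' => '[email protected]', 'phone' => '123-456-7890' }
email = record['email']
phone = record['phone']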
Would hashes work, or would BDB, GDBM or SQLite be better suited towards
this?
I would tend towards SQLite with a decent ORM on top of it, maybe
DataMapper. But almost certainly yes.
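For a taste of what that buys you, here’s a sketch using the sqlite3 gem
directly, no ORM. The table name and schema are just illustrative:

require 'sqlite3'

db = SQLite3::Database.new('addresses.db')
db.execute(<<-SQL)
  CREATE TABLE IF NOT EXISTS addresses (
    username TEXT PRIMARY KEY,
    email    TEXT
  )
SQL

def put(db, username, email)
  # Insert, or overwrite the row if the username already exists.
  db.execute('INSERT OR REPLACE INTO addresses (username, email) VALUES (?, ?)',
             [username, email])
end

def get(db, username)
  db.get_first_value('SELECT email FROM addresses WHERE username = ?', [username])
end

The PRIMARY KEY gives SQLite an index on username, so these lookups don’t
scan the whole table either, and each update goes straight to disk instead
of rewriting an entire file.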
Flat files like CSV are not at all good at doing random updates, which is
what you just described. CSV is great for looking at your data in a
spreadsheet, if your data reasonably fits in a spreadsheet. Other flat
file formats can be much simpler to work with, especially for a config
file, or to make things human readable.
But as a database, they’re slow, and they get slower the more you put in
them.