On Sunday 25 April 2010 06:23:50 pm Moya Pilot wrote:
To explain my needs a little better, I am reading in thousands of lines a
day via a TCP connection. Most lines will be one-attribute updates to an
object, but the lines I am reading are like so:
NAME, jim joe, 123-456-7890, 123 elm st, zip
PREFS, firefox, flash 10, java, windows, jim joe
FAVORITES, facebook, jim joe, ebay, google
In other words, in a naive implementation, you’d be reading in the entire
document, parsing it, making the change, and writing it all back out. A
slightly more optimized form would be buffering a bit: read the entire
document in, cache something useful in RAM, then write it back out once a
minute or so, at the risk of losing a minute of updates if you crash.
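In Ruby, that naive cycle might look something like this. This is only a
sketch: it uses the standard CSV library (FasterCSV’s descendant in Ruby
1.9), and update_record and its block interface are made up for
illustration:

require 'csv'

def update_record(path, username)
  rows = CSV.read(path)                          # read the ENTIRE file into memory
  row  = rows.find { |r| r.include?(username) }  # linear scan for the user
  yield row if row                               # apply the one change
  CSV.open(path, 'w') do |csv|                   # then write EVERYTHING back out
    rows.each { |r| csv << r }
  end
end

Every one-line update costs a full read and a full rewrite of the file.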
In short, I am reading them in via fastercsv and working on using an if
statement to see if the username already exists in the array; if so,
update the object found with the needed data. If not, add the new data as
a new object in the array.
That is quite literally the textbook example of when a hash or a set makes
sense. Let’s take a simplified example, maybe an address book. Let’s
pretend I magically have an array like this:
addresses = [
  ['Joe', '[email protected]'],
  ['John S.', '[email protected]'],
  ['Alice', '[email protected]'],
  ['Bob', '[email protected]'],
  ['Eve', '[email protected]']
]
That’s reasonably small, so it’s reasonably quick to do. Basically, you
could do something like this:
def put(username, email)
  found = false
  addresses.each do |record|
    if record[0] == username
      record[1] = email   # update the existing record in place
      found = true        # note that we found a match
      break
    end
  end
  addresses << [username, email] unless found
end
Basically, you want to update the user’s email address in the database, or
create a new record if the user isn’t there already. Your code is probably
a bit more elegant, but Array’s “find” method is going to be doing the
same kind of thing under the hood anyway.
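For instance, a version written with find reads more cleanly, but it still
walks the array one record at a time (a sketch against the same imaginary
addresses array):

def put(username, email)
  record = addresses.find { |r| r[0] == username }  # still a linear scan
  if record
    record[1] = email               # found: update in place
  else
    addresses << [username, email]  # not found: append a new record
  end
end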
Now think about this. How long is that going to take? It’s proportionate
to the number of users in the database, and how long their names are. The
more users you have, the longer it will take.
Think about it: if you’re adding a new record, that means you just read
through the ENTIRE DATABASE just to find out they’re not there.
put('charlie', '[email protected]')    # needs to loop through 5 items
put('superman', '[email protected]')  # needs to loop through 6 items
…
The more data you have, the slower it gets, and it gets slower
proportionately. That means that adding a new record to an array of a
million records will take at least a thousand times as long as it takes to
add a new record to an array of a thousand records, and maybe longer,
because of all those string comparisons.
Say you want to look something up. It’s no better:
def get(username)
  addresses.each do |record|
    return record[1] if record[0] == username
  end
  nil   # not found, after scanning everything
end
Again, if the user isn’t there, you have to look at the ENTIRE DATABASE to
find out they aren’t.
This is about the least efficient way you could possibly ever do this, so
it’s important that you understand why this is so inefficient.
Now, here’s what it looks like as a hash:
addresses = {
  'Joe' => '[email protected]',
  'John S.' => '[email protected]',
  'Alice' => '[email protected]',
  'Bob' => '[email protected]',
  'Eve' => '[email protected]'
}
Now let’s write those methods again:
def put(username, email)
  addresses[username] = email
end

def get(username)
  addresses[username]
end
It’s not immediately obvious that these are faster. Maybe Ruby is doing
some magic behind the scenes that makes this just as slow?
But that’s not the case. If you’re curious, look up hash tables, but in
the meantime, take my word for it: this runs, on average, in constant
time.
That means it will take exactly the same amount of time to put an address
into a hash with a million entries as it would to put that address into a
hash with ten entries.
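If you don’t want to take my word for it, here’s a quick sketch using
Ruby’s standard Benchmark module. The sizes, usernames, and repeat count
are all made up for illustration; the point is that the array timings grow
with n while the hash timings stay flat:

require 'benchmark'

[1_000, 100_000].each do |n|
  pairs = (0...n).map { |i| ["user#{i}", "user#{i}@example.com"] }
  hash  = Hash[pairs]

  Benchmark.bm(16) do |b|
    # Look up a username that is never there, forcing a full scan each time.
    b.report("array, n=#{n}") { 1_000.times { pairs.find { |r| r[0] == 'missing' } } }
    b.report("hash,  n=#{n}") { 1_000.times { hash['missing'] } }
  end
end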
Plus, it’s less code anyway, once you understand it.
So in short, it’s much, MUCH more than simply just giving a name to
something like array[0]
But even if that’s the only reason, it’s still worthwhile. Readable code
is a Good Thing. Using bare integers as array offsets is something I
almost never do; it’s not always bad, but it’s almost always ugly.
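To illustrate, compare pulling fields out by position with pulling them
out by name (the field names here are invented):

# A record as a bare array: you have to remember what each index means.
record = ['Joe', '[email protected]', '123-456-7890']
email = record[1]
phone = record[2]

# The same record as a hash: the structure documents itself.
record = { 'name' => 'Joe', 'email' => '[email protected]', 'phone' => '123-456-7890' }
email = record['email']
phone = record['phone']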
Would hashes work, or would BDB, GDBM or SQLite be better suited towards
this?
I would tend towards SQLite with a decent ORM on top of it, maybe
DataMapper. But almost certainly yes.
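For a taste of what that buys you, here’s a sketch using the sqlite3 gem
directly, no ORM. The table name and schema are just illustrative:

require 'sqlite3'

db = SQLite3::Database.new('addresses.db')
db.execute(<<-SQL)
  CREATE TABLE IF NOT EXISTS addresses (
    username TEXT PRIMARY KEY,
    email    TEXT
  )
SQL

def put(db, username, email)
  # Insert, or overwrite the row if the username already exists.
  db.execute('INSERT OR REPLACE INTO addresses (username, email) VALUES (?, ?)',
             [username, email])
end

def get(db, username)
  db.get_first_value('SELECT email FROM addresses WHERE username = ?', [username])
end

The PRIMARY KEY gives SQLite an index on username, so these lookups don’t
scan the whole table either, and each update goes straight to disk instead
of rewriting an entire file.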
Flat files like CSV are not at all good at doing random updates, which is
what you just described. CSV is great for looking at your data in a
spreadsheet, if your data reasonably fits in a spreadsheet. Other flat
file formats can be much simpler to work with, especially for a config
file, or to make things human readable.
But as a database, they’re slow, and they get slower the more you put in
them.