What is the complexity of find_by_name?

Hi All
I was wondering whether find_by_name has linear complexity O(n) with n
being the number of records? Or is it log(n) or constant? I have some
code which uses find_by_name heavily, and I was wondering if the long
runtimes I’m seeing are a result of find_by_name being linear

Thanks
Kretch

Kretch Kretchmer wrote:

I was wondering whether find_by_name has linear complexity O(n) with n
being the number of records? Or is it log(n) or constant? I have some
code which uses find_by_name heavily, and I was wondering if the long
runtimes I’m seeing are a result of find_by_name being linear

Assuming you’re talking about Rails and ActiveRecord (this is a Ruby
group here; there are better places to ask Rails questions)…

find_by_name turns into an SQL “select * from <table> where name = ‘value’”.
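
For example, a minimal sketch (the “users” table and model are assumed
here, and the exact SQL varies slightly by Rails version):

user = User.find_by_name('Kretch')
# issues roughly:
#   SELECT * FROM users WHERE (users.name = 'Kretch') LIMIT 1
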
If you have query caching enabled (the default), this call might be
averted and be turned into a hash lookup… but assuming that isn’t the
case, the costs come down to several factors:

Any call to SQL has a large almost-constant overhead, simply due to the
context-switch involved. A remote procedure call (whether transported
using shared memory, pipes or sockets) involves upward of 100,000
instructions, and from Ruby, possibly much more - especially if several
layers of database adapters are involved. This will almost completely
flush the processor’s L1 caches, and possibly much of the L2 cache as
well. All the code executed, and all the data touched, will have to be
loaded from RAM, and that’s a significant time cost. A rule of thumb
would be 2-15 milliseconds per database call… an age in the CPU’s
time-scale. The limit here is the speed of main memory and the FSB
interface of your CPU chip.
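
If you want to see that fixed cost for yourself, a rough sketch using
the standard-library Benchmark module (the model and value here are
assumed):

require 'benchmark'

# total wall-clock time for 100 identical lookups; divide by 100 to get
# the per-call overhead, typically a few milliseconds each
puts Benchmark.measure { 100.times { Widget.find_by_name('foo') } }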

Next, SQL must parse and optimize the query, since Rails doesn’t make
any use of the stored procedures that most DBMSs provide to bypass
this step. The optimization process will hopefully find that you’ve
placed an index over the “name” field. If not, the entire table must
be scanned until the matching entry is found. In either case, quite a
lot of code is required to parse and optimize your query, and that will
normally complete the cache flushing mentioned above.
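
If that index is missing, adding one is a one-line migration. A minimal
sketch, assuming a “widgets” table (the table and migration names here
are hypothetical):

class AddIndexOnWidgetsName < ActiveRecord::Migration
  def self.up
    # lets the DBMS satisfy find_by_name with an index lookup
    add_index :widgets, :name
  end

  def self.down
    remove_index :widgets, :name
  end
end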

Then, executing the search requires a log(N) search through disk pages,
where N is the number of disk pages in the index: divide the size of
an index page (say 8K) by the average size of a key (plus say 16 bytes
of overhead) to get the fan-out per page, and hence how many pages a
lookup must touch. Each physical disk access will
take perhaps 5 milliseconds, but even a large index will only search 2-4
pages. However, the disk pages will normally exist in the DBMS’s buffer
cache, at least all the internal nodes of the indices should be, so you
may only have one real disk access. In these days of web apps having
databases of 20MB running on servers having 1GB, the need to hit the
disk is minimized. DBMS technology is optimized around the DB being
more than 10x the available RAM, but typical Rails apps never go close
to that.
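
To put rough numbers on that (the page and key sizes here are assumed,
not measured):

page_size  = 8 * 1024                      # 8K index page
entry_size = 32 + 16                       # average key plus overhead
fanout     = page_size / entry_size        # ~170 entries per page
records    = 1_000_000
depth      = (Math.log(records) / Math.log(fanout)).ceil
# => 3, so even a million-row index touches only about 3 pages,
# and most of those should already sit in the buffer cache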

Having found the correct index entry, the data page will normally need
to be loaded as well - again, Rails doesn’t know anything about
clustering the data with the most-used index (and even if it did,
that’d be the surrogate key “id” field).

Finally the results will be passed back to your program and turned from
a row into an ActiveRecord object. That requires your CPU to reload
most of the instructions (and some data) that make up the Ruby
interpreter and your program, hitting main memory again.

Bottom line: though there is a log(N) component - and perhaps even an
O(N) component if you forgot to add indexes - in most Rails apps you
are hit hardest by the fixed per-call costs.
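
One practical consequence: when the fixed per-call cost dominates,
batching lookups into a single query helps far more than tuning any one
call. A minimal sketch in Rails 2-era syntax (the model name is
assumed):

names   = ['foo', 'bar', 'baz']
# one round trip instead of three separate find_by_name calls
widgets = Widget.find(:all, :conditions => ['name IN (?)', names])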

I hope this message gives you some insight into why Rails database
performance is often poor, especially compared to more “enterprisey”
toolkits that streamline some of the above operations by making proper
use of the DBMS’ features.

Clifford H., Data Constellation.

Kretch Kretchmer wrote:

I was wondering whether find_by_name has linear complexity O(n) with n
being the number of records? Or is it log(n) or constant? I have some
code which uses find_by_name heavily, and I was wondering if the long
runtimes I’m seeing are a result of find_by_name being linear

Try: sudo gem install assert_efficient_sql

Then write a unit test like this:

def test_find_by_name_is_efficient
  assert_efficient_sql :verbose do
    Table.find_all_by_name('foo')
  end
end

It will print out the result of a MySQL EXPLAIN call. If the output
shows an access type of ALL, or does not contain the name of a
candidate index, then the SELECT is probably reading every record to
find your hit.
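
You can also run the EXPLAIN by hand from script/console to see the
same information (MySQL assumed; the table name is hypothetical):

rows = ActiveRecord::Base.connection.select_all(
  "EXPLAIN SELECT * FROM widgets WHERE name = 'foo'")
rows.each { |row| p row }
# a "type" of ALL with no "key" means a full table scan; "ref" plus the
# name of your index means the index is being used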

If you can take the :verbose out and the assertion still passes, then
the EXPLAIN output contains none of the indicators of inefficiency it
checks for, and the query is likely efficient.

If you leave that assertion in place as you add features, then if it
ever breaks, you know your efficiency has regressed.