Lookups on large sql server data base with Activerecord that is currently using shards

Jedrin · March 14, 2013, 3:34pm

We have a sql server database which can have something like 50 meg of
records of any one type. The way that the database was originally
designed
with activerecord/sql is that every time you get a couple of million
records or so, it creates a new shard or a new table.
In effect for one record type we might have 20 or 30 different tables
(or
shards) that all have the same schema. When you need to look up
a particular record, you get it’s sched field and see which shard
(tables)
it is in and query those shards. This also involves a manual step
in that someone has to monitor the system and go to the web server and
click on “create new shard” whenever there are too many records in
the currently active shard which is where newly created records go.

We are looking at a new approach that involves clustered indexes or non
clustered indexes built on some of the fields we typically
query on. It then occurred to me however that if we created a parent
record
for each sched and had the records use a foriegn key to that
parent, perhaps we could get similar performance for our data and we
could
not have to worry about the shards ?

This seems like more of a rails activerecord approach and I am very
familiar with that way of doing things, but somewhat less familiar with
larger data sets such as what I am describing, though when I thought
about
it, I do recall that there are large data sets used by some
companies, though I forget if there was other strategies such as caching
and such used as well.

We currently have 4 different sets of sharded record types. One of them
looks something like below for where clauses and group/order.
Where I have a * it indicates that in the Activerecord find(:all) call
it
may or may not be passed, that is what is in the conditions part can
vary.
Such a lookup can return many thousands of records all having the same
sched, script etc.

where clause: sched, *script

group/order: message_id, timeslice, script, label, scale, scaled_units,
*type

My first pass was to create a clustered indexd on sched, script, and
label. However I am wondering if there are easier approaches such as I
indicated above that might make more sense ?