First, I’m a physicist, not a computer scientist. I need to store 65M
rows now, with a growth rate of ~21M/year whose entries arrive daily.
The purpose of this DB is to store metadata for environmental
observations for the ionosphere. Basically it’s a glorified file
catalogue. I need the usual: fast enough inserts, reasonable
recovery, REST like access, etc. I have access to Ruby AR/AS 2.1.1,
Ruby 1.8.5 and eventually MySQL 5.1.x.
I’m trying to trade some storage space and lack of normalization for
simplicity of design and ease of maintenance by me. Basically, I want
to partition one very large table. My PK is essentially across 3
columns: filename, md5 checksum, destination. I’d like to partition
by year which is discernible by the filename and by obs_time if I can
but obs_time is not currently part of the PK. If that happens, then I
would have up to 40M rows in a single table though most tables would
have a few million; perhaps that is too many; perhaps a table for each
half year would work.
Currently, I have no foreign keys but I may want them one day.
Any help and suggestions would be really appreciated!
My schema follows. I have created a simple model off this table and
I’m using these decorators: validation, and before_save and after_save
to build an inventory summary (in another table not shown). I’ve
tried this on a non-partitioned table so far awaiting the installation
of MySQL 5.1.x by our IT department.
How do I go about integrating AR with a partitioned table? Can you
advise a simple pattern I might follow for such a large but simple
relationship?
create table ionofiles
(
– TODO optimize variable sizes
id
int(15) NOT NULL auto_increment PRIMARY KEY,
file
varchar(100) NOT NULL, – Any
valid filename without the “.~N~” backup extensions.
md5
varchar(100) NOT NULL, – MD5 Sum
destination
varchar(20) NOT NULL, – E.g.
MIDS, ADIC, AFWA, SWPC, …
`version` varchar(20) default NULL, -- Optional
version.
`ursi` char(5) default NULL, -- E.g. BC840,
WI937. default NULL because we allow any filename.
`time_scope` varchar(20) default NULL, -- E.g.
(individual, daily, monthly, yearly, etc). default NULL
– because we allow non time-scoped files.
`content_type` varchar(20) default NULL, -- E.g.
(root, scaled, image, ionogram)
`obs_time1` datetime default NULL, -- Start of
measurement. Default NULL because we allow
– non data files.
obs_time2
datetime default NULL, – End of
measurement, in the case file represents
– multiple measurements. Default NULL because we allow
– non data files.
`extension` varchar(5) default NULL, -- file
extension (any case). E.g. SAO, MMM, 16C, TXT, png.
`size` int(11) NOT NULL, -- file
size in bytes
`backup` int(2) default NULL, -- N of the
“~N~” tilda backup file.
`is_ready` boolean NOT NULL default 0, -- Is this
file’s links ready? 0 = No, 1 = Yes.
is_public
boolean NOT NULL default 1, – Is this
file publicly available? 0 = Public, 1 = Private.
is_removed
boolean NOT NULL default 0, – Has
this file been marked deleted?,
– s.t. we
don’t ever want to see it again?
– This is
useful in the event that we want to reject future submissions of file.
`file_on_disk` varchar(100) NOT NULL, -- Actual file name on
disk: E.g. BC840_2009001000000.SAO.~10~
path
varchar(200) NOT NULL, – A directory or a
full path to a ZIP or TAR or other aggregate file.
– E.g.
data/BC840/individual/2009/001/scaled/
– E.g.
data/BC840/individual/2009/BC840_2009001_2009-01-01.tar.gz
`last_verified` datetime default NULL, -- last time we checked
that this file actually exists at stated destination
– TODO: ActiveRecord fields. What should be the data type?
Timestamp or Datetime?
– The auto set and update features are currently handled by
AR.
– Should we put some ON UPDATE, etc actions here for non AR
interaction?
created_on
datetime NOT NULL, –
ActiveRecord auto set.
updated_on
datetime NOT NULL, –
ActiveRecord auto updated.
`note` TEXT default NULL, -- User
note
UNIQUE KEY `key_file_md5_destination_public` ( `file`, `md5`,
destination
, is_public
),
KEY `key_file` ( `file` ),
KEY `key_md5` ( `md5` ),
KEY `key_destination` ( `destination` ),
KEY `key_version` ( `version` ),
KEY `key_ursi` ( `ursi` ),
KEY `key_time_scope` ( `time_scope` ),
KEY `key_content_type` ( `content_type` ),
KEY `key_obs_time1` ( `obs_time1` ),
KEY `key_obs_time2` ( `obs_time2` ),
KEY `key_extension` ( `extension` ),
KEY `key_size` ( `size` ),
KEY `key_backup` ( `backup` ),
KEY `key_is_ready` ( `is_ready` ),
KEY `key_is_public` ( `is_public` ),
KEY `key_is_removed` ( `is_removed` ),
KEY `key_file_on_disk` ( `file_on_disk` ),
KEY `key_last_verified` ( `last_verified` ),
KEY `key_created_on` ( `created_on` ),
KEY `key_updated_on` ( `updated_on` )
) ENGINE=InnoDB;
R