- There are no indexes apart from the primary key, so every query
execution must scan the entire table.
- The use of the “like” predicate would prevent the use of indexes, if
they existed.
- The heavy use of blobs is a huge performance drag - blobs cost an
extra I/O per record (at least).
Essentially, you will want a table of phrases indexed on the phrase,
with one phrase to a row, a table of URLs, and an associative table
that is indexed by both phrase_id and url_id (a single compound index).
There is also the list of synonyms that you have not described yet, but
which belongs in the data model.
Don’t be put off by a few million rows. The index is a B+ tree with a
fan-out determined by the number of keys per page - the page size, less
some overheads, divided by the entry size. After about 1,000 rows
performance does not degrade noticeably.
With an 8k page size, a 4-byte key, 4 bytes allowed for the row index,
and some other overheads, you can expect about 800 index entries per
page. Up to 800 rows, a search costs 1 I/O to locate the entry in the
index plus 1 I/O to look up the row in the table (2 I/Os total).
Between 800 and 640,000 rows (800^2) you will average 3 I/Os per
search, and between 640,000 and 512,000,000 rows (800^3) 4 I/Os (yes,
there are real-life examples that get that big and perform well).
1 I/O = 10 milliseconds worst case on current hardware. This does not
allow for the performance boost that the RDBMS buffer pool caching will
give automatically.
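The row-count thresholds above fall straight out of the fan-out. A quick sketch (assuming the 800-entries-per-page figure; the exact fan-out on your system will differ) to check the arithmetic:

```python
def index_ios(rows, fanout=800):
    """I/Os per search: number of B+ tree levels needed to cover
    `rows` keys, plus one more I/O to fetch the row from the table."""
    levels, capacity = 1, fanout
    while capacity < rows:
        levels += 1
        capacity *= fanout
    return levels + 1

# One index level covers 800 rows, two cover 800^2 = 640,000,
# three cover 800^3 = 512,000,000.
print(index_ios(800))          # 2
print(index_ios(640_000))      # 3
print(index_ios(512_000_000))  # 4
```

At 10 ms per I/O worst case, even the half-billion-row case is roughly 40 ms per search before any buffer-pool caching helps.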
This is not quite right, but it should be close:
CREATE TABLE urls (
  id int(11) NOT NULL auto_increment,
  url varchar(250) default NULL,
  title varchar(250) default NULL,
  mdesc blob,
  udate datetime default NULL,
  PRIMARY KEY (id)
);

CREATE TABLE phrases (
  id int(11) NOT NULL auto_increment,
  phrase varchar(250) default NULL,
  auth varchar(250) default NULL,
  xmlfeed varchar(250) default NULL,
  date_last_mention varchar(15) default NULL,
  date_last_external varchar(15) default NULL,
  date_last_photo varchar(15) default NULL,
  date_last_rss varchar(15) default NULL,
  seed_external1 blob,
  seed_external2 blob,
  seed_external3 blob,
  seed_photo1 blob,
  seed_photo2 blob,
  seed_photo3 blob,
  auth_external blob,
  auth_photo blob,
  auth_rss blob,
  PRIMARY KEY (id)
);
create unique index phrase_ndx1 on phrases (phrase);
create table synonyms (
  id int(11) NOT NULL auto_increment PRIMARY KEY,
  phrase_id int(11) NOT NULL references phrases(id),
  synonym_id int(11) NOT NULL references phrases(id)
);
create unique index phrase_synonym on synonyms (phrase_id, synonym_id);
create table relations (
  id int(11) NOT NULL auto_increment PRIMARY KEY,
  phrase_id int(11) NOT NULL references phrases(id),
  url_id int(11) NOT NULL references urls(id)
);
-- This index is useful in some RDBMSs, and superfluous in others.
-- Experiment with yours to see whether the indexes created in the
-- “references” clauses above work better or worse than this index
-- for your RDBMS. In some RDBMSs, this will allow an index-only
-- scan, eliminating one I/O from each fetch.
create unique index phrase_url on relations (phrase_id, url_id);
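One way to run that experiment is to ask the query planner directly. A sketch using Python’s built-in sqlite3 as a stand-in (SQLite, not MySQL, so the planner output and behaviour on your RDBMS will differ): SQLite reports “COVERING INDEX” when it can answer a query from the index alone, i.e. the index-only scan mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        phrase_id INTEGER NOT NULL,
        url_id INTEGER NOT NULL
    );
    CREATE UNIQUE INDEX phrase_url ON relations (phrase_id, url_id);
""")

# Both columns the query needs live in the compound index, so the
# planner never has to touch the table itself.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url_id FROM relations WHERE phrase_id = ?",
    (1,),
).fetchall()
print(plan[0][-1])  # e.g. "SEARCH relations USING COVERING INDEX phrase_url (phrase_id=?)"
```

Most RDBMSs have an equivalent (EXPLAIN in MySQL); look for wording like “covering index” or “index only” in the plan.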
The SQL to retrieve all URLs for a given phrase would be something
along the lines of this (someone else will have to express it in
Rails-ese - I know all about making RDBMSs perform, but my Rails is
still weak):
select a.url
from urls a,
     relations b,
     phrases c
where a.id = b.url_id
  and b.phrase_id = c.id
  and c.phrase = ?
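To see the three-table data model and that query working end to end, here is a minimal sketch - again using sqlite3 as a stand-in for MySQL, with the schema stripped down to just the columns the join needs and some made-up example data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE urls    (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE phrases (id INTEGER PRIMARY KEY, phrase TEXT);
    CREATE TABLE relations (
        id INTEGER PRIMARY KEY,
        phrase_id INTEGER NOT NULL REFERENCES phrases(id),
        url_id INTEGER NOT NULL REFERENCES urls(id)
    );
    CREATE UNIQUE INDEX phrase_ndx1 ON phrases (phrase);
    CREATE UNIQUE INDEX phrase_url ON relations (phrase_id, url_id);

    -- hypothetical example data
    INSERT INTO urls VALUES (1, 'http://example.com/a'), (2, 'http://example.com/b');
    INSERT INTO phrases VALUES (1, 'ruby on rails');
    INSERT INTO relations VALUES (1, 1, 1), (2, 1, 2);
""")

# The lookup: phrase -> phrase_id -> url_id -> url, each hop served
# by an index rather than a table scan.
rows = conn.execute(
    """select a.url
       from urls a, relations b, phrases c
       where a.id = b.url_id
         and b.phrase_id = c.id
         and c.phrase = ?""",
    ("ruby on rails",),
).fetchall()
print(sorted(r[0] for r in rows))
```

Note the query has no “like” in it: the phrase is matched exactly against the unique index on phrases.phrase.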
Here’s a gotcha: I normally use prepared statements. Rails does not
have prepared statements. The people running the show so far seem to
think that string substitution provides adequate performance, which is
true in a limited context but not all contexts.
If Rails had prepared statements this would perform much better.
Rails’ string-substitution model means that the heavy work of building
the access path from the SQL must happen on every call to the RDBMS. A
prepared statement would allow the access path to be built once (at
prepare time) and reused for the duration of the RDBMS connection.
So, while the actual data access may be fast with this schema, it is
possible that the overheads of the prepare phase may overwhelm the
benefits of the data model until/unless we can convince someone that
prepared statements are more than an oddity.
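To make the distinction concrete, here is a sketch of the two styles - using sqlite3 as the stand-in again. With string substitution the SQL text differs on every call, so it must be re-parsed and re-planned each time (and it invites SQL injection); with bound parameters the SQL text is constant, so the driver can compile the statement once and reuse the plan (Python’s sqlite3, for instance, keeps a per-connection statement cache for exactly this reason):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (id INTEGER PRIMARY KEY, phrase TEXT)")
conn.execute("INSERT INTO phrases VALUES (1, 'ruby on rails')")

phrase = "ruby on rails"

# String substitution: new SQL text on every call, re-prepared each time.
row1 = conn.execute(
    "SELECT id FROM phrases WHERE phrase = '%s'" % phrase
).fetchone()

# Bound parameter: constant SQL text, prepared once and reused.
row2 = conn.execute(
    "SELECT id FROM phrases WHERE phrase = ?", (phrase,)
).fetchone()

print(row1 == row2)  # True - same result, very different cost profile at scale
```

Both return the same row; the difference only shows up as per-call prepare overhead once you are issuing the query thousands of times per second.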