- There are no indexes apart from the primary key, so every query
execution must scan the entire table.
- The use of the “like” predicate would prevent the use of indexes, if
they existed.
- The heavy use of blobs is a huge performance drag - blobs cost an
extra I/O per record (at least).
Essentially, you will want a table of phrases indexed on the phrase,
with one phrase to a row, a table of URLs, and an associative table
that is indexed by both phrase_id and url_id (a single compound index).
There is also the list of synonyms that you have not described yet, but
which belongs in the data model.
Don’t be put off by a few million rows. The index is a B+ tree with a
fan-out determined by the number of keys per page - the page size, less
some overheads, divided by the entry size. After about 1,000 rows
performance does not degrade noticeably.
With an 8k page size, a 4-byte key, 4 bytes allowed for the row index,
and some other overheads, you can expect about 800 index entries per
page. Up to 800 rows, a search costs 1 I/O to locate the entry in the
index plus 1 I/O to look up the row in the table (2 I/Os total).
Between 800 and 640,000 rows (800^2) you will average 3 I/Os per
search, and between 640,000 and 512,000,000 rows (800^3) 4 I/Os (yes,
there are real-life examples that get that big and perform well).
1 I/O = 10 milliseconds worst case on current hardware. This does not
allow for the performance boost that the RDBMS buffer pool caching will
give automatically.
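The row-count thresholds above fall straight out of the fan-out. A quick sketch (assuming the 800-entries-per-page figure; the exact fan-out on your system will differ) to check the arithmetic:

```python
def index_ios(rows, fanout=800):
    """I/Os per search: number of B+ tree levels needed to cover
    `rows` keys, plus one more I/O to fetch the row from the table."""
    levels, capacity = 1, fanout
    while capacity < rows:
        levels += 1
        capacity *= fanout
    return levels + 1

# One index level covers 800 rows, two cover 800^2 = 640,000,
# three cover 800^3 = 512,000,000.
print(index_ios(800))          # 2
print(index_ios(640_000))      # 3
print(index_ios(512_000_000))  # 4
```

At 10 ms per I/O worst case, even the half-billion-row case is roughly 40 ms per search before any buffer-pool caching helps.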
This is not quite right, but it should be close:
CREATE TABLE urls (
  id int(11) NOT NULL auto_increment,
  url varchar(250) default NULL,
  title varchar(250) default NULL,
  mdesc blob,
  udate datetime default NULL,
  PRIMARY KEY (id)
);

CREATE TABLE phrases (
  id int(11) NOT NULL auto_increment,
  phrase varchar(250) default NULL,
  auth varchar(250) default NULL,
  xmlfeed varchar(250) default NULL,
  date_last_mention varchar(15) default NULL,
  date_last_external varchar(15) default NULL,
  date_last_photo varchar(15) default NULL,
  date_last_rss varchar(15) default NULL,
  seed_external1 blob,
  seed_external2 blob,
  seed_external3 blob,
  seed_photo1 blob,
  seed_photo2 blob,
  seed_photo3 blob,
  auth_external blob,
  auth_photo blob,
  auth_rss blob,
  PRIMARY KEY (id)
);
create unique index phrase_ndx1 on phrases (phrase);
create table synonyms (
  id int(11) NOT NULL auto_increment PRIMARY KEY,
  phrase_id int(11) NOT NULL references phrases(id),
  synonym_id int(11) NOT NULL references phrases(id)
);
create unique index phrase_synonym on synonyms (phrase_id, synonym_id);
create table relations (
  id int(11) NOT NULL auto_increment PRIMARY KEY,
  phrase_id int(11) NOT NULL references phrases(id),
  url_id int(11) NOT NULL references urls(id)
);
-- This index is useful in some RDBMSs, and superfluous in others.
-- Experiment with yours to see whether the indexes created in the
-- “references” clauses above work better or worse than this index
-- for your RDBMS. In some RDBMSs, this will allow an index-only
-- scan, eliminating one I/O from each fetch.
create unique index phrase_url on relations (phrase_id, url_id);
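One way to run that experiment is to ask the query planner directly. A sketch using Python’s built-in sqlite3 as a stand-in (SQLite, not MySQL, so the planner output and behaviour on your RDBMS will differ): SQLite reports “COVERING INDEX” when it can answer a query from the index alone, i.e. the index-only scan mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        phrase_id INTEGER NOT NULL,
        url_id INTEGER NOT NULL
    );
    CREATE UNIQUE INDEX phrase_url ON relations (phrase_id, url_id);
""")

# Both columns the query needs live in the compound index, so the
# planner never has to touch the table itself.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url_id FROM relations WHERE phrase_id = ?",
    (1,),
).fetchall()
print(plan[0][-1])  # e.g. "SEARCH relations USING COVERING INDEX phrase_url (phrase_id=?)"
```

Most RDBMSs have an equivalent (EXPLAIN in MySQL); look for wording like “covering index” or “index only” in the plan.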
The SQL to retrieve all URLs for a given phrase would be something
along the lines of this (someone else will have to express it in
Rails-ese - I know all about making RDBMSs perform, but my Rails is
still weak):
select a.url
from urls a,
     relations b,
     phrases c
where a.id = b.url_id
  and b.phrase_id = c.id
  and c.phrase = ?
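To see the three-table data model and that query working end to end, here is a minimal sketch - again using sqlite3 as a stand-in for MySQL, with the schema stripped down to just the columns the join needs and some made-up example data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE urls    (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE phrases (id INTEGER PRIMARY KEY, phrase TEXT);
    CREATE TABLE relations (
        id INTEGER PRIMARY KEY,
        phrase_id INTEGER NOT NULL REFERENCES phrases(id),
        url_id INTEGER NOT NULL REFERENCES urls(id)
    );
    CREATE UNIQUE INDEX phrase_ndx1 ON phrases (phrase);
    CREATE UNIQUE INDEX phrase_url ON relations (phrase_id, url_id);

    -- hypothetical example data
    INSERT INTO urls VALUES (1, 'http://example.com/a'), (2, 'http://example.com/b');
    INSERT INTO phrases VALUES (1, 'ruby on rails');
    INSERT INTO relations VALUES (1, 1, 1), (2, 1, 2);
""")

# The lookup: phrase -> phrase_id -> url_id -> url, each hop served
# by an index rather than a table scan.
rows = conn.execute(
    """select a.url
       from urls a, relations b, phrases c
       where a.id = b.url_id
         and b.phrase_id = c.id
         and c.phrase = ?""",
    ("ruby on rails",),
).fetchall()
print(sorted(r[0] for r in rows))
```

Note the query has no “like” in it: the phrase is matched exactly against the unique index on phrases.phrase.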
Here’s a gotcha: I normally use prepared statements. Rails does not
have prepared statements. The people running the show so far seem to
think that string substitution provides adequate performance, which is
true in a limited context but not all contexts.
If Rails had prepared statements this would perform much better.
Rails’ string-substitution model means that the heavy work of building
the access path from the SQL must happen on every call to the RDBMS. A
prepared statement would allow the access path to be built once (at
prepare time) and reused for the duration of the RDBMS connection.
So, while the actual data access may be fast with this schema, it is
possible that the overheads of the prepare phase may overwhelm the
benefits of the data model until/unless we can convince someone that
prepared statements are more than an oddity.
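To make the distinction concrete, here is a sketch of the two styles - using sqlite3 as the stand-in again. With string substitution the SQL text differs on every call, so it must be re-parsed and re-planned each time (and it invites SQL injection); with bound parameters the SQL text is constant, so the driver can compile the statement once and reuse the plan (Python’s sqlite3, for instance, keeps a per-connection statement cache for exactly this reason):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (id INTEGER PRIMARY KEY, phrase TEXT)")
conn.execute("INSERT INTO phrases VALUES (1, 'ruby on rails')")

phrase = "ruby on rails"

# String substitution: new SQL text on every call, re-prepared each time.
row1 = conn.execute(
    "SELECT id FROM phrases WHERE phrase = '%s'" % phrase
).fetchone()

# Bound parameter: constant SQL text, prepared once and reused.
row2 = conn.execute(
    "SELECT id FROM phrases WHERE phrase = ?", (phrase,)
).fetchone()

print(row1 == row2)  # True - same result, very different cost profile at scale
```

Both return the same row; the difference only shows up as per-call prepare overhead once you are issuing the query thousands of times per second.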