12,000 articles in a Radiant site?

I’ve been enlisted to take a 12,000+ article site (static .html
pages!) with 500k unique visitors a month and convert it to a dynamic
site where the owner can more readily flow advertising, among other
things, alongside the content.

The HTML (not XHTML) is mostly all old bad stuff, the clean-up and
conversion of which is a separate task and discussion entirely.

My question for the other members of the core team and the community
at large is:

  1. Am I crazy to be considering Radiant as the starting point for
    this project? I know I will need to section-up and somewhat re-invent
    the admin page tree, at minimum, but despite the size and popularity
    of the site, there are not a lot of unique CMS features needed.

It’s definitely an option to go custom from the ground up, and there may
be just enough in their budget to accommodate the custom route.
However, because of this client’s unique situation (the site is
soon to be sold), time is of the essence, and for this reason starting
with Radiant could be a valuable jump start.

The questions in my mind now are about caching and the core
performance of Radiant under what could be significant load. I don’t
have peak number of pages served per second or minute right now, but
will have those numbers shortly.

  2. Is there any precedent for such a thing? I seem to remember a
    discussion a while back about relative site size and thought I
    remembered seeing that someone is currently managing a 1-2k page
    site in it.

12K articles, that’s a mini Wikipedia!

Just one thought.

It depends on your requirements, but you don’t need to go fully
“dynamic”. I mean, you can do everything in Radiant (adjusting the
admin UI as you said) and then generate all the pages and store them
to be served statically. That is, something like making the current
Radiant cache usable by the web server directly. When something is
updated, regenerate the static content. Maybe all of it, or, more
wisely, only the modified parts, if possible.
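Roughly, the idea is something like this sketch, run from the Radiant
console (the `export` helper and the output directory are made up, and
`Page#render` behaviour should be double-checked before relying on it):

```ruby
require 'fileutils'

# Sketch only: walk the page tree and write each rendered page into a
# directory that Apache/Nginx can serve without touching Rails at all.
def export(page, root = 'public/static')
  if page.published?
    path = File.join(root, page.url, 'index.html')
    FileUtils.mkdir_p(File.dirname(path))
    File.open(path, 'w') { |f| f.write(page.render) }
  end
  page.children.each { |child| export(child, root) }
end

export(Page.find_by_parent_id(nil)) # start at the site root
```

Hook the same thing (or a per-page variant) into a save callback and you
get the regenerate-on-update behaviour.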

I have been thinking of doing this with some Radiant projects I’m
working on. It would let you “use” Radiant on cheap shared hosting with
no Rails support, for example. I will probably develop a Radiant
extension for this purpose in the coming weeks.

A good friend has developed a high-traffic site, updated daily, using
this technique (not in Radiant), and the customer was very satisfied
with the result. So it’s possible to do it.

/AITOR

So, this doesn’t sound like a rewrite of Radiant yet, hence maybe I
get something worthwhile by using Radiant as the starting point.

Are there any changes to caching that we could consider for core
which would also make sense at this scale?

I think that the caching is about as good as it’s going to get, and
without hard numbers I’d suggest that handing the caching over
to apache probably isn’t worth the effort.

Are there any changes to the Admin page interface?

If your pages already have a natural hierarchy that keeps the number of
children of a specific page fairly low (I’d say up to about
20-30 should be fine), then I don’t think you’ll have a problem.

If you’ve got something like my situation, where you’re using something
like ArchivePage and letting it have a few hundred children,
then you’d want a way for a page type to be able to segment the child
list (i.e., when expanding the archive page in the admin,
instead of directly expanding to the children, it provides a list of
years which expand into months).
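Something like this hypothetical helper on the page class (the method
name is invented, and the admin tree partial would still need to be
taught to use it):

```ruby
class ArchivePage < Page
  # Sketch: bucket children by year, then by month, so the admin tree
  # can expand in stages instead of listing hundreds of pages at once.
  def children_by_year_and_month
    years = Hash.new { |hash, year| hash[year] = Hash.new { |h, m| h[m] = [] } }
    children.each do |page|
      next unless page.published_at # skip drafts with no date
      years[page.published_at.year][page.published_at.month] << page
    end
    years
  end
end
```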

Dan.

I seem to remember this in another caching discussion: what about
response headers? From what I recall, meta tags can’t stand in for
every header, so simply caching the HTML won’t
necessarily fill every need.

Could the caching scheme be extended to store and transmit cached
headers as well?

-Andrew

The current system already does. If people start talking about handing
off the caching to the web server instead, there’s not really any way
to do that without customising your web server to some extent.
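Conceptually it’s nothing fancier than this (an illustration of the
idea only, not the actual ResponseCache code):

```ruby
require 'yaml'

# Illustration: persist a whitelist of headers next to the cached body,
# then replay both on a cache hit.
CACHEABLE = %w(Content-Type Last-Modified Expires)

def write_cache(path, response)
  kept = response.headers.reject { |k, v| !CACHEABLE.include?(k) }
  File.open("#{path}.yml", 'w')  { |f| f.write(kept.to_yaml) }
  File.open("#{path}.html", 'w') { |f| f.write(response.body) }
end

def read_cache(path, response)
  response.headers.merge!(YAML.load_file("#{path}.yml"))
  response.body = File.read("#{path}.html")
end
```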

If somebody finds radiant’s caching performance inadequate, the next
logical step that I can see is to implement the caching, as it currently
functions, in the web server.

That would either be in the form of an apache module or a custom handler
for mongrel (I’d say the latter would be the best first step).
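The core of a cache-serving mongrel handler is only about twenty lines;
a rough, untested sketch (how a miss falls through to Rails depends on
how you chain your handlers):

```ruby
require 'mongrel'

# Sketch: serve straight from Radiant's cache directory, never touching
# Rails on a hit. (Real code would also sanitise the path.)
class CacheHandler < Mongrel::HttpHandler
  def initialize(cache_dir)
    @cache_dir = cache_dir
  end

  def process(request, response)
    path = File.join(@cache_dir, request.params[Mongrel::Const::PATH_INFO])
    path = File.join(path, 'index.html') if File.directory?(path)
    return unless File.file?(path) # miss: let the Rails handler run
    response.start(200) do |head, out|
      head['Content-Type'] = 'text/html'
      out.write(File.read(path))
    end
  end
end
```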

If you’ve got the sort of traffic where this matters to you (very few
people will), your choices are:

a) develop an apache module / mongrel handler to handle caching
b) pay someone (me? I like money) to do that for you
c) rethink your business model so that you have the money for a) or b)

Dan.

See, this is why it’s awesome to have a performance guru on the core
team. Thank you for clearing up some misunderstandings of mine.

Loren, if we assume 500,000 unique visitors a month and that every
one of them requests only one page, that would still be only 0.19
requests per second. It would take each of them requesting more than
500 pages per visit to even reach the benchmark Daniel cites. So, I
think Daniel may be right in the 100 req/sec vs. 600 req/sec debate.
The only issue then is to make sure that any extensions you add or
create don’t slow things down or make them too unstable.
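The back-of-envelope math, for reference:

```ruby
visitors_per_month = 500_000
seconds_per_month  = 30 * 24 * 3600         # => 2_592_000
visitors_per_month.to_f / seconds_per_month # => ~0.19 requests/sec
```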

Like Daniel suggested, your issue may then be the admin UI, which will
take a long time to render if your structure is really flat. One thing
we may do for Digital Pulp (even though their estimated site size is
around 300 pages) is to add a live-search box to the site map so pages
can be quickly found.

Cheers!

Sean

Aitor,

I for one would be very interested in your extension. I’m looking
forward to seeing it soon…

saji


Loren,

As Aitor suggests, it sounds like this site could really use some
caching reminiscent of Rails’ page caching model.

In a non-Radiant site I have worked on recently, the entire public site
is page-cached, but each page takes a long time to generate (lots of
data, averaging 3 sec/page). A cron job runs about every 30 minutes
that does three things after updating time-sensitive data:

  1. Rename all the cached .html files to .htm (the web server has a
    rewrite rule to defer to .htm if .html isn’t present)
  2. Hit each page on a separate app-server process that isn’t in the
    proxy pool, resulting in a cached .html
  3. Remove all the .htm pages

This way, we never serve up dynamic pages to users.
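In script form, the whole cron job is roughly this (the paths, port,
and page list are placeholders for our setup):

```ruby
require 'fileutils'
require 'net/http'

CACHE = '/var/www/example/public'  # placeholder path
PAGES = %w(/ /reports/ /data/)     # placeholder list of URLs to warm

# 1. Demote the current cache; the rewrite rule now falls back to .htm.
Dir.glob("#{CACHE}/**/*.html") { |f| FileUtils.mv(f, f.sub(/\.html$/, '.htm')) }

# 2. Re-request every page through an app server outside the proxy
#    pool (port 8100 here), repopulating the .html cache.
PAGES.each { |path| Net::HTTP.get('localhost', path, 8100) }

# 3. Throw away the stale .htm copies.
Dir.glob("#{CACHE}/**/*.htm") { |f| FileUtils.rm(f) }
```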

Now, I imagine one of the largest slowdowns on your site is going to be
Page.find_by_url because it is recursive. Radiant could really be sped
up by caching the URL in the database, which would reduce lookup time
for most pages, and having the recursive method as a fallback. If you
want, we can hash over the design of this optimization together (perhaps
with John and Daniel) and apply it to the core.
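As a starting point for that discussion, the shape of it might be
something like this (the `cached_url` column and method names are
invented):

```ruby
# Sketch: denormalize each page's full URL into an indexed column so
# most lookups become a single query, keeping the recursive walk as a
# fallback.
class Page < ActiveRecord::Base
  before_save :store_cached_url

  def self.find_by_url_fast(url, live = true)
    page = find(:first, :conditions => ['cached_url = ?', url])
    return page if page && (!live || page.published?)
    find_by_parent_id(nil).find_by_url(url, live) # recursive fallback
  end

  private

  def store_cached_url
    self.cached_url = url
    # NB: moving or renaming a page also invalidates its descendants'
    # cached URLs; that cascade is the fiddly part of the design.
  end
end
```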

Sean


As Aitor suggests, it sounds like this site could really use some
caching reminiscent of Rails’ page caching model.

The radiant caching model is actually pretty damn quick - not raw-apache
quick, but pretty damn close. Look through the archives for
the actual figures from my benchmarks, but my system could push out
somewhere around 100 req/sec using radiant’s cache compared to
600 req/sec for raw apache. In the grand scheme of things, that
performance difference is negligible (from what little data I can
find, 100 req/sec would be more than adequate to handle a
slashdotting/digging). Performance is about half that if you don’t use
mod_x_sendfile (I advise using mod_x_sendfile).

Now, I imagine one of the largest slowdowns on your site is going to
be Page.find_by_url because it is recursive.

Page.find_by_url isn’t actually that big of a problem anymore. You’ll
get a db hit for each level of your page hierarchy, but that
doesn’t actually add up to much - again, look through the archives for
the benchmarks, but I don’t think it starts impacting
your performance until you get quite deep. Also, radiant’s caching does
not hit the db at all - only non-cached requests have to
deal with find_by_url.

Radiant used to have a problem with serving up a wide list of pages, but
unless you use something that changes the url scheme (i.e.
ArchivePage) that is no longer a problem - if you do use something like
ArchivePage, then you can fix it up by customising the
find_by_url method in that page to look up child pages directly (which I
should really get around to doing in the core ArchivePage).
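The customisation amounts to something like this (illustrative only;
check the real signature of Page#find_by_url before copying it):

```ruby
class ArchivePage < Page
  # Sketch: resolve child URLs with a direct slug query instead of
  # recursing through hundreds of children one at a time.
  def find_by_url(url, live = true, clean = true)
    slug = url.to_s.sub(%r{/+$}, '').split('/').last
    child = children.find(:first, :conditions => ['slug = ?', slug])
    return child if child && (!live || child.published?)
    super
  end
end
```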

My site (www.thegroggysquirrel.com) is currently serving ~800 pages, all
images (~421 of various sizes) are also served through
radiant’s caching. My main issue with that many pages is that I have a
very wide list of pages (most pages are direct children of
either /comics or /articles), so the admin interface takes a while to
load that list up. The highest traffic the site’s had to deal
with is 2,000 visitors a day, but it did that without hiccup (I’m
convinced that my bandwidth will give out before radiant does).

Radiant could really be sped up by caching the URL in the database,
which would reduce lookup time for most pages, and having the
recursive method as a fallback. If you want, we can hash over the
design of this optimization together (perhaps with John and Daniel)
and apply it to the core.

I’m fairly sure that the speedup there is negligible now that
Page.find_by_url isn’t so bad… I think I might have benchmarked
that to see whether it was worth doing work on and it came up fine.

Dan.

Yes, this is all very good news. I’m quite happy to not have to make
any caching or lookup improvements, and it makes all the more sense to
base this project on Radiant. Daniel, thanks very much for explaining
this and doing the original benchmarks. I hope that once this project
is done it can serve as an example of this sort of performance in
practice.

As for page depth, the current content organization is pretty
abnormal / chaotic, reminiscent of 12 years of static file
maintenance. I have some 400 page nodes which will be 200+ pages down
the tree, currently with around 400 pages at the root level. So for
now, other than the file import routine, I’ll focus my energies on a
“SiteGrande” Admin interface extension which will, among other things:

  • Visually compress the page tree (restyle the current page tree,
    dropping the page icons and the variation in font-size
    between node levels, reducing the row height, etc.)
  • Add an “Edit from this root” link which would filter the page view
    to that node inward with a “back” link of some sort
  • Search / Live Search capacity in some form

Thanks much for the comments so far,

Loren

P.S. Anybody who wants work and has experience batch-processing legacy
HTML should contact me :-)

Ok, so nobody said crazy…

That’s good.

I guess there are three things that seem absolutely necessary to
address then:

  1. Entirely revisit Radiant’s page caching – possibly getting out of
    any proxy-to-Ruby situation entirely and letting Apache (or Nginx,
    or…) do its simple magic. Possibly even using Radiant as a static
    site generator?

  2. Depending on how the caching happens (static generation or not),
    implementing a core change to Page to store its own full path, and
    to Page.find_by_url to remove the currently necessary recursion.
    Sean – why would we still need to keep it as a fallback?

  3. New page admin interface

So, this doesn’t sound like a rewrite of Radiant yet, hence maybe I
get something worthwhile by using Radiant as the starting point.

Are there any changes to caching that we could consider for core
which would also make sense at this scale?

Are there any changes to the Admin page interface?

L

Sean,

Would love to look into this further and help out if I can find the
time, as I’m currently serving a predominantly static site with Nginx
in front of a couple of Mongrel instances. If I could have a
Rails-style caching method, it would essentially make Radiant the
admin front end for managing the content, and regular end users would
never get anywhere near anything that looks like Ruby. So speed should
be awesome, and I could deploy an extra few sites on the same hardware.

Glenn