Avoiding ext3's 4K entry limit when caching


#1

I’ve just started checking out caching. I have
thousands of items with URLs like ‘/item/view/ID’. I
see caching puts them in public/item/view/ID.html. I’m
using Linux’s ext3 filesystem and I’ve run into
problems before with caching and ext3’s 4K entries per
directory limit. How can I avoid this?

thanks
csn




#2

Wow, this is an excellent question.

ext3’s performance with super large directories can actually be pretty
decent with dir_index (it’s depressingly bad without it). You could
alter ext3 itself to allow more entries, but honestly, with that many
files in a directory you probably spend more time seeking the directory
for the file than you would spend generating the page dynamically. You
could harvest the oldest files in the directory every few minutes, or
whatever seems appropriate, with a scheduled job.
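
As a rough illustration of the scheduled-job idea, here is a minimal Python sketch that removes cached pages older than a cutoff. The function name, the one-hour age, and the `public/item/view` path are assumptions for the example, not anything the caching layer provides; a cron entry would simply call the script every few minutes.

```python
import os
import time


def harvest_old_cache_files(cache_dir, max_age_seconds, now=None):
    """Delete cached .html files older than max_age_seconds.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    stale = []
    # First pass: collect stale entries without deleting during the scan.
    with os.scandir(cache_dir) as entries:
        for entry in entries:
            if not entry.name.endswith(".html"):
                continue
            if not entry.is_file(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                stale.append((entry.name, entry.path))
    # Second pass: remove them; a file may vanish if the app expired it first.
    removed = []
    for name, path in stale:
        try:
            os.remove(path)
            removed.append(name)
        except FileNotFoundError:
            pass
    return sorted(removed)


# Hypothetical usage from a cron-driven script:
#   harvest_old_cache_files("public/item/view", max_age_seconds=3600)
```

Going by modification time keeps the job simple; it caps how long a page stays cached rather than capping the number of files, so the interval and age would need tuning against how fast new pages get cached.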

It’d be very interesting to see how other large sites have handled this.
I’d push for either the cron job or rethinking what you cache (i.e., don’t
cache whole pages, just cache the parts that repeat). With more than a few
hundred files in a directory you’ll lose a lot of performance no matter
what filesystem you use.

Looking forward to hearing from some more experienced deployment folk,

-Matt B


#3

Matthew B. wrote:

> Wow, this is an excellent question.
>
> ext3’s performance with super large directories can actually be pretty
> decent with dir_index (it’s depressingly bad without it). You could
> alter ext3 itself to allow more entries, but honestly, with that many
> files in a directory you probably spend more time seeking the directory
> for the file than you would spend generating the page dynamically. You
> could harvest the oldest files in the directory every few minutes, or
> whatever seems appropriate, with a scheduled job.

Matthew,

What are the performance characteristics of ext3 filesystems with and
without dir_index for small directories up to large ones?

How many files do you need in a directory before dir_index is worth it?

Right now all my filesystems do not have dir_index enabled, so it would
require some downtime to enable it.

Regards,
Blair


#4

On Wed, 2005-11-16 at 11:00 -0800, Blair Z. wrote:

> Matthew,
>
> What are the performance characteristics of ext3 filesystems with and without
> dir_index for small directories up to large ones?

Not sure about any benchmarks. I wouldn’t say “stunning” is a bad word to
use. Basically, instead of scanning a linear list of directory entries it
uses B-trees, the same tech that makes reiserfs directories so damn fast.

dir_index -
Use hashed b-trees to speed up lookups in large directories.

> How many files do you need in a directory before dir_index is worth it?

I don’t know. But if you have 4000 I’d say that’s a good place to
start :)
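
In the absence of published numbers, one way to get a feel for the break-even point is to measure it. The sketch below is only an assumption about how such a test might look: `time_lookups` and its parameters are made up for the example. It creates empty files in a scratch directory and times `os.stat()` on random names. Pass `base_dir` to put the scratch directory on the filesystem under test; otherwise it lands wherever the system temp directory lives.

```python
import os
import random
import tempfile
import time


def time_lookups(num_files, num_lookups=1000, base_dir=None, seed=0):
    """Create num_files empty files in a fresh scratch directory and return
    the average seconds per os.stat() call on randomly chosen names."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory(dir=base_dir) as directory:
        names = [f"{i}.html" for i in range(num_files)]
        for name in names:
            open(os.path.join(directory, name), "w").close()
        # Pick the lookup targets up front so only os.stat() is timed.
        targets = [os.path.join(directory, rng.choice(names))
                   for _ in range(num_lookups)]
        start = time.perf_counter()
        for path in targets:
            os.stat(path)
        elapsed = time.perf_counter() - start
    return elapsed / num_lookups
```

Comparing the result for, say, 100, 1000 and 4000 files on a filesystem with and without dir_index should show where the difference starts to matter. Note that freshly created files are likely still in the kernel's caches, so these timings are a best case rather than a cold-disk measurement.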

> Right now all my filesystems do not have dir_index enabled, so it would require
> some downtime to enable it.

Yeah, that’s the crappy part. In theory you can enable it with
tune2fs, but in practice I’ve only gotten it working with mke2fs. My tools
at the time were mostly from an older Debian release, though, and this was
unheard-of stuff when they were written.
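
For checking whether a filesystem already has the feature, `tune2fs -l` prints a "Filesystem features:" line. The Python sketch below parses that line; the helper names and the device path are placeholders for the example. Going by the e2fsprogs man pages, `tune2fs -O dir_index DEVICE` sets the flag and `e2fsck -fD DEVICE` on an unmounted filesystem rebuilds the existing directories, but treat that as something to verify against your own version of the tools.

```python
import subprocess


def parse_features(tune2fs_output):
    """Return the set of feature names found on the
    'Filesystem features:' line of `tune2fs -l` output."""
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            return set(line.split(":", 1)[1].split())
    return set()


def has_dir_index(device):
    """Run `tune2fs -l DEVICE` (usually needs root) and report whether
    dir_index appears among the filesystem features."""
    result = subprocess.run(
        ["tune2fs", "-l", device],
        capture_output=True, text=True, check=True,
    )
    return "dir_index" in parse_features(result.stdout)


# Hypothetical usage; the device name is a placeholder:
#   has_dir_index("/dev/sda1")
```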

gl! Let us know how you fare.

-Matthew B.