Design flaw? - num_processors, accept/close

On 16 Oct 2007, at 13:45, Zed A. Shaw wrote:

No, as usual performance panic has set in and you’re not looking at
the problem in the best way to solve it.

Sorry Zed, I have a great deal of respect for your work and your
opinions on development. But you seem to have a blind spot here and I
just don’t understand why.

This has nothing to do with optimisation. It has nothing to do with
performance. It’s got everything to do with resilience and reliability.

Clearly what you say about waiting for remote services is true. Doing
so is a Bad Thing and an application shouldn’t do it. But you’re
missing the point.

Your philosophy guarantees that your application’s performance will be
held hostage by the worst performing action within it. What if I
screw up and accidentally roll out a “bad” action? Should this mean
that every aspect of my app now behaves terribly? Following your
logic, it does. The whole point of a load balancer is that it should
enable things to behave sensibly even if one of my backend servers is
screwed up. But a mismatch between the expectations encoded within
mod_proxy_balancer and Mongrel running Ruby on Rails means that this
isn’t the case.

Similarly, if I write a quick and dirty reporting action which runs
an SQL query which takes 10 seconds to complete, should that screw up
my entire application? It seems unreasonable to me that I have to
optimise an action like this (why should I care if a reporting action
which is only used once a day takes 10 seconds to complete?). I do
care, though, if every time I run it I cause all the 0.1-second
actions to queue up behind it.


paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park…
Who says I have a one track mind?

LinkedIn: https://www.linkedin.com/in/paulbutcher
MSN: [email protected]
AIM: paulrabutcher
Skype: paulrabutcher

Brian Williams wrote:

algorithm that does exactly this - round robin but to worker with
lightest load.

Were you on Apache 2.0 or 2.2?

mod_proxy_balancer is 2.2 only. It has the same features as lighty’s
balancer, and many important ones that it doesn’t. We had 2.0 <->
lighttpd <-> mongrel_cluster. I like 2.2/mod_proxy_balancer better.
Lighty missed some features we needed and I wasn’t prepared to
implement them myself.

I made heavy use of the following logging features in Apache and m_p_b
for diagnostics:

– request duration in microseconds (lighty only offers seconds… ugh)
– client session cookie
– balancer member (which load balancer member Apache sent the request to)
– client socket status at end of request
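
For reference, such a format might be wired up roughly like this in
Apache 2.2. This is a sketch: %D, %X, %{Cookie}i, and the
BALANCER_WORKER_NAME environment variable are standard
mod_log_config/mod_proxy_balancer fields, but the exact format string
and log name are my guesses, not Brian’s:

    # %D = request duration in microseconds
    # %{BALANCER_WORKER_NAME}e = which balancer member got the request
    # %X = client socket status at the end of the request
    # %{Cookie}i = the client's cookies, session cookie included
    LogFormat "%h %t \"%r\" %>s %D %{BALANCER_WORKER_NAME}e %X \"%{Cookie}i\"" balancer_diag
    CustomLog logs/balancer_log balancer_diag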

I should correct the Round Robin misperception. More accurately,
mod_proxy_balancer does request balancing: the module sends an equal
number of requests to each back end, at least according to the docs.
It has another mode where it balances by bytes transferred instead.
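
The mode is selected with lbmethod in the balancer config; a sketch of
what that looks like in Apache 2.2 (the hosts and ports are made up):

    <Proxy balancer://mongrels>
      BalancerMember http://127.0.0.1:8000
      BalancerMember http://127.0.0.1:8001
    </Proxy>
    # byrequests = equal request counts (the default);
    # bytraffic  = balance by bytes transferred instead
    ProxyPass / balancer://mongrels/ lbmethod=byrequests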

The icing on the cake for me was mod_proxy_balancer’s status page. It
gives a live view of configured balancer pools and stats for each pool
member.

On Tue, 16 Oct 2007 07:52:19 -0700
“Brian Williams” [email protected] wrote:

Just to clarify, we were accessing a web service that typically returns
results in < 1 second. But due to network issues out of our control, these
requests were going into a black hole, and waiting for tcp timeouts.
Admittedly, since this was to an external service, we could shift to a model
where all updates are asynchronous, but this doesn’t help in the cases that
Paul mentions, such as slower reporting queries or slow actions born of
programmer error, which end up degrading the experience for all users of
the site.
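
One standard mitigation for the black-hole case (a sketch using plain
Net::HTTP with a made-up host, not something from this thread): set
explicit socket timeouts so a Mongrel isn’t parked waiting for the
OS-level TCP timeout.

    require 'net/http'

    http = Net::HTTP.new('api.example.com', 80)  # hypothetical service host
    http.open_timeout = 2  # seconds allowed to establish the connection
    http.read_timeout = 3  # seconds allowed to wait for the response
    begin
      response = http.get('/lookup')
    rescue Timeout::Error
      response = nil  # fall back to cached data instead of tying up a Mongrel
    end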

There’s also an odd thing about performance: users judge “slow” by the
range of response times, not by the mean. I have no idea why, but
I’ve seen it over and over again. You’ll take a system that has a mean
perf of 2 seconds, but a range of .5 to 10 seconds and they think it’s
“slow”. Tune the system so that it has 3 second mean perf, but only a
range of 2 to 4 seconds and they think it’s “fast as hell”.

But yes, if the service isn’t under your control then you’ll get hit by
this over and over. It’s better to set up an “async firewall” both in
the service layer and in your UI so that they don’t deal with things
that are potentially variable.

Assuming we did switch to an asynchronous model, I would think it would be
more like - show me latest FOO, trigger backend update to get latest FOO,
return last cached FOO. Or if you know what FOO is, you periodically
update it, and don’t bother triggering an update.
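
That cached-FOO flow is tiny in code; a sketch with hypothetical Cache
and JobRunner stand-ins (neither is a real library name here):

    # "Show me latest FOO": kick off a refresh, serve what we already have.
    def latest_foo
      JobRunner.enqueue(:refresh_foo)      # trigger the backend update asynchronously
      Cache.read('foo') || 'no data yet'   # return the last cached FOO immediately
    end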

There’s a few general approaches you can try depending on the type of
application you’ve got and what you can do with the UI. I like to
generally categorize them into the “polling” or “inbox” methods.

In polling, your controllers have four general actions to deal with the
request: submit, poll, cancel, get. In this one, the user submits
their request like normal, and you display a “Waiting for this to happen
dude…” message. Your submit action builds the request and hands it to
some service that does the real work (like backgroundrb) then returns
the waiting message immediately. The waiting page then simply has a bit
of ajaxy good javascript that hits the poll method to see if it’s done
yet, and updates a spinner or something. If you want a cancel link on
the waiting page, then cancel would abort, tell the backgroundrb to
stop, and shunt the user off to the end. Finally, when the poll method
says it’s done, you redirect to the “get” action to retrieve the final
result.
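
A minimal sketch of those four actions in a Rails controller of the era.
JobRunner is a hypothetical stand-in for whatever does the real work
(backgroundrb or similar), so none of these names come from an actual
library:

    class ReportsController < ApplicationController
      # submit: hand the work off, show the "waiting" page immediately
      def submit
        session[:job_id] = JobRunner.enqueue(:build_report, params[:id])
        render :action => 'waiting'   # its JS polls the `poll` action
      end

      # poll: the ajax endpoint the waiting page keeps asking
      def poll
        render :text => (JobRunner.done?(session[:job_id]) ? 'done' : 'pending')
      end

      # cancel: abort the background job and shunt the user off
      def cancel
        JobRunner.abort(session[:job_id])
        redirect_to :action => 'index'
      end

      # get: retrieve the final result once poll says it's done
      def get
        @report = JobRunner.result(session[:job_id])
        render :action => 'show'
      end
    end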

There are many variations on this depending on the type of tech you have,
and it typically works best for situations where the user will eventually
see something and shouldn’t go off doing other things, such as in
a strict biz process where they MUST complete this task before moving on
(like looking up a flight on a reservation system).

In the “inbox” method (or email method) you just adopt the tried and
true method of having an inbox and outbox. Users get a way to submit
requests. That goes in the outbox. They can then see all the pending
stuff. You then have your background processor just pull things out of
people’s outboxes, process them, and put the results in the inbox.
Simple, and the UI for this means you have lots of chances to give them
something else to do. The nice thing about this approach is the user
doesn’t have to care who’s dealing with it, and they can even setup
scheduled tasks that just get run and results are put in their inbox
(which would mean no need for an outbox, but maybe a tasks folder).
Canceling is simply a matter of removing it from their inbox.
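
The background half of this is just a loop; a sketch assuming
hypothetical OutboxItem and InboxItem ActiveRecord models (the names are
mine, not from the thread):

    loop do
      OutboxItem.find(:all, :conditions => { :state => 'pending' }).each do |item|
        item.update_attribute(:state, 'processing')
        begin
          result = item.perform   # whatever work the request describes
          InboxItem.create!(:user_id => item.user_id, :body => result)
          item.destroy            # done, so it leaves the outbox
        rescue StandardError
          item.update_attribute(:state, 'failed')  # leave it visible for retry
        end
      end
      sleep 5   # poll the outbox table every few seconds
    end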

Really good uses for this are of course things like email and printing,
but also having reports generated, conducting big number crunches,
asking for analysis, etc.

The trick is then to come up with a UI model that lets you use the inbox
method whenever you can. Let’s take the flight system as an example.
Currently they have a polling method on most sites, but you could do an
inbox method if the user interface was more conversational and based on
secondary information about the user (like, they have Delta miles). In
this model, the user puts in more information up front, or it’s
inferred, then says “tell me what you can find for me.” The system uses
all its power to go out and look for a flight, potentially taking days,
and simply puts status or results in the person’s inbox for review or
acceptance. The user would have to understand this UI approach and see
an advantage to it, so if the results weren’t better than just a quick
query via polling it’s pointless.

A nice advantage of this is the user can train whatever engine you use
the same way they train a Bayes classifier. Imagine if the above
reservation system puts potential flights in your inbox, and you go in
and just smack a “hell no/maybe so/more like this” button. This trains
the flight reservation finding engine to give you better and better
results until it finds what you want. Keep this information around and
eventually the user will think your flight system is absolutely perfect.

The test that you’ve got an “inbox” method right for a flight
reservation system is if people can reserve flights they want via txt
message off their phones over the course of a day.

Another place for this would be in a movie site like Netflix. Instead
of saying genres and exact movies, you go in and give demographic
information as well as some movies you like. After this initial
training, you put out requests for things like “Give me movies I might
like that are sad.” Netflix makes a “folder” for this called “movies
that are sad” and starts to fill it with what it thinks you might like. It
actually doesn’t know, but as the user classifies what is sad or not,
Netflix begins to learn more and gives better sad-movie results.
Eventually users are just getting movies that they’ve pre-classified and
don’t even bother searching.

And as usual, I’ll put in my disclaimer that this isn’t a boolean decision
and these aren’t the only two solutions. In fact, combining the above
for a flight reservation system would be a powerful metaphor if you
could figure it out without confusing people.

Hope that helps.


Zed A. Shaw

On Tue, 16 Oct 2007 14:01:16 +0100
Paul B. [email protected] wrote:

On 16 Oct 2007, at 13:45, Zed A. Shaw wrote:

No, as usual performance panic has set in and you’re not looking at
the problem in the best way to solve it.

Sorry Zed, I have a great deal of respect for your work and your
opinions on development. But you seem to have a blind spot here and I
just don’t understand why.

That’s because you’re reading my recommendation as “performance tuning
vs. design to avoid”. If you’ve read any of my work you’ll understand I
never advocate a boolean argument. Those are for computers.

In my argument I’m saying that his problem can never be solved because
he doesn’t have control of the performance at all, and why should the
user’s HTTP REQUEST be held up for this? You get the distinction? Your
HTTP request processing doesn’t have to be coupled to your backend
request processing. Break them apart and then you can ensure the user
gets rapid feedback, you have fewer bottlenecks, you can push the
processing out, and you can measure orthogonal pieces rather than one
giant messy process.

This has nothing to do with optimisation. It has nothing to do with
performance. It’s got everything to do with resilience and reliability.

No, resilience and reliability are quantifiable metrics: Mean Time
Between Failures, to be exact. “Performance” is a subjective thing that’s
based on human perception. Yes, I know you can go get yourself a little
graph of requests per second, but that won’t tell you if the users think
it is fast.

If you can’t make the computer fast, trick the people to think it’s
fast.

Your philosophy guarantees that your application’s performance will be
held hostage by the worst performing action within it.

Again, no, I’m not saying don’t try to make it fast. What I’m saying is
that the first thing programmers do is run off with faulty statistics to
“tune” their system, completely ignoring the fact that many times a
simple redesign (or complex improvement) can just eliminate the problem
entirely. See my most recent reply to Brian for many examples.

What if I screw up and accidentally roll out a “bad” action? Should this mean
that every aspect of my app now behaves terribly? Following your
logic, it does. The whole point of a load balancer is that it should
enable things to behave sensibly even if one of my backend servers is
screwed up. But a mismatch between the expectations encoded within
mod_proxy_balancer and Mongrel running Ruby on Rails means that this
isn’t the case.

Well I didn’t do a logic proof so you’re inventing logic where there is
none. My “logic” would be this:

The fastest way to do something is to just not do it.

Right? That basically gives you an infinite number of requests per
second. :-)

But ultimately, I’ve been doing this a long time, and the one thing I’ve
realized is, no matter how fast you make something, there’s always a
bigger dumbass available to make it slow. Hell man, computers have
blasted in capability and speed over the years, and still I have to wait
for my damn email to render in the fastest email client I could find.

No amount of making things fast will protect you against stupidity.

Similarly, if I write a quick and dirty reporting action which runs
an SQL query which takes 10 seconds to complete, should that screw up
my entire application? It seems unreasonable to me that I have to
optimise an action like this (why should I care if a reporting action
which is only used once a day takes 10 seconds to complete?). I do
care, though, if every time I run it I cause all the 0.1-second
actions to queue up behind it.

I’d reword this: “I have SQL queries that take 10 seconds to complete
and I’m stuck using Mongrel because nobody else has stepped up to fix
the dumbass crap in Ruby’s GC, IO, and Threads and even the JRuby guys
can’t solve their ‘mystery’ performance problem with Rails…”

Option A:

“… I’m totally screwed and should toss myself off a building because I
keep banging my head on this thing and it doesn’t go faster.”

Option B:

“… I’m rich and will just put 1000 mongrels in the mix and solve the
problem.”

Option C:

“… I know queueing theory and can work up a queuing model that will
help me figure out the minimum number of request processors to handle
the queue at a 10 req/sec rate.”

Option D:

“… I can analyze the performance of all my stuff and tune it as fast
as possible, then try C and B.”

Option E:

“… Well, let’s try some stuff on the front end and see if we can just
trick people into thinking how this goes so that there isn’t a problem
anymore.”

Any of them will work, but with Rails options E, D, C, and B work best
(in that order). Please don’t do A, it’s not that big a deal.
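
For Option C, the back-of-the-envelope version is one line of
arithmetic: by Little’s law, the concurrency you need is arrival rate
times mean service time. The numbers below are assumed for
illustration, not Zed’s:

    arrival_rate      = 10.0   # requests per second, from the example above
    mean_service_time = 0.25   # seconds, an assumed average across actions
    puts (arrival_rate * mean_service_time).ceil   # => 3 mongrels, before headroom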

Epilogue (not just for Paul): A lot of people complain that Rails
should be thread safe. Well, Rails Core folks including DHH also
complain that it should be thread safe. Under JRuby you can spin up a
ton of real threads with entire Rails apps in each one, but that’s
suboptimal for memory usage (like Java cares).

If all of you think that Rails shouldn’t have a giant lock, then I have
only one suggestion:

Get off your damn ass and make it happen.

David just made a big effort to make the process for submitting patches
much more open and he’s looking for people to solve this problem. I
dare say he’s admitted he was wrong about the locking issue and is
ultra-keen (I won’t say desperate) for someone to solve it. Nothing is
in your way, and the reward will be the glory of making things fast.
Worked for me, and I can say it’s totally worth it.

As a sweetener, I’ll throw this out: I bet you can NOT make Rails
threadsafe. The first person or group of people to finally get rid of
the thread locking around Rails requests in Mongrel and make Rails
performance match that of Merb or Nitro on average will get a real
high-school-style trophy from me. The trophy will have a bust of a dog
on it and will be inscribed: “Official Mongrel Rails Threadify Ninja
Destroyer 1st Place: Zed and DHH were wrong!”. The runner-up will get
the first set of MUDCRAP-CE certificates, and I’ll hand them out at the
next Rails or Ruby conference in person.

Alright, I’ve ponied up my end of the bargain. Who’s going to take me
on?


Zed A. Shaw

I tried a bunch of tools and ended up using Jakarta JMeter because it
works cross-platform (Unix/Windows in my situation) and has a GUI, which
flattens the learning curve. I’ve been quite satisfied with it…

Search the archives on this list and you should be able to find some
great info Zed and others wrote a while back on How to Load Test
Correctly. I found that series of posts to be pretty helpful in testing
Mongrel/Rails. IIRC, this was about 10 months ago.

Best,

Steve

The query should not take 10 seconds. People should not steal. Still,
they do, and I live with the workaround – locking.

So, while the 10-second query is a problem, and worth solving for its
own sake, the mod_proxy_balancer solution prevents it from causing the
secondary request-queuing problem.

That might eliminate enough crisis meetings that someone actually has
time to fix the underlying problem without working through the
weekend. Which in turn lessens the likelihood of anyone choosing option A.

Ok, make me work. That’s a good one but not the one I was thinking of.
Digging, I find these two from Zed - the first one is the one I was
thinking of, and the second refers to a screencast that will surely
help you out too:

http://rubyforge.org/pipermail/mongrel-users/2006-May/000200.html

http://rubyforge.org/pipermail/mongrel-users/2007-March/003111.html

Also don’t miss this epic one from Ezra on how to figure out the right
number of mongrels:

http://rubyforge.org/pipermail/mongrel-users/2006-December/002591.html

Summarizing, because that’s all I’m good for on this list…

Steve

p.s. Looks like my 10-month guess was actually pretty wrong :-)

Is this the thread you’re remembering – looks like it might be:

http://rubyforge.org/pipermail/mongrel-users/2006-September/001349.html

Thanks.