From: Mike Galbraith <efault@gmx.de>
To: Greg Smith <gsmith@gregsmith.com>
Cc: Ingo Molnar <mingo@elte.hu>,
Peter Zijlstra <peterz@infradead.org>,
Dhaval Giani <dhaval@linux.vnet.ibm.com>,
lkml <linux-kernel@vger.kernel.org>,
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Subject: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
Date: Tue, 27 May 2008 07:59:10 +0200 [thread overview]
Message-ID: <1211867950.5505.47.camel@marge.simson.net> (raw)
In-Reply-To: <Pine.GSO.4.64.0805260851510.24771@westnet.com>
[-- Attachment #1: Type: text/plain, Size: 7103 bytes --]
On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
> I did again get useful results here with the stock 2.6.26.git kernel and
> default parameters using Peter's small patch to adjust se.waker.
>
> What I found most interesting was how the results changed when I set
> /proc/sys/kernel/sched_features = 0, without doing anything with batch
> mode. The default for that is 1101111111=895. What I then did was run
> through setting each of those bits off one by one to see which feature(s)
> were getting in the way here. The two that mattered a lot were 895-32=863
> (no SCHED_FEAT_SYNC_WAKEUPS ) and 895-2=893 (no
> SCHED_FEAT_WAKEUP_PREEMPT). Combining those two but keeping the rest of
> the features on (895-32-2=861) actually gave the best result I've ever
> seen here, better than with all the features disabled. Tossing out all
> the tests I did that didn't show anything useful, here's my table of the
> interesting results.
>
> Clients .22.19 .26.git waker f=0 f=893 f=863 f=861
> 1 7660 11043 11041 9214 11204 9232 9433
> 2 17798 11452 16306 16916 11165 16686 16097
> 3 29612 13231 18476 24202 11348 26210 26906
> 4 25584 13053 17942 26639 11331 25094 25679
> 6 25295 12263 18472 28918 11761 30525 33297
> 8 24344 11748 19109 32730 12190 31775 35912
> 10 23963 11612 19537 31688 12331 29644 36215
> 15 23026 11414 19518 33209 13050 28651 36452
> 20 22549 11332 19029 32583 13544 25776 35707
> 30 22074 10743 18884 32447 14191 21772 33501
> 40 21495 10406 18609 31704 11017 20600 32743
> 50 20051 10534 17478 29483 14683 19949 31047
> 60 18690 9816 17467 28614 14817 18681 29576
>
> Note that compared to earlier test runs, I replaced the 5 client case with
> a 60 client one to get more data on the top end. I also wouldn't pay too
> much attention to the single client case; that one really bounces around a
> lot on most of the kernel revs, even with me doing 5 runs and using the
> median.
Care to give the below a whirl? If fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to addresss the needs
of any particular load though, that needs more thought (tricky issue).
With default features, I get the below.
2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434
(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though, maximizing performance is separate issue imho)
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.
Signed-off-by: Mike Galbraith <efault@gmx.de>
kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)
Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
struct task_struct *curr = this_rq->curr;
unsigned long tl = this_load;
unsigned long tl_per_task;
+ int bad_imbalance;
- if (!(this_sd->flags & SD_WAKE_AFFINE))
+ if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
return 0;
/*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync && tl)
+ tl -= curr->se.load.weight;
+
+ bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+ /*
* If the currently running task will sleep within
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class)
{
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap < sysctl_sched_migration_cost)
return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
schedstat_inc(p, se.nr_wakeups_affine_attempts);
tl_per_task = cpu_avg_load_per_task(this_cpu);
- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync)
- tl -= current->se.load.weight;
-
if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
- 100*(tl + p->se.load.weight) <= imbalance*load) {
+ !bad_imbalance) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
> Some still open questions after this long investigation that I'd like to
> know the answers to are:
>
> 1) Why are my 2.6.26.git results so dramatically worse than the ones Mike
> posted? I'm not sure what was different about his test setup here. The
> 2.6.22 results are pretty similar, and the fully tuned ones as well, so
> the big difference on that column bugs me.
Because those were git _with_ Peter's patch. I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.
> 2) Mike suggested a patch to 2.6.25 in this thread that backports the
> feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
> push that into 2.6.25.5? It's clearly quite useful for this load and
> therefore possibly others.
If the patch above flies, imho, that should be folded into the backport
to stable. The heart of the patch is a bugfix, so definitely stable
material. Whether to enable the debugging/tweaking knobs as well is a
different question. (I would do it, but I ain't the maintainer;)
> 3) Peter's se.waker patch is a big step forward on this workload without
> any tuning, closing a significant amount of the gap between the default
> setup and what I get with the two troublesome features turned off
> altogether. What issues might there be with pushing that into the stock
> {2.6.25|2.6.26} kernel?
(it doesn't fully address the 1:N needs, that needs much more thought)
> 4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS
> and SCHED_FEAT_WAKEUP_PREEMPT are disabled? I'd think that any attempt to
> tune those code paths would need my case for "works better when
> SYNC/PREEMPT wakeups disabled" as well as a case that works worse to
> balance modifications against.
I suspect very many, certainly any load where latency is of major
importance. Peak performance of the mysql+oltp benchmark for sure is
injured with your settings. As a really good demonstration of how
important wakeup preemption can be, try running the attached testcase,
which was distilled from a real world load and posted to lkml a few
years ago, without wakeup preemption and nailed to one cpu.
> 5) Once (4) has identified some tests cases, what else might be done to
> make the default behavior better without killing the situations it's
> intended for?
See patch :)
-Mike
[-- Attachment #2: starve.c --]
[-- Type: text/x-csrc, Size: 715 bytes --]
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
volatile unsigned long loop = 10000000;
void
handler (int n)
{
if (loop > 0)
--loop;
}
static int
child (void)
{
pid_t ppid = getppid ();
sleep (1);
while (1)
kill (ppid, SIGUSR1);
return 0;
}
int
main (int argc, char **argv)
{
pid_t child_pid;
int r;
loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
printf ("expecting to receive %lu signals\n", loop);
if ((child_pid = fork ()) == 0)
exit (child ());
signal (SIGUSR1, handler);
while (loop)
sleep (1);
r = kill (child_pid, SIGTERM);
waitpid (child_pid, NULL, 0);
return 0;
}
next prev parent reply other threads:[~2008-05-27 5:59 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-05-21 17:34 PostgreSQL pgbench performance regression in 2.6.23+ Greg Smith
2008-05-22 7:10 ` Mike Galbraith
2008-05-22 8:28 ` Dhaval Giani
2008-05-22 9:05 ` Mike Galbraith
2008-05-22 10:34 ` Mike Galbraith
2008-05-22 11:25 ` Mike Galbraith
2008-05-22 11:44 ` Peter Zijlstra
2008-05-22 12:09 ` Mike Galbraith
2008-05-22 12:24 ` Peter Zijlstra
2008-05-22 13:16 ` Mike Galbraith
2008-05-23 7:13 ` Greg Smith
2008-05-23 10:00 ` Mike Galbraith
2008-05-23 10:10 ` Ingo Molnar
2008-05-23 10:15 ` Mike Galbraith
2008-05-23 23:18 ` Greg Smith
2008-05-23 23:46 ` Mike Galbraith
2008-05-24 8:08 ` Mike Galbraith
2008-05-27 0:28 ` Greg Smith
2008-05-27 5:59 ` Mike Galbraith [this message]
2008-05-27 8:20 ` [patch] " Mike Galbraith
2008-05-27 8:35 ` Mike Galbraith
2008-06-06 5:03 ` Greg Smith
2008-06-06 6:13 ` Mike Galbraith
2008-06-07 11:38 ` Mike Galbraith
2008-06-07 12:50 ` Mike Galbraith
2008-06-07 13:07 ` Peter Zijlstra
2008-06-07 14:16 ` Mike Galbraith
2008-06-07 16:16 ` Peter Zijlstra
2008-06-07 17:56 ` Mike Galbraith
2008-06-07 13:08 ` Peter Zijlstra
2008-06-07 14:54 ` [patch part 2] " Mike Galbraith
2008-06-07 16:12 ` Peter Zijlstra
2008-06-07 17:53 ` Mike Galbraith
2008-06-07 18:19 ` Mike Galbraith
2008-05-23 13:05 ` Mike Galbraith
2008-05-23 13:35 ` Mike Galbraith
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1211867950.5505.47.camel@marge.simson.net \
--to=efault@gmx.de \
--cc=dhaval@linux.vnet.ibm.com \
--cc=gsmith@gregsmith.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=peterz@infradead.org \
--cc=vatsa@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.