All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Galbraith <efault@gmx.de>
To: Greg Smith <gsmith@gregsmith.com>
Cc: Ingo Molnar <mingo@elte.hu>,
	Peter Zijlstra <peterz@infradead.org>,
	Dhaval Giani <dhaval@linux.vnet.ibm.com>,
	lkml <linux-kernel@vger.kernel.org>,
	Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Subject: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
Date: Tue, 27 May 2008 07:59:10 +0200	[thread overview]
Message-ID: <1211867950.5505.47.camel@marge.simson.net> (raw)
In-Reply-To: <Pine.GSO.4.64.0805260851510.24771@westnet.com>

[-- Attachment #1: Type: text/plain, Size: 7103 bytes --]


On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:

> I did again get useful results here with the stock 2.6.26.git kernel and 
> default parameters using Peter's small patch to adjust se.waker.
> 
> What I found most interesting was how the results changed when I set 
> /proc/sys/kernel/sched_features = 0, without doing anything with batch 
> mode.  The default for that is 1101111111=895. What I then did was run 
> through setting each of those bits off one by one to see which feature(s) 
> were getting in the way here.  The two that mattered a lot were 895-32=863 
> (no SCHED_FEAT_SYNC_WAKEUPS ) and 895-2=893 (no 
> SCHED_FEAT_WAKEUP_PREEMPT).  Combining those two but keeping the rest of 
> the features on (895-32-2=861) actually gave the best result I've ever 
> seen here, better than with all the features disabled.  Tossing out all 
> the tests I did that didn't show anything useful, here's my table of the 
> interesting results.
> 
> Clients	.22.19	.26.git	waker	f=0	f=893	f=863	f=861
> 1	7660	11043	11041	9214	11204	9232	9433
> 2	17798	11452	16306	16916	11165	16686	16097
> 3	29612	13231	18476	24202	11348	26210	26906
> 4	25584	13053	17942	26639	11331	25094	25679
> 6	25295	12263	18472	28918	11761	30525	33297
> 8	24344	11748	19109	32730	12190	31775	35912
> 10	23963	11612	19537	31688	12331	29644	36215
> 15	23026	11414	19518	33209	13050	28651	36452
> 20	22549	11332	19029	32583	13544	25776	35707
> 30	22074	10743	18884	32447	14191	21772	33501
> 40	21495	10406	18609	31704	11017	20600	32743
> 50	20051	10534	17478	29483	14683	19949	31047
> 60	18690	9816	17467	28614	14817	18681	29576
> 
> Note that compared to earlier test runs, I replaced the 5 client case with 
> a 60 client one to get more data on the top end.  I also wouldn't pay too 
> much attention to the single client case; that one really bounces around a 
> lot on most of the kernel revs, even with me doing 5 runs and using the 
> median.

Care to give the below a whirl?  If fixes the over-enthusiastic affinity
bug in a less restrictive way.  It doesn't attempt to addresss the needs
of any particular load though, that needs more thought (tricky issue).

With default features, I get the below.

2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434

(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled.  Bottom line is that the bug is
fixed though, maximizing performance is separate issue imho) 

Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.

Signed-off-by: Mike Galbraith <efault@gmx.de>

 kernel/sched_fair.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class)
{
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 				p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
 	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync)
-		tl -= current->se.load.weight;
-
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
-			100*(tl + p->se.load.weight) <= imbalance*load) {
+			!bad_imbalance) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and


> Some still open questions after this long investigation that I'd like to 
> know the answers to are:
> 
> 1) Why are my 2.6.26.git results so dramatically worse than the ones Mike 
> posted?  I'm not sure what was different about his test setup here.  The 
> 2.6.22 results are pretty similar, and the fully tuned ones as well, so 
> the big difference on that column bugs me.

Because those were git _with_ Peter's patch.  I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.

> 2) Mike suggested a patch to 2.6.25 in this thread that backports the 
> feature for disabling SCHED_FEAT_SYNC_WAKEUPS.  Would it be reasonable to 
> push that into 2.6.25.5?  It's clearly quite useful for this load and 
> therefore possibly others.

If the patch above flies, imho, that should be folded into the backport
to stable.  The heart of the patch is a bugfix, so definitely stable
material.  Whether to enable the debugging/tweaking knobs as well is a
different question.  (I would do it, but I ain't the maintainer;)

> 3) Peter's se.waker patch is a big step forward on this workload without 
> any tuning, closing a significant amount of the gap between the default 
> setup and what I get with the two troublesome features turned off 
> altogether.  What issues might there be with pushing that into the stock 
> {2.6.25|2.6.26} kernel?

(it doesn't fully address the 1:N needs, that needs much more thought)

> 4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS 
> and SCHED_FEAT_WAKEUP_PREEMPT are disabled?  I'd think that any attempt to 
> tune those code paths would need my case for "works better when 
> SYNC/PREEMPT wakeups disabled" as well as a case that works worse to 
> balance modifications against.

I suspect very many, certainly any load where latency is of major
importance.  Peak performance of the mysql+oltp benchmark for sure is
injured with your settings.  As a really good demonstration of how
important wakeup preemption can be, try running the attached testcase,
which was distilled from a real world load and posted to lkml a few
years ago, without wakeup preemption and nailed to one cpu.

> 5) Once (4) has identified some tests cases, what else might be done to 
> make the default behavior better without killing the situations it's 
> intended for?

See patch :)

	-Mike

[-- Attachment #2: starve.c --]
[-- Type: text/x-csrc, Size: 715 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/wait.h>

volatile unsigned long loop = 10000000;

void
handler (int n)
{
  if (loop > 0)
    --loop;
}

static int
child (void)
{
  pid_t ppid = getppid ();

  sleep (1);
  while (1)
    kill (ppid, SIGUSR1);
  return 0;
}

int
main (int argc, char **argv)
{
  pid_t child_pid;
  int r;

  loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
  printf ("expecting to receive %lu signals\n", loop);

  if ((child_pid = fork ()) == 0)
    exit (child ());

  signal (SIGUSR1, handler);
  while (loop)
    sleep (1);
  r = kill (child_pid, SIGTERM);
  waitpid (child_pid, NULL, 0);
  return 0;
}

  reply	other threads:[~2008-05-27  5:59 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-21 17:34 PostgreSQL pgbench performance regression in 2.6.23+ Greg Smith
2008-05-22  7:10 ` Mike Galbraith
2008-05-22  8:28   ` Dhaval Giani
2008-05-22  9:05     ` Mike Galbraith
2008-05-22 10:34       ` Mike Galbraith
2008-05-22 11:25         ` Mike Galbraith
2008-05-22 11:44           ` Peter Zijlstra
2008-05-22 12:09             ` Mike Galbraith
2008-05-22 12:24               ` Peter Zijlstra
2008-05-22 13:16                 ` Mike Galbraith
2008-05-23  7:13                 ` Greg Smith
2008-05-23 10:00                   ` Mike Galbraith
2008-05-23 10:10                     ` Ingo Molnar
2008-05-23 10:15                       ` Mike Galbraith
2008-05-23 23:18                         ` Greg Smith
2008-05-23 23:46                           ` Mike Galbraith
2008-05-24  8:08                             ` Mike Galbraith
2008-05-27  0:28                             ` Greg Smith
2008-05-27  5:59                               ` Mike Galbraith [this message]
2008-05-27  8:20                                 ` [patch] " Mike Galbraith
2008-05-27  8:35                                   ` Mike Galbraith
2008-06-06  5:03                                 ` Greg Smith
2008-06-06  6:13                                   ` Mike Galbraith
2008-06-07 11:38                                     ` Mike Galbraith
2008-06-07 12:50                                       ` Mike Galbraith
2008-06-07 13:07                                         ` Peter Zijlstra
2008-06-07 14:16                                           ` Mike Galbraith
2008-06-07 16:16                                             ` Peter Zijlstra
2008-06-07 17:56                                               ` Mike Galbraith
2008-06-07 13:08                                       ` Peter Zijlstra
2008-06-07 14:54                                         ` [patch part 2] " Mike Galbraith
2008-06-07 16:12                                           ` Peter Zijlstra
2008-06-07 17:53                                             ` Mike Galbraith
2008-06-07 18:19                                               ` Mike Galbraith
2008-05-23 13:05                       ` Mike Galbraith
2008-05-23 13:35                         ` Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1211867950.5505.47.camel@marge.simson.net \
    --to=efault@gmx.de \
    --cc=dhaval@linux.vnet.ibm.com \
    --cc=gsmith@gregsmith.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=peterz@infradead.org \
    --cc=vatsa@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.