Re: [PATCH 3/3] sched: Disable affine wakeups by default

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Mike Galbraith <efault@gmx.de>
To: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@elte.hu, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] sched: Disable affine wakeups by default
Date: Sun, 25 Oct 2009 23:04:47 +0100	[thread overview]
Message-ID: <1256508287.17306.14.camel@marge.simson.net> (raw)
In-Reply-To: <20091025123319.2b76bf69@infradead.org>

On Sun, 2009-10-25 at 12:33 -0700, Arjan van de Ven wrote:
> On Sun, 25 Oct 2009 18:38:09 +0100
> Mike Galbraith <efault@gmx.de> wrote:
> > > > Even if you're sharing a cache, there are reasons to wake
> > > > affine.  If the wakee can preempt the waker while it's still
> > > > eligible to run, wakee not only eats toasty warm data, it can
> > > > hand the cpu back to the waker so it can make more and repeat
> > > > this procedure for a while without someone else getting in
> > > > between, and trashing cache. 
> > > 
> > > and on the flipside, and this is the workload I'm looking at,
> > > this is halving your performance roughly due to one core being
> > > totally busy while the other one is idle.
> > 
> > Yeah, the "one pgsql+oltp pair" in the numbers I posted show that
> > problem really well.  If you can hit an idle shared cache at low load,
> > go for it every time. 
> 
> sadly the current code does not do this ;(
> my patch might be too big an axe for it, but it does solve this part ;)

The below fixed up pgsql+oltp low end, but has negative effect on high
end.  Must be some stuttering going on.

> I'll keep digging to see if we can do a more micro-incursion.
> 
> > Hm.  That looks like a bug, but after any task has scheduled a few
> > times, if it looks like a synchronous task, it'll glue itself to it's
> > waker's runqueue regardless.  Initial wakeup may disperse, but it will
> > come back if it's not overlapping.
> 
> the problem is the "synchronous to WHAT" question.
> It may be synchronous to the disk for example; in the testcase I'm
> looking at, we get "send message to X. do some more code. hit a page
> cache miss and do IO" quite a bit.

Hm.  Yes, disk could be problematic. It's going to be exactly what the
affinity code looks for, you wake somebody, and almost immediately go to
sleep.  OTOH, even a house keeper threads make warm data.

> > > The numbers you posted are for a database, and only measure
> > > throughput. There's more to the world than just databases /
> > > throughput-only computing, and I'm trying to find low impact ways
> > > to reduce the latency aspect of things. One obvious candidate is
> > > hyperthreading/SMT where it IS basically free to switch to a
> > > sibbling, so wake-affine does not really make sense there.
> > 
> > It's also almost free on my Q6600 if we aimed for idle shared cache.
> 
> yeah multicore with shared cache falls for me in the same bucket.

Anyone with a non-shared cache multicore would be most unhappy with my
little test hack.
 
> > I agree fully that affinity decisions could be more perfect than they
> > are.  Getting it wrong is very expensive either way.
> 
> Looks like we agree on a key principle:
> If there is a free cpu "close enough" (SMT or MC basically), the
> wakee should just run on that. 
> 
> we may not agree on what to do if there's no completely free logical
> cpu, but a much lighter loaded one instead.
> but first we need to let code speak ;)

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10013.90   18526.84   34900.38   34420.14   33069.83   32083.40   30578.30   28010.71   25605.47 3x avg
tip+         10071.16   18498.33   34697.17   34275.20   32761.96   31657.10   30223.70   27363.50   24698.71
              9971.57   18290.17   34632.46   34204.59   32588.94   31513.19   30081.51   27504.66   24832.24
              9884.04   18502.26   34650.08   34250.13   32707.81   31566.86   29954.19   27417.09   24811.75
             

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          13907.85   27135.87   52951.98   52514.04   51742.52   50705.43   49947.97   48374.19   46227.94 3x avg
tip+         15163.56   28882.70   52374.32   52469.79   51739.79   50602.02   49827.18   48029.84   46191.90
             15258.65   28778.77   52716.46   52405.32   51434.21   50440.66   49718.89   48082.22   46124.56
             15278.02   28178.55   52815.82   52609.98   51729.17   50652.10   49800.19   48126.95   46286.58


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37087a7..fa534f0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1374,6 +1374,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
+		int level = tmp->level;
+
 		/*
 		 * If power savings logic is enabled for a domain, see if we
 		 * are not overloaded, if so, don't balance wider.
@@ -1398,11 +1400,28 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 				want_sd = 0;
 		}
 
+		/*
+		 * look for an idle shared cache before looking at last CPU.
+		 */
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+				(level == SD_LV_SIBLING || level == SD_LV_MC)) {
+			int i;
 
+			for_each_cpu(i, sched_domain_span(tmp)) {
+				if (!cpu_rq(i)->cfs.nr_running) {
+					affine_sd = tmp;
+					want_affine = 0;
+					cpu = i;
+				}
+			}
+		} else if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			affine_sd = tmp;
 			want_affine = 0;
+
+			if ((level == SD_LV_SIBLING || level == SD_LV_MC) &&
+					!cpu_rq(prev_cpu)->cfs.nr_running)
+				cpu = prev_cpu;
 		}
 
 		if (!want_sd && !want_affine)

next prev parent reply	other threads:[~2009-10-25 22:04 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-10-24 19:58 [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Arjan van de Ven
2009-10-24 20:04 ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Arjan van de Ven
2009-10-24 20:07   ` [PATCH 3/3] sched: Disable affine wakeups by default Arjan van de Ven
2009-10-25  6:55     ` Mike Galbraith
2009-10-25 16:51       ` Arjan van de Ven
2009-10-25 17:38         ` Mike Galbraith
2009-10-25 19:33           ` Arjan van de Ven
2009-10-25 22:04             ` Mike Galbraith [this message]
2009-10-26  1:53               ` Peter Zijlstra
2009-10-26  4:38                 ` Mike Galbraith
2009-10-26  4:52                   ` Arjan van de Ven
2009-10-26  5:08                     ` Mike Galbraith
2009-10-26  5:36                       ` Arjan van de Ven
2009-10-26  5:47                         ` Mike Galbraith
2009-10-26  5:57                           ` Mike Galbraith
2009-10-26  7:01                             ` Ingo Molnar
2009-10-26  7:05                               ` Arjan van de Ven
2009-10-26 11:33                                 ` Suresh Siddha
2009-11-10 21:59                         ` Peter Zijlstra
2009-11-11  6:01                           ` Arjan van de Ven
2009-10-27 14:35                 ` Mike Galbraith
2009-10-28  7:25                   ` Mike Galbraith
2009-10-28 18:36                     ` Mike Galbraith
2009-11-04 19:33                   ` [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair() tip-bot for Mike Galbraith
2009-11-04 20:37                     ` Mike Galbraith
2009-11-04 21:41                       ` Mike Galbraith
2009-11-05  9:30                     ` Ingo Molnar
2009-11-05  9:57                       ` Mike Galbraith
2009-11-05 10:00                         ` Mike Galbraith
2009-11-06  7:09                         ` [tip:sched/core] sched: Fix affinity logic " tip-bot for Mike Galbraith
2009-10-26  5:21             ` [PATCH 3/3] sched: Disable affine wakeups by default Mike Galbraith
2009-10-25  8:01     ` Peter Zijlstra
2009-10-25  8:01   ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Peter Zijlstra
2009-10-25 11:48     ` Peter Zijlstra
2009-10-25  8:03 ` [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Peter Zijlstra

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:37087a7 dfblob:fa534f0 )
 OR (
bs:"Re: [PATCH 3/3] sched: Disable affine wakeups by default" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1256508287.17306.14.camel@marge.simson.net \
    --to=efault@gmx.de \
    --cc=arjan@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox