public inbox for linux-kernel@vger.kernel.org
From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Brendan Gregg <brendan.d.gregg@gmail.com>,
	Jeff Merkey <linux.mdb@gmail.com>
Subject: [patch] sched: Fix smp nice induced group scheduling load distribution woes
Date: Wed, 27 Apr 2016 09:09:51 +0200	[thread overview]
Message-ID: <1461740991.3622.3.camel@gmail.com> (raw)
In-Reply-To: <1461575925.3670.25.camel@gmail.com>

On Mon, 2016-04-25 at 11:18 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> > On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> > 
> > > The bugs they found seem real, and their analysis is great
> > > (although
> > > using visualizations to find and fix scheduler bugs isn't new),
> > > and it
> > > would be good to see these fixed. However, it would also be
> > > useful to
> > > double check how widespread these issues really are. I suspect
> > > many on
> > > this list can test these patches in different environments.
> > 
> > Part of it sounded to me very much like they're meeting and
> > "fixing"
> > SMP group fairness...
> 
> Ew, NUMA boxen look like they could use a hug or two.  Add a group of
> one hog to compete with a box wide kbuild, ~lose a node.

sched: Fix smp nice induced group scheduling load distribution woes

On even a modest sized NUMA box, any load that wants to scale
is essentially reduced to SCHED_IDLE class by smp nice scaling.
Limit niceness to prevent cramming a box-wide load into too
small a space.  Since niceness also affects latency, give the
user the option to completely disable box-wide group fairness.
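The SMP_NICE_GROUPS switch added below is an ordinary scheduler feature
bit, so (assuming CONFIG_SCHED_DEBUG and a mounted debugfs, as on a
typical development box) it can be flipped at runtime:

```shell
# Disable box-wide group fairness entirely (root required)
echo NO_SMP_NICE_GROUPS > /sys/kernel/debug/sched_features

# Restore the default behavior
echo SMP_NICE_GROUPS > /sys/kernel/debug/sched_features
```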

time make -j192 modules on a 4 node NUMA box..

Before:
root cgroup
real    1m6.987s      1.00

cgroup vs 1 group of 1 hog
real    1m20.871s     1.20

cgroup vs 2 groups of 1 hog
real    1m48.803s     1.62

Each single-task group receives a ~full socket because the kbuild
has become an essentially massless object that fits in practically
no space at all.  Near-perfect math led directly to far-from-good
scaling/performance, a "perfect is the enemy of good" poster child.

After the "let's just be nice enough instead" adjustment, single-task
groups continued to sustain >99% utilization while competing with
the box-sized kbuild.

cgroup vs 2 groups of 1 hog
real    1m8.151s     1.01  192/190=1.01

Good enough works better.. nearly perfectly in this case.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 kernel/sched/fair.c     |   22 ++++++++++++++++++----
 kernel/sched/features.h |    3 +++
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2464,17 +2464,28 @@ static inline long calc_tg_weight(struct
 
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-	long tg_weight, load, shares;
+	long tg_weight, load, shares, min_shares = MIN_SHARES;
 
-	tg_weight = calc_tg_weight(tg, cfs_rq);
+	if (!sched_feat(SMP_NICE_GROUPS))
+		return tg->shares;
+
+	/*
+	 * Bound niceness to prevent everything that wants to scale from
+	 * essentially becoming SCHED_IDLE on multi/large socket boxen,
+	 * screwing up our ability to distribute load properly and/or
+	 * deliver acceptable latencies.
+	 */
+	tg_weight = min_t(long, calc_tg_weight(tg, cfs_rq), sched_prio_to_weight[10]);
 	load = cfs_rq->load.weight;
 
 	shares = (tg->shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
+	if (tg->shares > sched_prio_to_weight[20])
+		min_shares = sched_prio_to_weight[20];
+	if (shares < min_shares)
+		shares = min_shares;
 	if (shares > tg->shares)
 		shares = tg->shares;
 
@@ -2517,6 +2528,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
 		return;
+#else
+	if (!sched_feat(SMP_NICE_GROUPS) && se->load.weight == tg->shares)
+		return;
 #endif
 	shares = calc_cfs_shares(cfs_rq, tg);
 
Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -69,3 +69,6 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+SCHED_FEAT(SMP_NICE_GROUPS, true)
+#endif


Thread overview: 9+ messages
2016-04-23 18:20 [RFC] The Linux Scheduler: a Decade of Wasted Cores Report Jeff Merkey
2016-04-24  1:38 ` Brendan Gregg
2016-04-24  7:05   ` Mike Galbraith
2016-04-25  9:18     ` Mike Galbraith
2016-04-27  7:09       ` Mike Galbraith [this message]
2016-04-28  9:11         ` [patch] sched: Fix smp nice induced group scheduling load distribution woes Peter Zijlstra
2016-04-28 12:29           ` Mike Galbraith
2016-04-25  9:34   ` [RFC] The Linux Scheduler: a Decade of Wasted Cores Report Peter Zijlstra
2016-04-25 17:54     ` Rik van Riel
