From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
Brendan Gregg <brendan.d.gregg@gmail.com>,
Jeff Merkey <linux.mdb@gmail.com>
Subject: [patch] sched: Fix smp nice induced group scheduling load distribution woes
Date: Wed, 27 Apr 2016 09:09:51 +0200
Message-ID: <1461740991.3622.3.camel@gmail.com>
In-Reply-To: <1461575925.3670.25.camel@gmail.com>
On Mon, 2016-04-25 at 11:18 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> > On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> >
> > > The bugs they found seem real, and their analysis is great (although
> > > using visualizations to find and fix scheduler bugs isn't new), and it
> > > would be good to see these fixed. However, it would also be useful to
> > > double check how widespread these issues really are. I suspect many on
> > > this list can test these patches in different environments.
> >
> > Part of it sounded to me very much like they're meeting and "fixing"
> > SMP group fairness...
>
> Ew, NUMA boxen look like they could use a hug or two. Add a group of
> one hog to compete with a box wide kbuild, ~lose a node.
sched: Fix smp nice induced group scheduling load distribution woes
On even a modest sized NUMA box any load that wants to scale
is essentially reduced to SCHED_IDLE class by smp nice scaling.
Limit niceness to prevent cramming a box wide load into a too
small space. Given niceness affects latency, give the user the
option to completely disable box wide group fairness as well.
time make -j192 modules on a 4 node NUMA box..

Before:
root cgroup
real    1m6.987s      1.00

cgroup vs 1 group of 1 hog
real    1m20.871s     1.20

cgroup vs 2 groups of 1 hog
real    1m48.803s     1.62
Each single task group receives a ~full socket because the kbuild
has become an essentially massless object that fits in practically
no space at all. Near perfect math led directly to far from good
scaling/performance, a "Perfect is the enemy of good" poster child.
After "Let's just be nice enough instead" adjustment, single task
groups continued to sustain >99% utilization while competing with
the box sized kbuild.
cgroup vs 2 groups of 1 hog
real 1m8.151s 1.01 192/190=1.01
Good enough works better.. nearly perfectly in this case.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
kernel/sched/fair.c | 22 ++++++++++++++++++----
kernel/sched/features.h | 3 +++
2 files changed, 21 insertions(+), 4 deletions(-)
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2464,17 +2464,28 @@ static inline long calc_tg_weight(struct
static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
{
- long tg_weight, load, shares;
+ long tg_weight, load, shares, min_shares = MIN_SHARES;
- tg_weight = calc_tg_weight(tg, cfs_rq);
+ if (!sched_feat(SMP_NICE_GROUPS))
+ return tg->shares;
+
+ /*
+ * Bound niceness to prevent everything that wants to scale from
+ * essentially becoming SCHED_IDLE on multi/large socket boxen,
+ * screwing up our ability to distribute load properly and/or
+ * deliver acceptable latencies.
+ */
+ tg_weight = min_t(long, calc_tg_weight(tg, cfs_rq), sched_prio_to_weight[10]);
load = cfs_rq->load.weight;
shares = (tg->shares * load);
if (tg_weight)
shares /= tg_weight;
- if (shares < MIN_SHARES)
- shares = MIN_SHARES;
+ if (tg->shares > sched_prio_to_weight[20])
+ min_shares = sched_prio_to_weight[20];
+ if (shares < min_shares)
+ shares = min_shares;
if (shares > tg->shares)
shares = tg->shares;
@@ -2517,6 +2528,9 @@ static void update_cfs_shares(struct cfs
#ifndef CONFIG_SMP
if (likely(se->load.weight == tg->shares))
return;
+#else
+ if (!sched_feat(SMP_NICE_GROUPS) && se->load.weight == tg->shares)
+ return;
#endif
shares = calc_cfs_shares(cfs_rq, tg);
Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -69,3 +69,6 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(ATTACH_AGE_LOAD, true)
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+SCHED_FEAT(SMP_NICE_GROUPS, true)
+#endif