* [RFC] The Linux Scheduler: a Decade of Wasted Cores Report
From: Jeff Merkey @ 2016-04-23 18:20 UTC
To: LKML

Interesting read.

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

"... The Linux kernel scheduler has deficiencies that prevent a
multicore system from making proper use of all cores for heavily
multithreaded loads, according to a lecture and paper delivered
earlier this month at the EuroSys '16 conference in London, ..."

Any plans to incorporate these fixes?

Jeff

* Re: [RFC] The Linux Scheduler: a Decade of Wasted Cores Report
From: Brendan Gregg @ 2016-04-24 1:38 UTC
To: Jeff Merkey; +Cc: LKML

On Sat, Apr 23, 2016 at 11:20 AM, Jeff Merkey <linux.mdb@gmail.com> wrote:
>
> Interesting read.
>
> http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
>
> "... The Linux kernel scheduler has deficiencies that prevent a
> multicore system from making proper use of all cores for heavily
> multithreaded loads, according to a lecture and paper delivered
> earlier this month at the EuroSys '16 conference in London, ..."
>
> Any plans to incorporate these fixes?

While this paper analyzes and proposes fixes for four bugs, it has
been getting a lot of attention for broader claims about Linux being
fundamentally broken:

"As a central part of resource management, the OS thread scheduler
must maintain the following, simple, invariant: make sure that ready
threads are scheduled on available cores. As simple as it may seem, we
found that this invariant is often broken in Linux. Cores may stay
idle for seconds while ready threads are waiting in runqueues."

It then states that the problems they found in the Linux scheduler
cause degradations of "13-24% for typical Linux workloads".

Their proof of concept patches are online[1]. I tested them and saw 0%
improvements on the systems I tested, for some simple workloads[2]. I
tested 1 and 2 node NUMA, as that is typical for my employer (Netflix,
and our tens of thousands of Linux instances in the AWS/EC2 cloud),
even though I wasn't expecting any difference on 1 node. I've used
synthetic workloads so far.

I should note that I do check run queue latency, having hit scheduler
bugs in the past (especially on other kernels), and haven't noticed the
issues they describe on our systems, for various workloads. I've also
written a new tool for this (runqlat using bcc/BPF[3]) to print run
queue latency as a histogram.

The bugs they found seem real, and their analysis is great (although
using visualizations to find and fix scheduler bugs isn't new), and it
would be good to see these fixed. However, it would also be useful to
double check how widespread these issues really are. I suspect many on
this list can test these patches in different environments.

Have we really had a decade of wasted cores, losing 13-24% for typical
Linux workloads? I don't think it's that widespread, but I'm only
testing one environment.

Brendan

[1] https://github.com/jplozi/wastedcores
[2] https://gist.github.com/brendangregg/588b1d29bcb952141d50ccc0e005fcf8
[3] https://github.com/iovisor/bcc/blob/master/tools/runqlat_example.txt

* Re: [RFC] The Linux Scheduler: a Decade of Wasted Cores Report
From: Mike Galbraith @ 2016-04-24 7:05 UTC
To: Brendan Gregg, Jeff Merkey; +Cc: LKML

On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:

> The bugs they found seem real, and their analysis is great (although
> using visualizations to find and fix scheduler bugs isn't new), and it
> would be good to see these fixed. However, it would also be useful to
> double check how widespread these issues really are. I suspect many on
> this list can test these patches in different environments.

Part of it sounded to me very much like they're meeting and "fixing"
SMP group fairness. Take the worst case, a threads=cores group of
synchronized threads passing checkpoints in lockstep competing with a
group of one hog: synchronized threads that have a core to themselves
must wait (busy as they mentioned, or sleep) for the straggler thread
whose fair share is a small fraction (1/65 for 64 core box) of a core
to catch up before the group as a unit can proceed. Without SMP
fairness, groups intersecting compete as equals at any given
intersection (assuming shares have not been twiddled), thus a fully
synchronized load can utilize up to 50% of a box [1], whereas with SMP
fairness, worst case load slams head on into a one core wall.

Pondering the progress dependency thingy a bit, seems some degree of
that is likely, thus it logically follows that SMP fairness is likely
to find some non zero delta to multiply by box size.

This came up fairly recently, with a university math department admin
grumbling that cranky professors were beating him bloody. Testing, I
couldn't confirm exactly what he was grumbling about (couldn't figure
out exactly what that was actually), but thinking about it combined
with what I was seeing made me too want to "fix" it by smacking it
squarely between the eyes with my BFH. Turned out that it had grown a
wart though, isn't nearly as bad in the real world (defined as
measuring random generic stuff on my little box;) as idle pondering,
and measurement of slightly dinged up code had indicated.

Like everything else, it cuts both ways.

	-Mike

[1] IOW do NOT run highly specialized load in generic environment, it
is guaranteed to either suck rock or suck gigantic frick'n boulders.

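The 1/65 and 50% figures in Mike's worst case can be read as plain weight arithmetic. This is a sketch under assumptions that are not spelled out in the mail: both groups have the same shares S, the 64 synchronized threads sit one per core so their group's weight is split 64 ways, and the hog's group keeps its full weight on the one core it shares with the straggler.

```latex
\begin{align*}
w_{\text{sync}} &= \frac{S}{64}, \qquad w_{\text{hog}} = S
  && \text{(weights on the shared core)} \\
f_{\text{straggler}} &= \frac{S/64}{S/64 + S} = \frac{1}{65}
  && \text{(straggler's slice of that core)} \\
U_{\text{SMP fair}} &= 64 \cdot \frac{1}{65} \approx 0.98\ \text{cores}
  && \text{(the ``one core wall'')} \\
U_{\text{no SMP fairness}} &= 64 \cdot \frac{1}{2} = 32\ \text{cores}
  && \text{(50\% of a 64 core box)}
\end{align*}
```

The last two lines assume the whole lockstep group can only advance as fast as its slowest thread, so every thread does useful work for the same fraction of time as the straggler.
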
* Re: [RFC] The Linux Scheduler: a Decade of Wasted Cores Report
From: Mike Galbraith @ 2016-04-25 9:18 UTC
To: Brendan Gregg, Jeff Merkey; +Cc: LKML, Peter Zijlstra

On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> >
> > The bugs they found seem real, and their analysis is great (although
> > using visualizations to find and fix scheduler bugs isn't new), and it
> > would be good to see these fixed. However, it would also be useful to
> > double check how widespread these issues really are. I suspect many on
> > this list can test these patches in different environments.
>
> Part of it sounded to me very much like they're meeting and "fixing"
> SMP group fairness...

Ew, NUMA boxen look like they could use a hug or two. Add a group of
one hog to compete with a box wide kbuild, ~lose a node.

Master.today, 4 node box, make -j 192 modules

root group                     real 1m6.987s    1.00
cgroup vs 1 group of 1 hog     real 1m20.871s   1.20
cgroup vs 2 groups of 1 hog    real 1m48.803s   1.62

	-Mike

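For reference, the right-hand column of the timings above is each wall-clock time divided by the root group baseline. Recomputing from the posted times (no new measurements) gives:

```latex
\begin{align*}
t_0 &= 1\,\text{m}\,6.987\,\text{s} = 66.987\ \text{s} \\
\frac{80.871}{66.987} &\approx 1.207 && \text{(1 group of 1 hog, posted as 1.20)} \\
\frac{108.803}{66.987} &\approx 1.624 && \text{(2 groups of 1 hog, posted as 1.62)}
\end{align*}
```
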
* [patch] sched: Fix smp nice induced group scheduling load distribution woes
From: Mike Galbraith @ 2016-04-27 7:09 UTC
To: Peter Zijlstra; +Cc: LKML, Brendan Gregg, Jeff Merkey

On Mon, 2016-04-25 at 11:18 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> > On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> >
> > > The bugs they found seem real, and their analysis is great
> > > (although using visualizations to find and fix scheduler bugs
> > > isn't new), and it would be good to see these fixed. However, it
> > > would also be useful to double check how widespread these issues
> > > really are. I suspect many on this list can test these patches in
> > > different environments.
> >
> > Part of it sounded to me very much like they're meeting and
> > "fixing" SMP group fairness...
>
> Ew, NUMA boxen look like they could use a hug or two. Add a group of
> one hog to compete with a box wide kbuild, ~lose a node.

sched: Fix smp nice induced group scheduling load distribution woes

On even a modest sized NUMA box any load that wants to scale
is essentially reduced to SCHED_IDLE class by smp nice scaling.
Limit niceness to prevent cramming a box wide load into a too
small space. Given niceness affects latency, give the user the
option to completely disable box wide group fairness as well.

time make -j192 modules on a 4 node NUMA box..

Before:
root cgroup                    real 1m6.987s    1.00
cgroup vs 1 group of 1 hog     real 1m20.871s   1.20
cgroup vs 2 groups of 1 hog    real 1m48.803s   1.62

Each single task group receives a ~full socket because the kbuild
has become an essentially massless object that fits in practically
no space at all. Near perfect math led directly to far from good
scaling/performance, a "Perfect is the enemy of good" poster child.

After "Let's just be nice enough instead" adjustment, single task
groups continued to sustain >99% utilization while competing with
the box sized kbuild.

cgroup vs 2 groups of 1 hog    real 1m8.151s    1.01  192/190=1.01

Good enough works better.. nearly perfectly in this case.

Signed-off-by: Mike Galbraith <umgwanakikbuit@gmail.com>
---
 kernel/sched/fair.c     |   22 ++++++++++++++++++----
 kernel/sched/features.h |    3 +++
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2464,17 +2464,28 @@ static inline long calc_tg_weight(struct
 
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-	long tg_weight, load, shares;
+	long tg_weight, load, shares, min_shares = MIN_SHARES;
 
-	tg_weight = calc_tg_weight(tg, cfs_rq);
+	if (!sched_feat(SMP_NICE_GROUPS))
+		return tg->shares;
+
+	/*
+	 * Bound niceness to prevent everything that wants to scale from
+	 * essentially becoming SCHED_IDLE on multi/large socket boxen,
+	 * screwing up our ability to distribute load properly and/or
+	 * deliver acceptable latencies.
+	 */
+	tg_weight = min_t(long, calc_tg_weight(tg, cfs_rq), sched_prio_to_weight[10]);
 	load = cfs_rq->load.weight;
 
 	shares = (tg->shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
+	if (tg->shares > sched_prio_to_weight[20])
+		min_shares = sched_prio_to_weight[20];
+	if (shares < min_shares)
+		shares = min_shares;
 	if (shares > tg->shares)
 		shares = tg->shares;
 
@@ -2517,6 +2528,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
 		return;
+#else
+	if (!sched_feat(SMP_NICE_GROUPS) && se->load.weight == tg->shares)
+		return;
 #endif
 	shares = calc_cfs_shares(cfs_rq, tg);
 
Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -69,3 +69,6 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+SCHED_FEAT(SMP_NICE_GROUPS, true)
+#endif

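To see what the tg_weight bound in the patch above does to the numbers, here is a minimal user-space sketch of the share arithmetic before and after the change. It is not kernel code. The function names and the scenario (a group with 192 nice-0 tasks, one per CPU, default shares of 1024) are made up for illustration, and the constants are assumed from kernels of that era: MIN_SHARES = 2, sched_prio_to_weight[20] = 1024 (nice 0) and sched_prio_to_weight[10] = 9548 (nice -10). The SMP_NICE_GROUPS feature check is left out.

```c
#include <assert.h>

/* Assumed values, see the paragraph above; not taken from the patch text. */
#define SKETCH_MIN_SHARES      2L     /* kernel MIN_SHARES, (1UL << 1) */
#define SKETCH_NICE_0_WEIGHT   1024L  /* sched_prio_to_weight[20] */
#define SKETCH_NICE_M10_WEIGHT 9548L  /* sched_prio_to_weight[10] */

/* Per-CPU group share as computed before the patch. */
static long sketch_shares_before(long tg_shares, long load, long tg_weight)
{
	long shares;

	assert(tg_shares > 0 && load >= 0 && tg_weight >= 0);

	shares = tg_shares * load;
	if (tg_weight)
		shares /= tg_weight;

	/* Clamp into [MIN_SHARES, tg_shares]. */
	if (shares < SKETCH_MIN_SHARES)
		shares = SKETCH_MIN_SHARES;
	if (shares > tg_shares)
		shares = tg_shares;
	return shares;
}

/* Same computation with the patch's bound on tg_weight and raised floor. */
static long sketch_shares_after(long tg_shares, long load, long tg_weight)
{
	long shares, min_shares = SKETCH_MIN_SHARES;

	assert(tg_shares > 0 && load >= 0 && tg_weight >= 0);

	/* The divisor is never allowed to exceed one nice -10 task. */
	if (tg_weight > SKETCH_NICE_M10_WEIGHT)
		tg_weight = SKETCH_NICE_M10_WEIGHT;

	shares = tg_shares * load;
	if (tg_weight)
		shares /= tg_weight;

	/* Groups with more than default shares get a nice-0 floor. */
	if (tg_shares > SKETCH_NICE_0_WEIGHT)
		min_shares = SKETCH_NICE_0_WEIGHT;
	if (shares < min_shares)
		shares = min_shares;
	if (shares > tg_shares)
		shares = tg_shares;
	return shares;
}
```

With tg_shares = 1024, a per-CPU load of 1024 and a group-wide weight of 192 * 1024, the unpatched arithmetic yields a per-CPU weight of 5, which is in the same range as the weight of 3 used for SCHED_IDLE tasks. The bounded version yields 109 for the same inputs.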
* Re: [patch] sched: Fix smp nice induced group scheduling load distribution woes
From: Peter Zijlstra @ 2016-04-28 9:11 UTC
To: Mike Galbraith; +Cc: LKML, Brendan Gregg, Jeff Merkey

On Wed, Apr 27, 2016 at 09:09:51AM +0200, Mike Galbraith wrote:
> On even a modest sized NUMA box any load that wants to scale
> is essentially reduced to SCHED_IDLE class by smp nice scaling.
> Limit niceness to prevent cramming a box wide load into a too
> small space. Given niceness affects latency, give the user the
> option to completely disable box wide group fairness as well.

Have you tried the (obvious)?

I suppose we really should just do this (and yuyang's cleanup patches I
suppose). Nobody has ever been able to reproduce those increased power
usage claims and Google is running with this enabled.

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69da6fcaa0e8..968f573413de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -53,7 +53,7 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
  * increased costs.
  */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
+#ifdef CONFIG_64BIT
 # define SCHED_LOAD_RESOLUTION	10
 # define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
 # define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)

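The one-line change above switches on 10 extra bits of load resolution for 64-bit kernels. As a rough user-space illustration of what those bits buy, the sketch below mirrors the two macros from the diff as functions and runs the same integer division at both resolutions. The helper names and the 192-task scenario are hypothetical, and the example only shows the rounding effect. It does not claim to reproduce the load distribution problem measured earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LOAD_RESOLUTION 10

/* User-space stand-ins for scale_load() and scale_load_down(). */
static int64_t sketch_scale_load(int64_t w)
{
	return w << SKETCH_LOAD_RESOLUTION;
}

static int64_t sketch_scale_load_down(int64_t w)
{
	return w >> SKETCH_LOAD_RESOLUTION;
}

/*
 * Slice of a group's shares that lands on one CPU, using the same
 * truncating integer division as the share calculation in fair.c.
 */
static int64_t sketch_per_cpu_share(int64_t tg_shares, int64_t load,
				    int64_t tg_weight)
{
	assert(tg_weight > 0);
	return tg_shares * load / tg_weight;
}
```

For a group of 192 nice-0 tasks with default shares, the exact per-CPU slice is 1024/192, about 5.33. At low resolution `sketch_per_cpu_share(1024, 1024, 192 * 1024)` truncates that to 5, an error of roughly 6%. With every input passed through `sketch_scale_load()` first, the result is 5461 out of 1048576, which is within a small fraction of a percent of 1/192.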
* Re: [patch] sched: Fix smp nice induced group scheduling load distribution woes
From: Mike Galbraith @ 2016-04-28 12:29 UTC
To: Peter Zijlstra; +Cc: LKML, Brendan Gregg, Jeff Merkey

On Thu, 2016-04-28 at 11:11 +0200, Peter Zijlstra wrote:
> On Wed, Apr 27, 2016 at 09:09:51AM +0200, Mike Galbraith wrote:
> > On even a modest sized NUMA box any load that wants to scale
> > is essentially reduced to SCHED_IDLE class by smp nice scaling.
> > Limit niceness to prevent cramming a box wide load into a too
> > small space. Given niceness affects latency, give the user the
> > option to completely disable box wide group fairness as well.
>
> Have you tried the (obvious)?

Duh, nope.

> I suppose we really should just do this (and yuyang's cleanup patches I
> suppose). Nobody has ever been able to reproduce those increased power
> usage claims and Google is running with this enabled.

Yup, works, and you don't have to carefully blink as you skim past it.

> ---
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 69da6fcaa0e8..968f573413de 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -53,7 +53,7 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
>   * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
>   * increased costs.
>   */
> -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
> +#ifdef CONFIG_64BIT
>  # define SCHED_LOAD_RESOLUTION	10
>  # define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
>  # define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)

* Re: [RFC] The Linux Scheduler: a Decade of Wasted Cores Report
From: Peter Zijlstra @ 2016-04-25 9:34 UTC
To: Brendan Gregg; +Cc: Jeff Merkey, LKML, Mike Galbraith, Ingo Molnar

On Sat, Apr 23, 2016 at 06:38:25PM -0700, Brendan Gregg wrote:
> On Sat, Apr 23, 2016 at 11:20 AM, Jeff Merkey <linux.mdb@gmail.com> wrote:
> >
> > Interesting read.
> >
> > http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
> >
> > "... The Linux kernel scheduler has deficiencies that prevent a
> > multicore system from making proper use of all cores for heavily
> > multithreaded loads, according to a lecture and paper delivered
> > earlier this month at the EuroSys '16 conference in London, ..."
> >
> > Any plans to incorporate these fixes?

No; their patches are completely butchering things.

Also, I don't think I agree with some of their analysis. Sadly the paper
doesn't provide enough detail to fully reproduce things. Nor have I had
time to really look into things yet.

I was only made aware of this paper last week -- it was so good of these
here folks to contact me... oh wait.

> While this paper analyzes and proposes fixes for four bugs, it has
> been getting a lot of attention for broader claims about Linux being
> fundamentally broken:
>
> "As a central part of resource management, the OS thread scheduler
> must maintain the following, simple, invariant: make sure that ready
> threads are scheduled on available cores.

This is actually debatable. This is a global problem, therefore it is
expensive. It can take more work to find a runnable task than we would
have been idle for in the first place.

> As simple as it may seem, we
> found that this invariant is often broken in Linux. Cores may stay
> idle for seconds while ready threads are waiting in runqueues."

Right, obviously seconds is undesirable.

> Then states that the problems in the Linux scheduler that they found
> cause degradations of "13-24% for typical Linux workloads".
>
> Their proof of concept patches are online[1]. I tested them and saw 0%
> improvements on the systems I tested, for some simple workloads[2]. I
> tested 1 and 2 node NUMA, as that is typical for my employer (Netflix,
> and our tens of thousands of Linux instances in the AWS/EC2 cloud),
> even though I wasn't expecting any difference on 1 node. I've used
> synthetic workloads so far.

So their setup uses a bigger (not fully connected) NUMA topology, and
I'm not entirely sure how much of their problems are due to that, but at
least one of them is.

Such boxes are fairly rare.

In any case, I'll get to it at some point...

* Re: [RFC] The Linux Scheduler: a Decade of Wasted Cores Report
From: Rik van Riel @ 2016-04-25 17:54 UTC
To: Peter Zijlstra, Brendan Gregg
Cc: Jeff Merkey, LKML, Mike Galbraith, Ingo Molnar

On Mon, 2016-04-25 at 11:34 +0200, Peter Zijlstra wrote:
> On Sat, Apr 23, 2016 at 06:38:25PM -0700, Brendan Gregg wrote:
> >
> > Their proof of concept patches are online[1]. I tested them and saw
> > 0% improvements on the systems I tested, for some simple
> > workloads[2]. I tested 1 and 2 node NUMA, as that is typical for my
> > employer (Netflix, and our tens of thousands of Linux instances in
> > the AWS/EC2 cloud), even though I wasn't expecting any difference on
> > 1 node. I've used synthetic workloads so far.
>
> So their setup uses a bigger (not fully connected) NUMA topology, and
> I'm not entirely sure how much of their problems are due to that, but
> at least one of them is.
>
> Such boxes are fairly rare.

Their proposed fix, of making sure we build all 8 sched groups with
5 nodes each in them, seems a little bit roundabout when compared
with a simpler alternative, though.

When dealing with a NUMA_GLUELESS_MESH topology, we should simply
not build any sched domains with multiple nodes inside them, except
for the top level domain that contains all the nodes.

At that point, we will balance between threads, inside each core,
and between all nodes, without running into those pointless (and
potentially harmful) intermediate sched domains.

-- 
All Rights Reversed.

Thread overview: 9+ messages

2016-04-23 18:20 [RFC] The Linux Scheduler: a Decade of Wasted Cores Report Jeff Merkey
2016-04-24  1:38 ` Brendan Gregg
2016-04-24  7:05   ` Mike Galbraith
2016-04-25  9:18     ` Mike Galbraith
2016-04-27  7:09       ` [patch] sched: Fix smp nice induced group scheduling load distribution woes Mike Galbraith
2016-04-28  9:11         ` Peter Zijlstra
2016-04-28 12:29           ` Mike Galbraith
2016-04-25  9:34   ` [RFC] The Linux Scheduler: a Decade of Wasted Cores Report Peter Zijlstra
2016-04-25 17:54     ` Rik van Riel