* [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 4:42 UTC
  To: linux-kernel
  Cc: morten.rasmussen, mingo, peterz, george.mccollister, ktkhai, umgwanakikbuti

Currently sync wakeups from the wake_affine code cannot work as
designed, because the task doing the sync wakeup from the target
cpu will block its wakee from selecting that cpu.

This is despite the fact that whether or not the wakeup is sync
determines whether or not we want to do an affine wakeup...

This patch allows sync wakeups to pick the waker's cpu. Whether or
not this is the right thing to do remains to be seen, but it does
allow us to verify whether or not the wake_affine strategy of always
doing affine wakeups and only disabling them in a specific
circumstance is sound, or needs rethinking...

Not-yet-signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..7f8fe16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4358,13 +4358,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, int sync)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) || (sync && cpu_rq(target)->nr_running == 1))
 		return target;
 
 	/*
@@ -4453,7 +4453,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, sync);
 		goto unlock;
 	}
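(A user-space aside, not part of the patch: pipe writers wake their
readers with the sync hint -- pipe_write() uses
wake_up_interruptible_sync_poll() -- so a minimal 1:1 pipe ping-pong
like the sketch below exercises exactly the path the patch changes.
The iteration count is arbitrary and nothing is pinned; placement can
be watched with e.g. perf sched.)

    /* ping-pong over two pipes; every write is a sync wakeup */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int ping[2], pong[2];
        char c = 'x';

        if (pipe(ping) || pipe(pong)) {
            perror("pipe");
            return 1;
        }
        if (fork() == 0) {              /* child: echo each byte back */
            while (read(ping[0], &c, 1) == 1)
                write(pong[1], &c, 1);
            _exit(0);
        }
        for (int i = 0; i < 100000; i++) {  /* parent: ping, wait for pong */
            write(ping[1], &c, 1);
            read(pong[0], &c, 1);
        }
        return 0;
    }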
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 5:32 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> Currently sync wakeups from the wake_affine code cannot work as
> designed, because the task doing the sync wakeup from the target
> cpu will block its wakee from selecting that cpu.
>
> This is despite the fact that whether or not the wakeup is sync
> determines whether or not we want to do an affine wakeup...

If the sync hint really did mean we ARE going to schedule RSN, waking
local would be a good thing.  It is all too often a big fat lie.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 5:41 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> > Currently sync wakeups from the wake_affine code cannot work as
> > designed, because the task doing the sync wakeup from the target
> > cpu will block its wakee from selecting that cpu.
> >
> > This is despite the fact that whether or not the wakeup is sync
> > determines whether or not we want to do an affine wakeup...
>
> If the sync hint really did mean we ARE going to schedule RSN, waking
> local would be a good thing.  It is all too often a big fat lie.

BTW, we used to have avg_overlap as a heuristic to improve the accuracy
of the sync wakeup hint, but I ripped it out because it was dead wrong
far too often.  You can try reviving it, but preemption among other
things breaks it.  You'll have better luck with a simple context switch
rate heuristic, but that's less than perfect too.

	-Mike
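(Sketch only, not kernel code: one shape such a context switch rate
heuristic could take.  p->nvcsw is a real task_struct counter of
voluntary switches; the csw_stamp/csw_mark sampling fields and the
one-switch-per-millisecond threshold are invented for illustration.)

    /* trust the sync hint only if the waker has been blocking often */
    static int waker_switches_fast(struct task_struct *waker, u64 now)
    {
        u64 delta = now - waker->csw_stamp;              /* hypothetical field */
        unsigned long nvcsw = waker->nvcsw - waker->csw_mark; /* hypothetical */

        waker->csw_stamp = now;
        waker->csw_mark = waker->nvcsw;

        /* roughly: at least one voluntary switch per millisecond */
        return nvcsw && delta / nvcsw < NSEC_PER_MSEC;
    }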
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 5:58 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> > Currently sync wakeups from the wake_affine code cannot work as
> > designed, because the task doing the sync wakeup from the target
> > cpu will block its wakee from selecting that cpu.
> >
> > This is despite the fact that whether or not the wakeup is sync
> > determines whether or not we want to do an affine wakeup...
>
> If the sync hint really did mean we ARE going to schedule RSN, waking
> local would be a good thing.  It is all too often a big fat lie.

One example of that is say pgbench.  The mother of all work (server
thread) for that load wakes with the sync hint.  Let the server wake
the first of a small herd CPU affine, and that first wakee then
preempts the server (mother of all work) that drives the entire load.

Byebye throughput.

When there's only one wakee, and there's really not enough overlap to
at least break even, waking CPU affine is a great idea.  Even when your
wakees only run for a short time, if you wake/get-preempted repeatedly,
the load will serialize.

	-Mike
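(A sketch of the 1:N shape described above, under the same caveats as
the earlier ping-pong sketch: one distributor sync-wakes a small herd
over pipes; if wakee #1 lands on, and preempts, the distributor's CPU,
wakees #2..N have to wait behind it.  NWORKERS, the spin loop, and the
round count are arbitrary, and inherited fds are not tidied up.)

    #include <stdio.h>
    #include <unistd.h>

    #define NWORKERS 4

    int main(void)
    {
        int fd[NWORKERS][2];
        char c = 'x';

        for (int i = 0; i < NWORKERS; i++) {
            if (pipe(fd[i]))
                return 1;
            if (fork() == 0) {  /* worker: consume, then burn a bit of CPU */
                while (read(fd[i][0], &c, 1) == 1)
                    for (volatile int n = 0; n < 100000; n++)
                        ;
                _exit(0);
            }
        }
        for (int round = 0; round < 10000; round++)   /* distributor */
            for (int i = 0; i < NWORKERS; i++)
                write(fd[i][1], &c, 1);
        return 0;
    }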
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 6:08 UTC
  To: Mike Galbraith
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On 05/02/2014 01:58 AM, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
>>> Currently sync wakeups from the wake_affine code cannot work as
>>> designed, because the task doing the sync wakeup from the target
>>> cpu will block its wakee from selecting that cpu.
>>>
>>> This is despite the fact that whether or not the wakeup is sync
>>> determines whether or not we want to do an affine wakeup...
>>
>> If the sync hint really did mean we ARE going to schedule RSN, waking
>> local would be a good thing.  It is all too often a big fat lie.
>
> One example of that is say pgbench.  The mother of all work (server
> thread) for that load wakes with the sync hint.  Let the server wake
> the first of a small herd CPU affine, and that first wakee then
> preempts the server (mother of all work) that drives the entire load.
>
> Byebye throughput.
>
> When there's only one wakee, and there's really not enough overlap to
> at least break even, waking CPU affine is a great idea.  Even when your
> wakees only run for a short time, if you wake/get-preempted repeatedly,
> the load will serialize.

I see a similar issue with specjbb2013, with 4 backend and 4 frontend
JVMs on a 4 node NUMA system.

The NUMA balancing code nicely places the memory of each JVM on one
NUMA node, but then the wake_affine code will happily run all of the
threads anywhere on the system, totally ruining memory locality.

The front end and back end only exchange a few hundred messages a
second, over loopback tcp, so the switching rate between threads is
quite low...

I wonder if it would make sense for wake_affine to be off by default,
and only switch on when the right conditions are detected, instead of
having it on by default like we have now?

I have some ideas on that, but I should probably catch some sleep
before trying to code them up :)

Meanwhile, the test patch that I posted may help us figure out whether
the "sync" option in the current wake_affine code does anything useful.

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 6:36 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 02:08 -0400, Rik van Riel wrote:
> On 05/02/2014 01:58 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
> >> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> >>> Currently sync wakeups from the wake_affine code cannot work as
> >>> designed, because the task doing the sync wakeup from the target
> >>> cpu will block its wakee from selecting that cpu.
> >>>
> >>> This is despite the fact that whether or not the wakeup is sync
> >>> determines whether or not we want to do an affine wakeup...
> >>
> >> If the sync hint really did mean we ARE going to schedule RSN, waking
> >> local would be a good thing.  It is all too often a big fat lie.
> >
> > One example of that is say pgbench.  The mother of all work (server
> > thread) for that load wakes with the sync hint.  Let the server wake
> > the first of a small herd CPU affine, and that first wakee then
> > preempts the server (mother of all work) that drives the entire load.
> >
> > Byebye throughput.
> >
> > When there's only one wakee, and there's really not enough overlap to
> > at least break even, waking CPU affine is a great idea.  Even when your
> > wakees only run for a short time, if you wake/get-preempted repeatedly,
> > the load will serialize.
>
> I see a similar issue with specjbb2013, with 4 backend and 4 frontend
> JVMs on a 4 node NUMA system.
>
> The NUMA balancing code nicely places the memory of each JVM on one
> NUMA node, but then the wake_affine code will happily run all of the
> threads anywhere on the system, totally ruining memory locality.

Hm, I thought numasched got excessive pull crap under control.

For steady hefty loads, you want to kill all but periodic load
balancing once the thing gets cranked up.  The less you move tasks, the
better the load will perform.  Bursty loads exist too though, damn the
bad luck.

> The front end and back end only exchange a few hundred messages a
> second, over loopback tcp, so the switching rate between threads is
> quite low...
>
> I wonder if it would make sense for wake_affine to be off by default,
> and only switch on when the right conditions are detected, instead of
> having it on by default like we have now?

Not IMHO, but I have seen situations where that was exactly what I
recommended to fix the throughput problem the user was having.

Reason why is that case was on a box where FAIR_SLEEPERS is disabled by
default, meaning there is no such thing as wakeup preemption.  Guess
what happens when you don't have shared LLC for a fast/light wakee to
escape to when the waker is a pig.  The worst thing possible in that
case is to wake affine.  Leave the poor thing wherever it was, else it
will take a latency hit that need not have been.

> I have some ideas on that, but I should probably catch some sleep
> before trying to code them up :)

Yeah, there are many aspects to ponder.

> Meanwhile, the test patch that I posted may help us figure out whether
> the "sync" option in the current wake_affine code does anything useful.

If I had a NAK stamp and digital ink pad, that patch wouldn't be
readable, much less applicable ;-)

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 6:51 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 08:36 +0200, Mike Galbraith wrote:
> Reason why is that case was on a box where FAIR_SLEEPERS is disabled by
> default, meaning there is no such thing as wakeup preemption.  Guess
> what happens when you don't have shared LLC for a fast/light wakee to
> escape to when the waker is a pig.  The worst thing possible in that
> case is to wake affine.  Leave the poor thing wherever it was, else it
> will take a latency hit that need not have been.

Oh yeah, and you'll see similar issues playing with kvm.  No escape
routes are available, as no llc domain exists.  Globally do a sync
wakeup CPU affine, and for some loads that will induce massive
wreckage, whereas if select_idle_sibling() had been there to save the
day, all would have been peachy.

Is it good, or is it evil... depends.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 6:13 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> Whether or not this is the right thing to do remains to be seen,
> but it does allow us to verify whether or not the wake_affine
> strategy of always doing affine wakeups and only disabling them
> in a specific circumstance is sound, or needs rethinking...

Yes, it needs rethinking.

I know why you want to try this, yes, select_idle_sibling() is very
much a two-faced little bitch.  NOHZ doesn't help, my kernels throttle
it.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 6:30 UTC
  To: Mike Galbraith
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
>
>> Whether or not this is the right thing to do remains to be seen,
>> but it does allow us to verify whether or not the wake_affine
>> strategy of always doing affine wakeups and only disabling them
>> in a specific circumstance is sound, or needs rethinking...
>
> Yes, it needs rethinking.
>
> I know why you want to try this, yes, select_idle_sibling() is very
> much a two-faced little bitch.

My biggest problem with select_idle_sibling and wake_affine in general
is that it will override NUMA placement, even when processes only wake
each other up infrequently...

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 7:37 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> >
> >> Whether or not this is the right thing to do remains to be seen,
> >> but it does allow us to verify whether or not the wake_affine
> >> strategy of always doing affine wakeups and only disabling them
> >> in a specific circumstance is sound, or needs rethinking...
> >
> > Yes, it needs rethinking.
> >
> > I know why you want to try this, yes, select_idle_sibling() is very
> > much a two-faced little bitch.
>
> My biggest problem with select_idle_sibling and wake_affine in general
> is that it will override NUMA placement, even when processes only wake
> each other up infrequently...

Hm, seems the thing to do would be to tell select_task_rq_fair() to
keep its mitts off of tasks that the numasched stuff has placed rather
than decapitating select_idle_sibling() or some other drastic measure.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 10:56 UTC
  To: Mike Galbraith
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On 05/02/2014 03:37 AM, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
>> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
>>>
>>>> Whether or not this is the right thing to do remains to be seen,
>>>> but it does allow us to verify whether or not the wake_affine
>>>> strategy of always doing affine wakeups and only disabling them
>>>> in a specific circumstance is sound, or needs rethinking...
>>>
>>> Yes, it needs rethinking.
>>>
>>> I know why you want to try this, yes, select_idle_sibling() is very
>>> much a two-faced little bitch.
>>
>> My biggest problem with select_idle_sibling and wake_affine in general
>> is that it will override NUMA placement, even when processes only wake
>> each other up infrequently...
>
> Hm, seems the thing to do would be to tell select_task_rq_fair() to
> keep its mitts off of tasks that the numasched stuff has placed rather
> than decapitating select_idle_sibling() or some other drastic measure.

Thing is, if tasks are waking each other up frequently enough, we
probably DO want to place them near each other with
select_idle_sibling.

We just cannot afford to have it as the default behaviour for casual
wakeup activity, because it will mess up other things.

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 11:27 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 06:56 -0400, Rik van Riel wrote:
> On 05/02/2014 03:37 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
> >> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> >>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> >>>
> >>>> Whether or not this is the right thing to do remains to be seen,
> >>>> but it does allow us to verify whether or not the wake_affine
> >>>> strategy of always doing affine wakeups and only disabling them
> >>>> in a specific circumstance is sound, or needs rethinking...
> >>>
> >>> Yes, it needs rethinking.
> >>>
> >>> I know why you want to try this, yes, select_idle_sibling() is very
> >>> much a two-faced little bitch.
> >>
> >> My biggest problem with select_idle_sibling and wake_affine in general
> >> is that it will override NUMA placement, even when processes only wake
> >> each other up infrequently...
> >
> > Hm, seems the thing to do would be to tell select_task_rq_fair() to
> > keep its mitts off of tasks that the numasched stuff has placed rather
> > than decapitating select_idle_sibling() or some other drastic measure.
>
> Thing is, if tasks are waking each other up frequently enough, we
> probably DO want to place them near each other with
> select_idle_sibling.

Right.  I'm thinking you could perhaps create a sched feature like
NUMA_ME_HARDER or such so you can tell it to go away if you find that
your load performs best when movement is left entirely up to the NUMA
placement code.

> We just cannot afford to have it as the default behaviour for casual
> wakeup activity, because it will mess up other things.

I think it is generally good, but yes, it has its bad side, which is
why we have tweakables.

	-Mike
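(For reference, the knob sketched above would be a sched feature plus a
guard; a minimal sketch follows.  SCHED_FEAT()/sched_feat() are the
existing feature machinery, while the NUMA_ME_HARDER bit itself, the
guard, and its placement are hypothetical.)

    /* sketch: kernel/sched/features.h -- hypothetical feature bit */
    SCHED_FEAT(NUMA_ME_HARDER, false)

    /*
     * sketch: early in select_task_rq_fair(); the guard is hypothetical,
     * p->numa_preferred_nid is the real NUMA balancing preference.
     */
    if (sched_feat(NUMA_ME_HARDER) && p->numa_preferred_nid != -1 &&
        cpu_to_node(prev_cpu) == p->numa_preferred_nid)
        return prev_cpu;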
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 12:51 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 13:27 +0200, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 06:56 -0400, Rik van Riel wrote:
> > On 05/02/2014 03:37 AM, Mike Galbraith wrote:
> > > On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
> > >> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> > >>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> > >>>
> > >>>> Whether or not this is the right thing to do remains to be seen,
> > >>>> but it does allow us to verify whether or not the wake_affine
> > >>>> strategy of always doing affine wakeups and only disabling them
> > >>>> in a specific circumstance is sound, or needs rethinking...
> > >>>
> > >>> Yes, it needs rethinking.
> > >>>
> > >>> I know why you want to try this, yes, select_idle_sibling() is very
> > >>> much a two-faced little bitch.
> > >>
> > >> My biggest problem with select_idle_sibling and wake_affine in general
> > >> is that it will override NUMA placement, even when processes only wake
> > >> each other up infrequently...
> > >
> > > Hm, seems the thing to do would be to tell select_task_rq_fair() to
> > > keep its mitts off of tasks that the numasched stuff has placed rather
> > > than decapitating select_idle_sibling() or some other drastic measure.
> >
> > Thing is, if tasks are waking each other up frequently enough, we
> > probably DO want to place them near each other with
> > select_idle_sibling.
>
> Right.  I'm thinking you could perhaps create a sched feature like
> NUMA_ME_HARDER or such so you can tell it to go away if you find that
> your load performs best when movement is left entirely up to the NUMA
> placement code.
>
> > We just cannot afford to have it as the default behaviour for casual
> > wakeup activity, because it will mess up other things.
>
> I think it is generally good, but yes, it has its bad side, which is
> why we have tweakables.

Slightly garbled (multitasking).  To make that a little clearer:
generally, trying to bring waker and wakee together is a good thing.
I think changing the default to avoid doing the generally good thing
is wrong.  Box drivers should be given whatever knobs they need to
adjust as required, and use them, because there is no one correct
answer.

	-Mike
[parent not found: <5363B793.9010208@redhat.com>]
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Peter Zijlstra
  Date: 2014-05-06 11:54 UTC
  To: Rik van Riel
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
> of a NUMA system by default, so even the non-affine wakeup will end
> up looking for the lowest load NUMA node to start up on.

I can't find it being set on anything by default.
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-06 20:19 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
> On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
>> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
>> of a NUMA system by default, so even the non-affine wakeup will end
>> up looking for the lowest load NUMA node to start up on.
>
> I can't find it being set on anything by default.

	.flags = 1*SD_LOAD_BALANCE
	       | 1*SD_BALANCE_NEWIDLE
	       | 0*SD_BALANCE_EXEC
	       | 0*SD_BALANCE_FORK
	       | 0*SD_BALANCE_WAKE
	       | 0*SD_WAKE_AFFINE
	       | 0*SD_SHARE_CPUPOWER
	       | 0*SD_SHARE_PKG_RESOURCES
	       | 1*SD_SERIALIZE
	       | 0*SD_PREFER_SIBLING
	       | 1*SD_NUMA
	       | sd_local_flags(level)

static inline int sd_local_flags(int level)
{
	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
		return 0;

	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
}

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Peter Zijlstra
  Date: 2014-05-06 20:39 UTC
  To: Rik van Riel
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Tue, May 06, 2014 at 04:19:21PM -0400, Rik van Riel wrote:
> On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
> > On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
> >> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
                                     ^^^^^^^^^^^^^^^
> >> of a NUMA system by default, so even the non-affine wakeup will end
> >> up looking for the lowest load NUMA node to start up on.
> >
> > I can't find it being set on anything by default.
>
> 	.flags = 1*SD_LOAD_BALANCE
> 	       | 1*SD_BALANCE_NEWIDLE
> 	       | 0*SD_BALANCE_EXEC
> 	       | 0*SD_BALANCE_FORK
> 	       | 0*SD_BALANCE_WAKE
	           ^ last time I checked 0*x was still 0.
> 	       | 0*SD_WAKE_AFFINE
> 	       | 0*SD_SHARE_CPUPOWER
> 	       | 0*SD_SHARE_PKG_RESOURCES
> 	       | 1*SD_SERIALIZE
> 	       | 0*SD_PREFER_SIBLING
> 	       | 1*SD_NUMA
> 	       | sd_local_flags(level)
>
> static inline int sd_local_flags(int level)
> {
> 	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
> 		return 0;
>
> 	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
> }

No BALANCE_WAKE there.
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-06 23:46 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/06/2014 04:39 PM, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 04:19:21PM -0400, Rik van Riel wrote:
>> On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
>>> On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
>>>> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
>                                    ^^^^^^^^^^^^^^^
>> static inline int sd_local_flags(int level)
>> {
>> 	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
>> 		return 0;
>>
>> 	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
>> }
>
> No BALANCE_WAKE there.

Yeah you're right, I typoed in my earlier email and didn't pick up on
it later.  Doh.

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-09 2:20 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/06/2014 04:39 PM, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 04:19:21PM -0400, Rik van Riel wrote:
>> On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
>>> On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
>>>> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
>                                    ^^^^^^^^^^^^^^^
> No BALANCE_WAKE there.

Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
argument.

static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	...
	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
	...

static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
	cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
	...

static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
	...
	if (sd_flag & SD_BALANCE_WAKE) {
		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
			want_affine = 1;
		new_cpu = prev_cpu;
	}

Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?

-- 
All rights reversed
* [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 5:27 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
> Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
> passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
> argument.
>
> Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?

I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
"this is a wakeup of a previously existing task, please place it
properly".

However, it appears that the current code will fall back to the large
loop with find_idlest_group and friends, if prev_cpu and cpu are not
part of the same SD_WAKE_AFFINE sched domain.  That is a bug...

---8<---
Subject: sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu

If prev_cpu and cpu are not part of the same SD_WAKE_AFFINE domain,
the current code will fall back to the find_idlest_group CPU selection
mechanism, instead of trying to wake up the task on a CPU near its
previous CPU.

Fix that by always calling select_idle_sibling when doing an
SD_BALANCE_WAKE wakeup.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..5d33fb1b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4452,7 +4452,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	if (affine_sd) {
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
+	}
 
+	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
* [PATCH] sched: clean up select_task_rq_fair conditionals and indentation
  From: Rik van Riel
  Date: 2014-05-09 6:04 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

This cleanup goes on top of the previous patch.

We could also skip the whole domain iteration if !want_affine, but
that doesn't fit nicely in 80 columns and may have to be done in yet
another patch (breaking that bit of the code out into its own
function?)

---8<---
Subject: sched: clean up select_task_rq_fair conditionals and indentation

The whole top half of select_task_rq_fair is only run if
sd_flag & SD_BALANCE_WAKE.  The code should probably reflect that.

Also remove the useless new_cpu = prev_cpu assignment, as the value
of new_cpu is not actually used.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 47 ++++++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d33fb1b..844807b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4424,37 +4424,38 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
+	rcu_read_lock();
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
-		new_cpu = prev_cpu;
-	}
 
-	rcu_read_lock();
-	for_each_domain(cpu, tmp) {
-		if (!(tmp->flags & SD_LOAD_BALANCE))
-			continue;
+		for_each_domain(cpu, tmp) {
+			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+				break;
 
-		/*
-		 * If both cpu and prev_cpu are part of this domain,
-		 * cpu is a valid SD_WAKE_AFFINE target.
-		 */
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
-			affine_sd = tmp;
-			break;
-		}
+			if (!(tmp->flags & SD_LOAD_BALANCE))
+				continue;
 
-		if (tmp->flags & sd_flag)
-			sd = tmp;
-	}
+			/*
+			 * If both cpu and prev_cpu are part of this domain,
+			 * cpu is a valid SD_WAKE_AFFINE target.
+			 */
+			if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+			    cpumask_test_cpu(prev_cpu,
+					     sched_domain_span(tmp))) {
+				affine_sd = tmp;
+				break;
+			}
 
-	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
-	}
+			if (tmp->flags & sd_flag)
+				sd = tmp;
+		}
+
+		if (affine_sd) {
+			if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+				prev_cpu = cpu;
+		}
 
-	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-09 7:34 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
> On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
>
> > Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
> > passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
> > argument.
>
> > Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?
>
> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
> "this is a wakeup of a previously existing task, please place it
> properly".
>
> However, it appears that the current code will fall back to the large
> loop with find_idlest_group and friends, if prev_cpu and cpu are not
> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...

ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
we encounter a domain during traversal where Joe User has told us to do
(expensive) wake balancing before we hit a domain shared by
waker/wakee.

The user can turn SD_WAKE_AFFINE off beyond socket, and we'll not pull
cross node on wakeup.

Or, you could create an override button to say despite SD_WAKE_AFFINE
perhaps having been set during domain construction (because of some
pseudo-random numbers), don't do that if we have a preferred node, or
just make that automatically part of having numa scheduling enabled,
and don't bother wasting cycles if preferred && this != preferred.

	-Mike
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 14:22 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/09/2014 03:34 AM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
>> On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
>>
>>> Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
>>> passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
>>> argument.
>>
>>> Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?
>>
>> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
>> "this is a wakeup of a previously existing task, please place it
>> properly".
>>
>> However, it appears that the current code will fall back to the large
>> loop with find_idlest_group and friends, if prev_cpu and cpu are not
>> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...
>
> ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>
> We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
> we encounter a domain during traversal where Joe User has told us to do
> (expensive) wake balancing before we hit a domain shared by
> waker/wakee.
>
> The user can turn SD_WAKE_AFFINE off beyond socket, and we'll not pull
> cross node on wakeup.
>
> Or, you could create an override button to say despite SD_WAKE_AFFINE
> perhaps having been set during domain construction (because of some
> pseudo-random numbers), don't do that if we have a preferred node, or
> just make that automatically part of having numa scheduling enabled,
> and don't bother wasting cycles if preferred && this != preferred.

That's not the problem.

The problem is that if we do not do an affine wakeup, due to
SD_WAKE_AFFINE not being set on a top level domain, we will not try to
run p on prev_cpu, but we will fall through into the loop with
find_idlest_group, etc...

-- 
All rights reversed
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-09 15:24 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 10:22 -0400, Rik van Riel wrote:
> On 05/09/2014 03:34 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
> >> On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
> >>
> >>> Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
> >>> passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
> >>> argument.
> >>
> >>> Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?
> >>
> >> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
> >> "this is a wakeup of a previously existing task, please place it
> >> properly".
> >>
> >> However, it appears that the current code will fall back to the large
> >> loop with find_idlest_group and friends, if prev_cpu and cpu are not
> >> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...
> >
> > ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> >
> > We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
> > we encounter a domain during traversal where Joe User has told us to do
> > (expensive) wake balancing before we hit a domain shared by
> > waker/wakee.
> >
> > The user can turn SD_WAKE_AFFINE off beyond socket, and we'll not pull
> > cross node on wakeup.
> >
> > Or, you could create an override button to say despite SD_WAKE_AFFINE
> > perhaps having been set during domain construction (because of some
> > pseudo-random numbers), don't do that if we have a preferred node, or
> > just make that automatically part of having numa scheduling enabled,
> > and don't bother wasting cycles if preferred && this != preferred.
>
> That's not the problem.
>
> The problem is that if we do not do an affine wakeup, due to
> SD_WAKE_AFFINE not being set on a top level domain, we will not try to
> run p on prev_cpu, but we will fall through into the loop with
> find_idlest_group, etc...

If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
remains NULL, we fall through to return prev_cpu.

	-Mike
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 15:24 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/09/2014 11:24 AM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 10:22 -0400, Rik van Riel wrote:
>> On 05/09/2014 03:34 AM, Mike Galbraith wrote:
>>> On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
>>>> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
>>>> "this is a wakeup of a previously existing task, please place it
>>>> properly".
>>>>
>>>> However, it appears that the current code will fall back to the large
>>>> loop with find_idlest_group and friends, if prev_cpu and cpu are not
>>>> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...
>>>
>>> ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>>>
>>> We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
>>> we encounter a domain during traversal where Joe User has told us to do
>>> (expensive) wake balancing before we hit a domain shared by
>>> waker/wakee.
>>
>> That's not the problem.
>>
>> The problem is that if we do not do an affine wakeup, due to
>> SD_WAKE_AFFINE not being set on a top level domain, we will not try to
>> run p on prev_cpu, but we will fall through into the loop with
>> find_idlest_group, etc...
>
> If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
> remains NULL, we fall through to return prev_cpu.

We do fall through, but into this loop:

	while (sd) {
		struct sched_group *group;
		int weight;

		if (!(sd->flags & sd_flag)) {
			sd = sd->child;
			continue;
		}

		group = find_idlest_group(sd, p, cpu, sd_flag);
		if (!group) {
			sd = sd->child;
			continue;
		}

		new_cpu = find_idlest_cpu(group, p, cpu);
		if (new_cpu == -1 || new_cpu == cpu) {
			/* Now try balancing at a lower domain level of cpu */
			sd = sd->child;
			continue;
		}

		/* Now try balancing at a lower domain level of new_cpu */
		cpu = new_cpu;
		weight = sd->span_weight;
		sd = NULL;
		for_each_domain(cpu, tmp) {
			if (weight <= tmp->span_weight)
				break;
			if (tmp->flags & sd_flag)
				sd = tmp;
		}
		/* while loop will break here if sd == NULL */
	}
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-09 17:55 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 11:24 -0400, Rik van Riel wrote:
> On 05/09/2014 11:24 AM, Mike Galbraith wrote:

> > If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
> > remains NULL, we fall through to return prev_cpu.
                                            ^^^^^^^^^^^^

> We do fall through, but into this loop:

> 	while (sd) {
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 18:16 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/09/2014 01:55 PM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 11:24 -0400, Rik van Riel wrote:
>> On 05/09/2014 11:24 AM, Mike Galbraith wrote:
>
>>> If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
>>> remains NULL, we fall through to return prev_cpu.
>                                           ^^^^^^^^^^^^
>
>> We do fall through, but into this loop:
>
>> 	while (sd) {

You are right.  That code is a little hard to follow...

That leaves the big question: do we want to fall back to prev_cpu if it
is not idle, even when it has an idle sibling, or would it be better to
find an idle sibling of prev_cpu when we wake up a task?
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-10 3:54 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 14:16 -0400, Rik van Riel wrote:
> That leaves the big question: do we want to fall back to prev_cpu if it
> is not idle, even when it has an idle sibling, or would it be better to
> find an idle sibling of prev_cpu when we wake up a task?

Yes.  If there was A correct answer, this stuff would be a lot easier.

	-Mike
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-13 14:08 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu", Suresh Siddha

On 05/09/2014 11:54 PM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 14:16 -0400, Rik van Riel wrote:
>
>> That leaves the big question: do we want to fall back to prev_cpu if it
>> is not idle, even when it has an idle sibling, or would it be better to
>> find an idle sibling of prev_cpu when we wake up a task?
>
> Yes.  If there was A correct answer, this stuff would be a lot easier.

OK, after doing some other NUMA stuff, and then looking at the
scheduler again with a fresh mind, I have drawn some more conclusions
about what the scheduler does, and how it breaks NUMA locality :)

1) If the node_distance between nodes on a NUMA system is
   <= RECLAIM_DISTANCE, we will call select_idle_sibling for
   a wakeup of a previously existing task (SD_BALANCE_WAKE)

2) If the node distance exceeds RECLAIM_DISTANCE, we will
   wake up a task on prev_cpu, even if it is not currently
   idle

   This behaviour only happens on certain large NUMA systems,
   and is different from the behaviour on small systems.
   I suspect we will want to call select_idle_sibling with
   prev_cpu in case target and prev_cpu are not in the same
   SD_WAKE_AFFINE domain.

3) If wake_wide is false, we call select_idle_sibling with
   the CPU number of the task that is doing the wakeup

4) If wake_wide is true, we call select_idle_sibling with
   the CPU number the task was previously running on (prev_cpu)

   In effect, the "wake task on waking task's CPU" behaviour
   is the default, regardless of how frequently a task wakes up
   its wakee, and regardless of impact on NUMA locality.

   This may need to be changed.

5) select_idle_sibling will place the task on (3) or (4) only
   if the CPU is actually idle.  If task A communicates with
   task B through a pipe or a socket, and does a sync wakeup,
   task B will never be placed on task A's CPU (not idle yet),
   and it will only be placed on its own previous CPU if it is
   currently idle.

6) If neither CPU is idle, select_idle_sibling will walk all
   the CPUs in the SD_SHARE_PKG_RESOURCES SD of the target.
   This looks correct to me, though it could result in more
   work by the load balancing code later on, since it does not
   take load into account at all.  It is unclear if this needs
   any changes.

Am I overlooking anything?

What benchmarks should I run to test any changes I make?
Are there particular system types people want me to run tests with?

-- 
All rights reversed
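(For points 5 and 6, a condensed sketch of the select_idle_sibling()
flow in this era; abbreviated from kernel/sched/fair.c with the group
walk body elided, so treat it as a sketch rather than a verbatim
quote.)

    static int select_idle_sibling(struct task_struct *p, int target)
    {
        struct sched_domain *sd;
        struct sched_group *sg;
        int i = task_cpu(p);

        /* point 5: the wakeup target, only if it is actually idle */
        if (idle_cpu(target))
            return target;

        /* point 5: prev_cpu, only if cache affine with target and idle */
        if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
            return i;

        /* point 6: walk the LLC (SD_SHARE_PKG_RESOURCES) domain of target */
        sd = rcu_dereference(per_cpu(sd_llc, target));
        for_each_lower_domain(sd) {
            sg = sd->groups;
            do {
                /* ... take the first allowed CPU of a fully idle group ... */
                sg = sg->next;
            } while (sg != sd->groups);
        }
        return target;
    }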
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-14 4:08 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu", Suresh Siddha

On Tue, 2014-05-13 at 10:08 -0400, Rik van Riel wrote:
> OK, after doing some other NUMA stuff, and then looking at the
> scheduler again with a fresh mind, I have drawn some more conclusions
> about what the scheduler does, and how it breaks NUMA locality :)
>
> 1) If the node_distance between nodes on a NUMA system is
>    <= RECLAIM_DISTANCE, we will call select_idle_sibling for
>    a wakeup of a previously existing task (SD_BALANCE_WAKE)
>
> 2) If the node distance exceeds RECLAIM_DISTANCE, we will
>    wake up a task on prev_cpu, even if it is not currently
>    idle
>
>    This behaviour only happens on certain large NUMA systems,
>    and is different from the behaviour on small systems.
>    I suspect we will want to call select_idle_sibling with
>    prev_cpu in case target and prev_cpu are not in the same
>    SD_WAKE_AFFINE domain.

Sometimes.  It's the same can of worms remote as it is local...
latency gain may or may not outweigh cache miss pain.

> 3) If wake_wide is false, we call select_idle_sibling with
>    the CPU number of the task that is doing the wakeup
>
> 4) If wake_wide is true, we call select_idle_sibling with
>    the CPU number the task was previously running on (prev_cpu)
>
>    In effect, the "wake task on waking task's CPU" behaviour
>    is the default, regardless of how frequently a task wakes up
>    its wakee, and regardless of impact on NUMA locality.
>
>    This may need to be changed.

That behavior also improves the odds of communicating tasks sharing a
cache though.

> Am I overlooking anything?

No, I think you're seeing where the worms live.

> What benchmarks should I run to test any changes I make?

Mixed bag; it'll affect all kinds: bursty, static, ramp up/down.

	-Mike
* [PATCH] sched: call select_idle_sibling when not affine_sd
  From: Rik van Riel
  Date: 2014-05-14 15:40 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu"

On Wed, 14 May 2014 06:08:09 +0200, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> On Tue, 2014-05-13 at 10:08 -0400, Rik van Riel wrote:
> > 1) If the node_distance between nodes on a NUMA system is
> >    <= RECLAIM_DISTANCE, we will call select_idle_sibling for
> >    a wakeup of a previously existing task (SD_BALANCE_WAKE)
> >
> > 2) If the node distance exceeds RECLAIM_DISTANCE, we will
> >    wake up a task on prev_cpu, even if it is not currently
> >    idle
> >
> >    This behaviour only happens on certain large NUMA systems,
> >    and is different from the behaviour on small systems.
> >    I suspect we will want to call select_idle_sibling with
> >    prev_cpu in case target and prev_cpu are not in the same
> >    SD_WAKE_AFFINE domain.
>
> Sometimes.  It's the same can of worms remote as it is local...
> latency gain may or may not outweigh cache miss pain.

Ahh, but it is a DIFFERENT can of worms.  If the distance between cpu
and prev_cpu exceeds RECLAIM_DISTANCE, we will not look for an idle
sibling in the same LLC domain as prev_cpu.  If the distance is
smaller, and we decide not to do an affine wakeup, then we do look for
an idle sibling of prev_cpu.

This patch makes sure that both types of systems have the same can of
worms :)

---8<---
Subject: sched: call select_idle_sibling when not affine_sd

On smaller systems, the top level sched domain will be an affine
domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
wakeup.  This seems to be working well.

On larger systems, with the node distance between far away NUMA nodes
being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.

This patch leaves in place the policy of not pulling the task across
nodes on such systems, while fixing the issue that select_idle_sibling
is not called at all in certain circumstances.

The code will look for an idle CPU in the same CPU package as the CPU
where the task ran previously.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39b63d0..1e58159 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4423,10 +4423,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		sd = tmp;
 	}
 
-	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
+	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		prev_cpu = cpu;
 
+	if (sd_flag & SD_WAKE_AFFINE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
* Re: [PATCH] sched: call select_idle_sibling when not affine_sd
  From: Peter Zijlstra
  Date: 2014-05-14 15:45 UTC
  To: Rik van Riel
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu"

On Wed, May 14, 2014 at 11:40:37AM -0400, Rik van Riel wrote:
> Subject: sched: call select_idle_sibling when not affine_sd
>
> On smaller systems, the top level sched domain will be an affine
> domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
> wakeup.  This seems to be working well.
>
> On larger systems, with the node distance between far away NUMA nodes
> being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
> waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.
>
> This patch leaves in place the policy of not pulling the task across
> nodes on such systems, while fixing the issue that select_idle_sibling
> is not called at all in certain circumstances.
>
> The code will look for an idle CPU in the same CPU package as the CPU
> where the task ran previously.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/fair.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 39b63d0..1e58159 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4423,10 +4423,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		sd = tmp;
>  	}
>  
> -	if (affine_sd) {
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> -			prev_cpu = cpu;
> +	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		prev_cpu = cpu;
>  
> +	if (sd_flag & SD_WAKE_AFFINE) {

I think you meant SD_BALANCE_WAKE?

>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}
* [tip:sched/core] sched: call select_idle_sibling when not affine_sd 2014-05-14 15:40 ` [PATCH] sched: call select_idle_sibling when not affine_sd Rik van Riel 2014-05-14 15:45 ` Peter Zijlstra @ 2014-05-19 13:08 ` tip-bot for Rik van Riel 2014-05-22 12:27 ` [tip:sched/core] sched: Call select_idle_sibling() " tip-bot for Rik van Riel 2 siblings, 0 replies; 46+ messages in thread From: tip-bot for Rik van Riel @ 2014-05-19 13:08 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, peterz, umgwanakikbuti, riel, mgorman, tglx Commit-ID: b45cf72cf7e1dd3b4a95947f85659cfdc01dbdad Gitweb: http://git.kernel.org/tip/b45cf72cf7e1dd3b4a95947f85659cfdc01dbdad Author: Rik van Riel <riel@redhat.com> AuthorDate: Wed, 14 May 2014 11:40:37 -0400 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Mon, 19 May 2014 22:02:40 +0900 sched: call select_idle_sibling when not affine_sd On smaller systems, the top level sched domain will be an affine domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE wakeup. This seems to be working well. On larger systems, with the node distance between far away NUMA nodes being > RECLAIM_DISTANCE, select_idle_sibling is only called if the waker and the wakee are on nodes less than RECLAIM_DISTANCE apart. This patch leaves in place the policy of not pulling the task across nodes on such systems, while fixing the issue that select_idle_sibling is not called at all in certain circumstances. The code will look for an idle CPU in the same CPU package as the CPU where the task ran previously. Cc: morten.rasmussen@arm.com Cc: mingo@kernel.org Cc: george.mccollister@gmail.com Cc: ktkhai@parallels.com Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dd3fa14..429164d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4473,10 +4473,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f sd = tmp; } - if (affine_sd) { - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + prev_cpu = cpu; + if (sd_flag & SD_BALANCE_WAKE) { new_cpu = select_idle_sibling(p, prev_cpu); goto unlock; } ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [tip:sched/core] sched: Call select_idle_sibling() when not affine_sd 2014-05-14 15:40 ` [PATCH] sched: call select_idle_sibling when not affine_sd Rik van Riel 2014-05-14 15:45 ` Peter Zijlstra 2014-05-19 13:08 ` [tip:sched/core] " tip-bot for Rik van Riel @ 2014-05-22 12:27 ` tip-bot for Rik van Riel 2 siblings, 0 replies; 46+ messages in thread From: tip-bot for Rik van Riel @ 2014-05-22 12:27 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, peterz, umgwanakikbuti, riel, mgorman, tglx Commit-ID: 8bf21433f38b020c3d8a3805d1d7fb73d7b40c01 Gitweb: http://git.kernel.org/tip/8bf21433f38b020c3d8a3805d1d7fb73d7b40c01 Author: Rik van Riel <riel@redhat.com> AuthorDate: Wed, 14 May 2014 11:40:37 -0400 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Thu, 22 May 2014 11:16:28 +0200 sched: Call select_idle_sibling() when not affine_sd On smaller systems, the top level sched domain will be an affine domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE wakeup. This seems to be working well. On larger systems, with the node distance between far away NUMA nodes being > RECLAIM_DISTANCE, select_idle_sibling is only called if the waker and the wakee are on nodes less than RECLAIM_DISTANCE apart. This patch leaves in place the policy of not pulling the task across nodes on such systems, while fixing the issue that select_idle_sibling is not called at all in certain circumstances. The code will look for an idle CPU in the same CPU package as the CPU where the task ran previously. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: morten.rasmussen@arm.com Cc: george.mccollister@gmail.com Cc: ktkhai@parallels.com Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dd3fa14..429164d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4473,10 +4473,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f sd = tmp; } - if (affine_sd) { - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + prev_cpu = cpu; + if (sd_flag & SD_BALANCE_WAKE) { new_cpu = select_idle_sibling(p, prev_cpu); goto unlock; } ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-02 6:30 ` Rik van Riel 2014-05-02 7:37 ` Mike Galbraith @ 2014-05-04 11:44 ` Preeti Murthy 2014-05-04 12:04 ` Mike Galbraith ` (2 more replies) 1 sibling, 3 replies; 46+ messages in thread From: Preeti Murthy @ 2014-05-04 11:44 UTC (permalink / raw) To: Rik van Riel, umgwanakikbuti Cc: LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai, Preeti U Murthy Hi Rik, Mike On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: > On 05/02/2014 02:13 AM, Mike Galbraith wrote: >> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >> >>> Whether or not this is the right thing to do remains to be seen, >>> but it does allow us to verify whether or not the wake_affine >>> strategy of always doing affine wakeups and only disabling them >>> in a specific circumstance is sound, or needs rethinking... >> >> Yes, it needs rethinking. >> >> I know why you want to try this, yes, select_idle_sibling() is very much >> a two faced little bitch. > > My biggest problem with select_idle_sibling and wake_affine in > general is that it will override NUMA placement, even when > processes only wake each other up infrequently... As far as my understanding goes, the logic in select_task_rq_fair() calls wake_affine() or select_idle_sibling() only at those levels of sched domains where the flag SD_WAKE_AFFINE is set. This flag is not set at the numa domain and hence they will not be balancing across numa nodes. So I don't understand how *these functions* are affecting NUMA placements. The wake_affine() and select_idle_sibling() will shuttle tasks within a NUMA node as far as I can see, i.e. if the cpu that the task previously ran on and the waker cpu belong to the same node. Else they are not called. If the prev_cpu and the waker cpu are on different NUMA nodes then naturally the tasks will get shuttled across NUMA nodes, but the culprits are the find_idlest* functions. They do a top-down search for the idlest group and cpu, starting at the NUMA domain *attached to the waker and not the prev_cpu*. This means that the task will end up on a different NUMA node. It looks to me like the problem lies here and not in wake_affine() and select_idle_sibling(). Regards Preeti U Murthy > > -- > All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
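Preeti's two cases map onto the structure of select_task_rq_fair(); the fragment below is a paraphrased skeleton of that function from this era, with the two paths compressed into comments. It is a simplification for orientation, not verbatim kernel code.

	for_each_domain(cpu, tmp) {
		if (!(tmp->flags & SD_LOAD_BALANCE))
			continue;

		/*
		 * Remember the lowest SD_WAKE_AFFINE domain that spans
		 * both the waker's cpu and the wakee's prev_cpu.
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			affine_sd = tmp;
			break;
		}

		if (tmp->flags & sd_flag)
			sd = tmp;
	}

	if (affine_sd) {
		/* Fast path: wake_affine() plus select_idle_sibling(). */
	} else {
		/*
		 * Slow path: find_idlest_group()/find_idlest_cpu(), walking
		 * top-down from @sd, which hangs off the *waker's* cpu and
		 * not off prev_cpu -- the behaviour Preeti points at.
		 */
	}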
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy @ 2014-05-04 12:04 ` Mike Galbraith 2014-05-05 4:38 ` Preeti U Murthy 2014-05-04 12:41 ` Rik van Riel 2014-05-06 11:56 ` Peter Zijlstra 2 siblings, 1 reply; 46+ messages in thread From: Mike Galbraith @ 2014-05-04 12:04 UTC (permalink / raw) To: Preeti Murthy Cc: Rik van Riel, LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai, Preeti U Murthy On Sun, 2014-05-04 at 17:14 +0530, Preeti Murthy wrote: > Hi Rik, Mike > > On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: > > On 05/02/2014 02:13 AM, Mike Galbraith wrote: > >> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: > >> > >>> Whether or not this is the right thing to do remains to be seen, > >>> but it does allow us to verify whether or not the wake_affine > >>> strategy of always doing affine wakeups and only disabling them > >>> in a specific circumstance is sound, or needs rethinking... > >> > >> Yes, it needs rethinking. > >> > >> I know why you want to try this, yes, select_idle_sibling() is very much > >> a two faced little bitch. > > > > My biggest problem with select_idle_sibling and wake_affine in > > general is that it will override NUMA placement, even when > > processes only wake each other up infrequently... > > As far as my understanding goes, the logic in select_task_rq_fair() > does wake_affine() or calls select_idle_sibling() only at those > levels of sched domains where the flag SD_WAKE_AFFINE is set. > This flag is not set at the numa domain and hence they will not be > balancing across numa nodes. So I don't understand how > *these functions* are affecting NUMA placements. Depends on how far away node yonder is I suppose. static inline int sd_local_flags(int level) { if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE) return 0; return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; } ^ permalink raw reply [flat|nested] 46+ messages in thread
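The effect of the threshold Mike quotes is easy to demonstrate with a toy user-space program. Everything below is illustrative: the distances are invented, SD_WAKE_AFFINE's numeric value only matters as "non-zero", and sd_local_flags() is simplified to take a distance directly rather than a domain level.

#include <stdio.h>

#define RECLAIM_DISTANCE 30	/* generic default of this era */
#define SD_WAKE_AFFINE   0x20	/* illustrative; only non-zero-ness matters */

static int sd_local_flags(int distance)
{
	if (distance > RECLAIM_DISTANCE)
		return 0;
	return SD_WAKE_AFFINE;
}

int main(void)
{
	/* Hypothetical SLIT distances for successively remote NUMA levels. */
	int distance[] = { 10, 21, 31, 41 };

	for (int i = 0; i < 4; i++)
		printf("distance %2d -> wake affine %s\n", distance[i],
		       sd_local_flags(distance[i]) ? "allowed" : "disabled");
	return 0;
}

With the default of 30, the 10 and 21 levels keep SD_WAKE_AFFINE while 31 and 41 lose it, which is exactly the small-versus-large system split being discussed.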
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 12:04 ` Mike Galbraith @ 2014-05-05 4:38 ` Preeti U Murthy 0 siblings, 0 replies; 46+ messages in thread From: Preeti U Murthy @ 2014-05-05 4:38 UTC (permalink / raw) To: Mike Galbraith Cc: Preeti Murthy, Rik van Riel, LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai On 05/04/2014 05:34 PM, Mike Galbraith wrote: > On Sun, 2014-05-04 at 17:14 +0530, Preeti Murthy wrote: >> Hi Rik, Mike >> >> On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >>> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>>> >>>>> Whether or not this is the right thing to do remains to be seen, >>>>> but it does allow us to verify whether or not the wake_affine >>>>> strategy of always doing affine wakeups and only disabling them >>>>> in a specific circumstance is sound, or needs rethinking... >>>> >>>> Yes, it needs rethinking. >>>> >>>> I know why you want to try this, yes, select_idle_sibling() is very much >>>> a two faced little bitch. >>> >>> My biggest problem with select_idle_sibling and wake_affine in >>> general is that it will override NUMA placement, even when >>> processes only wake each other up infrequently... >> >> As far as my understanding goes, the logic in select_task_rq_fair() >> does wake_affine() or calls select_idle_sibling() only at those >> levels of sched domains where the flag SD_WAKE_AFFINE is set. >> This flag is not set at the numa domain and hence they will not be >> balancing across numa nodes. So I don't understand how >> *these functions* are affecting NUMA placements. > > Depends on how far away node yonder is I suppose. > > static inline int sd_local_flags(int level) > { > if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE) > return 0; > > return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; > } > > Hmm thanks Mike, I totally missed this! Regards Preeti U Murthy ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy 2014-05-04 12:04 ` Mike Galbraith @ 2014-05-04 12:41 ` Rik van Riel 2014-05-05 4:50 ` Preeti U Murthy 2014-05-06 13:25 ` Peter Zijlstra 2014-05-06 11:56 ` Peter Zijlstra 2 siblings, 2 replies; 46+ messages in thread From: Rik van Riel @ 2014-05-04 12:41 UTC (permalink / raw) To: Preeti Murthy, umgwanakikbuti Cc: LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai, Preeti U Murthy On 05/04/2014 07:44 AM, Preeti Murthy wrote: > Hi Rik, Mike > > On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>> >>>> Whether or not this is the right thing to do remains to be seen, >>>> but it does allow us to verify whether or not the wake_affine >>>> strategy of always doing affine wakeups and only disabling them >>>> in a specific circumstance is sound, or needs rethinking... >>> >>> Yes, it needs rethinking. >>> >>> I know why you want to try this, yes, select_idle_sibling() is very much >>> a two faced little bitch. >> >> My biggest problem with select_idle_sibling and wake_affine in >> general is that it will override NUMA placement, even when >> processes only wake each other up infrequently... > > As far as my understanding goes, the logic in select_task_rq_fair() > does wake_affine() or calls select_idle_sibling() only at those > levels of sched domains where the flag SD_WAKE_AFFINE is set. > This flag is not set at the numa domain and hence they will not be > balancing across numa nodes. So I don't understand how > *these functions* are affecting NUMA placements. Even on 8-node DL980 systems, the NUMA distance in the SLIT table is less than RECLAIM_DISTANCE, and we will do wake_affine across the entire system. > The wake_affine() and select_idle_sibling() will shuttle tasks > within a NUMA node as far as I can see.i.e. if the cpu that the task > previously ran on and the waker cpu belong to the same node. > Else they are not called. That is what I first hoped, too. I was wrong. > If the prev_cpu and the waker cpu are on different NUMA nodes > then naturally the tasks will get shuttled across NUMA nodes but > the culprits are the find_idlest* functions. > They do a top-down search for the idlest group and cpu, starting > at the NUMA domain *attached to the waker and not the prev_cpu*. > This means that the task will end up on a different NUMA node. > Looks to me that the problem lies here and not in the wake_affine() > and select_idle_siblings(). I have a patch for find_idlest_group that takes the NUMA distance between each group and the task's preferred node into account. However, as long as the wake_affine stuff still gets to override it, that does not make much difference :) -- All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
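Rik's find_idlest_group() patch is not shown in this thread; purely to illustrate the idea he describes -- this is hypothetical code, not his patch -- a distance-aware comparison could weight each candidate group's load by how far the group sits from the task's preferred node, so that remote groups must be proportionally idler to win:

/*
 * Hypothetical sketch only: scale a candidate group's average load by
 * the SLIT distance between that group and the task's preferred node.
 * A group twice as far away then has to look twice as idle before
 * find_idlest_group() would pick it.
 */
static unsigned long distance_weighted_load(struct sched_group *sg,
					    int preferred_nid,
					    unsigned long avg_load)
{
	int nid = cpu_to_node(cpumask_first(sched_group_cpus(sg)));

	return avg_load * node_distance(preferred_nid, nid);
}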
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 12:41 ` Rik van Riel @ 2014-05-05 4:50 ` Preeti U Murthy 2014-05-05 6:43 ` Preeti U Murthy ` (2 more replies) 1 sibling, 3 replies; 46+ messages in thread From: Preeti U Murthy @ 2014-05-05 4:50 UTC (permalink / raw) To: Rik van Riel, umgwanakikbuti, Peter Zijlstra Cc: Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai On 05/04/2014 06:11 PM, Rik van Riel wrote: > On 05/04/2014 07:44 AM, Preeti Murthy wrote: >> Hi Rik, Mike >> >> On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >>> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>>> >>>>> Whether or not this is the right thing to do remains to be seen, >>>>> but it does allow us to verify whether or not the wake_affine >>>>> strategy of always doing affine wakeups and only disabling them >>>>> in a specific circumstance is sound, or needs rethinking... >>>> >>>> Yes, it needs rethinking. >>>> >>>> I know why you want to try this, yes, select_idle_sibling() is very much >>>> a two faced little bitch. >>> >>> My biggest problem with select_idle_sibling and wake_affine in >>> general is that it will override NUMA placement, even when >>> processes only wake each other up infrequently... >> >> As far as my understanding goes, the logic in select_task_rq_fair() >> does wake_affine() or calls select_idle_sibling() only at those >> levels of sched domains where the flag SD_WAKE_AFFINE is set. >> This flag is not set at the numa domain and hence they will not be >> balancing across numa nodes. So I don't understand how >> *these functions* are affecting NUMA placements. > > Even on 8-node DL980 systems, the NUMA distance in the > SLIT table is less than RECLAIM_DISTANCE, and we will > do wake_affine across the entire system. > >> The wake_affine() and select_idle_sibling() will shuttle tasks >> within a NUMA node as far as I can see.i.e. if the cpu that the task >> previously ran on and the waker cpu belong to the same node. >> Else they are not called. > > That is what I first hoped, too. I was wrong. > >> If the prev_cpu and the waker cpu are on different NUMA nodes >> then naturally the tasks will get shuttled across NUMA nodes but >> the culprits are the find_idlest* functions. >> They do a top-down search for the idlest group and cpu, starting >> at the NUMA domain *attached to the waker and not the prev_cpu*. >> This means that the task will end up on a different NUMA node. >> Looks to me that the problem lies here and not in the wake_affine() >> and select_idle_siblings(). > > I have a patch for find_idlest_group that takes the NUMA > distance between each group and the task's preferred node > into account. > > However, as long as the wake_affine stuff still gets to > override it, that does not make much difference :) > Yeah, now I see it. But I still feel wake_affine() and select_idle_sibling() are not primarily at fault, because when they were introduced I don't think it was foreseen that the cpu topology would grow to the extent it has now. select_idle_sibling(), for instance, scans the cpus within the purview of the last level cache of a cpu, and this was a small set. Hence there was no overhead. Now with many cpus sharing the L3 cache, we see an overhead. wake_affine() probably did not expect the NUMA nodes to come under its governance as well, and hence it sees no harm in waking up tasks close to the waker because it still believes that it will be within a node. What has changed, I feel, is the code around these two functions. Take this problem for instance. We ourselves are saying in sd_local_flags() that this specific domain is fit for wake affine balance. So naturally the logic in wake_affine() and select_idle_sibling() will follow. My point is that the peripheral code is seeing the negative effect of these two functions because they have been pushed under its ambit. Don't you think we should go conservative on the value of RECLAIM_DISTANCE in arch-specific code at least? On powerpc we set it to 10. Besides, the git log does not tell us the basis on which this value was set to a default of 30. Maybe this needs a rethink? Regards Preeti U Murthy ^ permalink raw reply [flat|nested] 46+ messages in thread
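For reference, the two definitions Preeti mentions looked roughly like this at the time:

/* include/linux/topology.h -- the generic default: */
#ifndef RECLAIM_DISTANCE
#define RECLAIM_DISTANCE 30
#endif

/* arch/powerpc/include/asm/topology.h -- the powerpc override: */
#define RECLAIM_DISTANCE 10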
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-05 4:50 ` Preeti U Murthy @ 2014-05-05 6:43 ` Preeti U Murthy 2014-05-05 11:28 ` Rik van Riel 2014-05-06 13:26 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Preeti U Murthy @ 2014-05-05 6:43 UTC (permalink / raw) To: Rik van Riel, umgwanakikbuti, Peter Zijlstra Cc: Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai On 05/05/2014 10:20 AM, Preeti U Murthy wrote: > On 05/04/2014 06:11 PM, Rik van Riel wrote: >> On 05/04/2014 07:44 AM, Preeti Murthy wrote: >>> Hi Rik, Mike >>> >>> On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >>>> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>>>> >>>>>> Whether or not this is the right thing to do remains to be seen, >>>>>> but it does allow us to verify whether or not the wake_affine >>>>>> strategy of always doing affine wakeups and only disabling them >>>>>> in a specific circumstance is sound, or needs rethinking... >>>>> >>>>> Yes, it needs rethinking. >>>>> >>>>> I know why you want to try this, yes, select_idle_sibling() is very much >>>>> a two faced little bitch. >>>> >>>> My biggest problem with select_idle_sibling and wake_affine in >>>> general is that it will override NUMA placement, even when >>>> processes only wake each other up infrequently... >>> >>> As far as my understanding goes, the logic in select_task_rq_fair() >>> does wake_affine() or calls select_idle_sibling() only at those >>> levels of sched domains where the flag SD_WAKE_AFFINE is set. >>> This flag is not set at the numa domain and hence they will not be >>> balancing across numa nodes. So I don't understand how >>> *these functions* are affecting NUMA placements. >> >> Even on 8-node DL980 systems, the NUMA distance in the >> SLIT table is less than RECLAIM_DISTANCE, and we will >> do wake_affine across the entire system. >> >>> The wake_affine() and select_idle_sibling() will shuttle tasks >>> within a NUMA node as far as I can see.i.e. if the cpu that the task >>> previously ran on and the waker cpu belong to the same node. >>> Else they are not called. >> >> That is what I first hoped, too. I was wrong. >> >>> If the prev_cpu and the waker cpu are on different NUMA nodes >>> then naturally the tasks will get shuttled across NUMA nodes but >>> the culprits are the find_idlest* functions. >>> They do a top-down search for the idlest group and cpu, starting >>> at the NUMA domain *attached to the waker and not the prev_cpu*. >>> This means that the task will end up on a different NUMA node. >>> Looks to me that the problem lies here and not in the wake_affine() >>> and select_idle_siblings(). >> >> I have a patch for find_idlest_group that takes the NUMA >> distance between each group and the task's preferred node >> into account. >> >> However, as long as the wake_affine stuff still gets to >> override it, that does not make much difference :) >> > > Yeah now I see it. But I still feel wake_affine() and > select_idle_sibling() are not at fault primarily because when they were > introduced, I don't think it was foreseen that the cpu topology would > grow to the extent it is now. > > select_idle_sibling() for instance scans the cpus within the purview of > the last level cache of a cpu and this was a small set. Hence there was > no overhead. Now with many cpus sharing the L3 cache, we see an > overhead. wake_affine() probably did not expect the NUMA nodes to come > under its governance as well and hence it sees no harm in waking up > tasks close to the waker because it still believes that it will be > within a node. > > What has changed is the code around these two functions I feel. Take > this problem for instance. We ourselves are saying in sd_local_flags() > that this specific domain is fit for wake affine balance. So naturally > the logic in wake_affine and select_idle_sibling() will follow. > My point is the peripheral code is seeing the negative affect of these > two functions because they pushed themselves under its ambit. > > Don't you think we should go conservative on the value of > RECLAIM_DISTANCE in arch specific code at-least? On powerpc we set it to > 10. Besides, the git log does not tell us the basis on which this value > was set to a default of 30. Maybe this needs re-thought? Sorry, I overlooked this. Commit 32e45ff43eaf5c17f5a increased the value to 30, and the reason is clearly mentioned there: the value was arbitrarily chosen. I don't know if this will help the discussion, but I thought I would point it out. Thanks Regards Preeti U Murthy > > Regards > Preeti U Murthy > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-05 4:50 ` Preeti U Murthy 2014-05-05 6:43 ` Preeti U Murthy @ 2014-05-05 11:28 ` Rik van Riel 2014-05-06 13:26 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2014-05-05 11:28 UTC (permalink / raw) To: Preeti U Murthy, umgwanakikbuti, Peter Zijlstra Cc: Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai On 05/05/2014 12:50 AM, Preeti U Murthy wrote: > Yeah now I see it. But I still feel wake_affine() and > select_idle_sibling() are not at fault primarily because when they were > introduced, I don't think it was foreseen that the cpu topology would > grow to the extent it is now. It's not about "fault", it is about the fact that on current large NUMA systems they are broken, and could stand some improvement :) > select_idle_sibling() for instance scans the cpus within the purview of > the last level cache of a cpu and this was a small set. Hence there was > no overhead. Now with many cpus sharing the L3 cache, we see an > overhead. wake_affine() probably did not expect the NUMA nodes to come > under its governance as well and hence it sees no harm in waking up > tasks close to the waker because it still believes that it will be > within a node. If two tasks truly are related to each other, I think we will want to have the wake_affine logic pull them towards each other, all the way across a giant NUMA system if needs be. The problem is that the current wake_affine logic starts in the ON position, and only switches off in a few very specific scenarios. I suspect we would be better off with the reverse, starting with wake_affine in the off position, and switching it on when we detect it makes sense to do so. -- All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
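Purely as an illustration of that inversion -- the fields and the threshold below are invented, this is not a proposed patch -- an opt-in gate might demand positive evidence of wakeup coupling before the affine path is considered at all:

/*
 * Illustrative sketch only: wake_affine starts "off", and a waker/wakee
 * pair earns affine treatment by demonstrably waking each other often.
 * last_waker and waker_wakeups are invented fields, and
 * WAKE_COUPLING_THRESHOLD an invented tunable.
 */
static int want_affine_wakeup(struct task_struct *waker,
			      struct task_struct *wakee)
{
	return wakee->last_waker == waker &&
	       wakee->waker_wakeups >= WAKE_COUPLING_THRESHOLD;
}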
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-05 4:50 ` Preeti U Murthy 2014-05-05 6:43 ` Preeti U Murthy 2014-05-05 11:28 ` Rik van Riel @ 2014-05-06 13:26 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 13:26 UTC (permalink / raw) To: Preeti U Murthy Cc: Rik van Riel, umgwanakikbuti, Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai [-- Attachment #1: Type: text/plain, Size: 394 bytes --] On Mon, May 05, 2014 at 10:20:20AM +0530, Preeti U Murthy wrote: > Don't you think we should go conservative on the value of > RECLAIM_DISTANCE in arch specific code at-least? On powerpc we set it to > 10. Besides, the git log does not tell us the basis on which this value > was set to a default of 30. Maybe this needs re-thought? See my last email on how all this is entirely pointless :-( [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 12:41 ` Rik van Riel 2014-05-05 4:50 ` Preeti U Murthy @ 2014-05-06 13:25 ` Peter Zijlstra 2014-05-06 20:20 ` Rik van Riel 1 sibling, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 13:25 UTC (permalink / raw) To: Rik van Riel Cc: Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy [-- Attachment #1: Type: text/plain, Size: 690 bytes --] On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: > Even on 8-node DL980 systems, the NUMA distance in the > SLIT table is less than RECLAIM_DISTANCE, and we will > do wake_affine across the entire system. Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a metric for the SLIT distance. This (insofar as BIOS people would care to stick to specs anyhow) has led to the 'fun' situation where BIOS engineers tweak SLIT table values to make OSes behave as they think they should. So if the BIOS engineer finds that this system should have < RECLAIM_DISTANCE, they will simply make the table such that the max SLIT value is below that. And yes, I've seen this :-( [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-06 13:25 ` Peter Zijlstra @ 2014-05-06 20:20 ` Rik van Riel 2014-05-06 20:41 ` Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2014-05-06 20:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy On 05/06/2014 09:25 AM, Peter Zijlstra wrote: > On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: >> Even on 8-node DL980 systems, the NUMA distance in the >> SLIT table is less than RECLAIM_DISTANCE, and we will >> do wake_affine across the entire system. > > Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a > metric for the SLIT distance. This (in as far as BIOS people would care > to stick to specs anyhow) has lead to the 'fun' situation where BIOS > engineers tweak SLIT table values to make OSes behave as they thing it > should. > > So if the BIOS engineer finds that this system should have < > RECLAIM_DISTANCE it will simply make the table such that the max SLIT > value is below that. > > And yes, I've seen this :-( It appears to be the case on the vast majority of the NUMA systems that are actually in use. To me, this suggests that we should probably deal with it. -- All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-06 20:20 ` Rik van Riel @ 2014-05-06 20:41 ` Peter Zijlstra 2014-05-07 12:17 ` Ingo Molnar 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 20:41 UTC (permalink / raw) To: Rik van Riel Cc: Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy On Tue, May 06, 2014 at 04:20:59PM -0400, Rik van Riel wrote: > On 05/06/2014 09:25 AM, Peter Zijlstra wrote: > > On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: > >> Even on 8-node DL980 systems, the NUMA distance in the > >> SLIT table is less than RECLAIM_DISTANCE, and we will > >> do wake_affine across the entire system. > > > > Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a > > metric for the SLIT distance. This (in as far as BIOS people would care > > to stick to specs anyhow) has lead to the 'fun' situation where BIOS > > engineers tweak SLIT table values to make OSes behave as they thing it > > should. > > > > So if the BIOS engineer finds that this system should have < > > RECLAIM_DISTANCE it will simply make the table such that the max SLIT > > value is below that. > > > > And yes, I've seen this :-( > > It appears to be the case on the vast majority of the NUMA > systems that are actually in use. > > To me, this suggests that we should probably deal with it. What we could do is redefine this distance in hops; that'll force them to lie more blatantly and actually misrepresent the topology. ^ permalink raw reply [flat|nested] 46+ messages in thread
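To illustrate what "distance in hops" could mean in practice, here is a small stand-alone sketch (the SLIT values are invented, for a hypothetical 4-node ring): node pairs at the minimum remote distance are treated as adjacent, and hop counts fall out of an all-pairs shortest path. A firmware table can claim any distance values it likes, but hop counts derived from adjacency can only be fudged by misrepresenting which nodes are directly linked, which is Peter's point.

#include <stdio.h>

#define N 4

int main(void)
{
	/* Invented SLIT for a hypothetical 4-node ring (0-1-2-3-0). */
	int slit[N][N] = {
		{10, 20, 30, 20},
		{20, 10, 20, 30},
		{30, 20, 10, 20},
		{20, 30, 20, 10},
	};
	int hops[N][N];
	int min_remote = 255;

	/* The smallest off-diagonal distance defines "adjacent". */
	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			if (i != j && slit[i][j] < min_remote)
				min_remote = slit[i][j];

	/* One hop between adjacent nodes, "infinity" (N) otherwise. */
	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			hops[i][j] = i == j ? 0 :
				     slit[i][j] == min_remote ? 1 : N;

	/* Floyd-Warshall all-pairs shortest path over hop counts. */
	for (int k = 0; k < N; k++)
		for (int i = 0; i < N; i++)
			for (int j = 0; j < N; j++)
				if (hops[i][k] + hops[k][j] < hops[i][j])
					hops[i][j] = hops[i][k] + hops[k][j];

	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			printf("%d%c", hops[i][j], j == N - 1 ? '\n' : ' ');
	return 0;
}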
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-06 20:41 ` Peter Zijlstra @ 2014-05-07 12:17 ` Ingo Molnar 0 siblings, 0 replies; 46+ messages in thread From: Ingo Molnar @ 2014-05-07 12:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Rik van Riel, Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, george.mccollister, ktkhai, Preeti U Murthy * Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 06, 2014 at 04:20:59PM -0400, Rik van Riel wrote: > > On 05/06/2014 09:25 AM, Peter Zijlstra wrote: > > > On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: > > >> Even on 8-node DL980 systems, the NUMA distance in the > > >> SLIT table is less than RECLAIM_DISTANCE, and we will > > >> do wake_affine across the entire system. > > > > > > Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a > > > metric for the SLIT distance. This (in as far as BIOS people would care > > > to stick to specs anyhow) has lead to the 'fun' situation where BIOS > > > engineers tweak SLIT table values to make OSes behave as they thing it > > > should. > > > > > > So if the BIOS engineer finds that this system should have < > > > RECLAIM_DISTANCE it will simply make the table such that the max SLIT > > > value is below that. > > > > > > And yes, I've seen this :-( > > > > It appears to be the case on the vast majority of the NUMA systems > > that are actually in use. > > > > To me, this suggests that we should probably deal with it. > > What we could do is redefine this distance in hops, that'll force > them to lie more blatantly and actually miss represent the topology. and we should make sure we reduce any graph they represent, so that they can lie only through very heavy misrepresentation of the topology (i.e. not just weight tweaks) which will bite them in other areas (like the mm). Thanks, Ingo ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy 2014-05-04 12:04 ` Mike Galbraith 2014-05-04 12:41 ` Rik van Riel @ 2014-05-06 11:56 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 11:56 UTC (permalink / raw) To: Preeti Murthy Cc: Rik van Riel, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy [-- Attachment #1: Type: text/plain, Size: 620 bytes --] On Sun, May 04, 2014 at 05:14:59PM +0530, Preeti Murthy wrote: > As far as my understanding goes, the logic in select_task_rq_fair() > does wake_affine() or calls select_idle_sibling() only at those > levels of sched domains where the flag SD_WAKE_AFFINE is set. > This flag is not set at the numa domain and hence they will not be > balancing across numa nodes. So I don't understand how > *these functions* are affecting NUMA placements. It _is_ set at NUMA domains, just not on those > RECLAIM_DISTANCE. That means you typically need a non-fully connected system and then the top numa tier will not get wake affine. [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
Thread overview: 46+ messages
2014-05-02 4:42 [PATCH RFC/TEST] sched: make sync affine wakeups work Rik van Riel
2014-05-02 5:32 ` Mike Galbraith
2014-05-02 5:41 ` Mike Galbraith
2014-05-02 5:58 ` Mike Galbraith
2014-05-02 6:08 ` Rik van Riel
2014-05-02 6:36 ` Mike Galbraith
2014-05-02 6:51 ` Mike Galbraith
2014-05-02 6:13 ` Mike Galbraith
2014-05-02 6:30 ` Rik van Riel
2014-05-02 7:37 ` Mike Galbraith
2014-05-02 10:56 ` Rik van Riel
2014-05-02 11:27 ` Mike Galbraith
2014-05-02 12:51 ` Mike Galbraith
[not found] ` <5363B793.9010208@redhat.com>
2014-05-06 11:54 ` Peter Zijlstra
2014-05-06 20:19 ` Rik van Riel
2014-05-06 20:39 ` Peter Zijlstra
2014-05-06 23:46 ` Rik van Riel
2014-05-09 2:20 ` Rik van Riel
2014-05-09 5:27 ` [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu Rik van Riel
2014-05-09 6:04 ` [PATCH] sched: clean up select_task_rq_fair conditionals and indentation Rik van Riel
2014-05-09 7:34 ` [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu Mike Galbraith
2014-05-09 14:22 ` Rik van Riel
2014-05-09 15:24 ` Mike Galbraith
2014-05-09 15:24 ` Rik van Riel
2014-05-09 17:55 ` Mike Galbraith
2014-05-09 18:16 ` Rik van Riel
2014-05-10 3:54 ` Mike Galbraith
2014-05-13 14:08 ` Rik van Riel
2014-05-14 4:08 ` Mike Galbraith
2014-05-14 15:40 ` [PATCH] sched: call select_idle_sibling when not affine_sd Rik van Riel
2014-05-14 15:45 ` Peter Zijlstra
2014-05-19 13:08 ` [tip:sched/core] " tip-bot for Rik van Riel
2014-05-22 12:27 ` [tip:sched/core] sched: Call select_idle_sibling() " tip-bot for Rik van Riel
2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy
2014-05-04 12:04 ` Mike Galbraith
2014-05-05 4:38 ` Preeti U Murthy
2014-05-04 12:41 ` Rik van Riel
2014-05-05 4:50 ` Preeti U Murthy
2014-05-05 6:43 ` Preeti U Murthy
2014-05-05 11:28 ` Rik van Riel
2014-05-06 13:26 ` Peter Zijlstra
2014-05-06 13:25 ` Peter Zijlstra
2014-05-06 20:20 ` Rik van Riel
2014-05-06 20:41 ` Peter Zijlstra
2014-05-07 12:17 ` Ingo Molnar
2014-05-06 11:56 ` Peter Zijlstra