* [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 4:42 UTC
  To: linux-kernel
  Cc: morten.rasmussen, mingo, peterz, george.mccollister, ktkhai, umgwanakikbuti

Currently sync wakeups from the wake_affine code cannot work as
designed, because the task doing the sync wakeup from the target
cpu will block its wakee from selecting that cpu.

This is despite the fact that whether or not the wakeup is sync
determines whether or not we want to do an affine wakeup...

This patch allows sync wakeups to pick the waker's cpu. Whether or
not this is the right thing to do remains to be seen, but it does
allow us to verify whether or not the wake_affine strategy of always
doing affine wakeups and only disabling them in a specific
circumstance is sound, or needs rethinking...

Not-yet-signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..7f8fe16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4358,13 +4358,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, int sync)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) || (sync && cpu_rq(target)->nr_running == 1))
 		return target;
 
 	/*
@@ -4453,7 +4453,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, sync);
 		goto unlock;
 	}
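(A user-space aside, not part of the patch: pipe writers wake their
readers with the sync hint -- pipe_write() uses
wake_up_interruptible_sync_poll() -- so a minimal 1:1 pipe ping-pong
like the sketch below exercises exactly the path the patch changes.
The iteration count is arbitrary and nothing is pinned; placement can
be watched with e.g. perf sched.)

    /* ping-pong over two pipes; every write is a sync wakeup */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int ping[2], pong[2];
        char c = 'x';

        if (pipe(ping) || pipe(pong)) {
            perror("pipe");
            return 1;
        }
        if (fork() == 0) {              /* child: echo each byte back */
            while (read(ping[0], &c, 1) == 1)
                write(pong[1], &c, 1);
            _exit(0);
        }
        for (int i = 0; i < 100000; i++) {  /* parent: ping, wait for pong */
            write(ping[1], &c, 1);
            read(pong[0], &c, 1);
        }
        return 0;
    }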
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 5:32 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> Currently sync wakeups from the wake_affine code cannot work as
> designed, because the task doing the sync wakeup from the target
> cpu will block its wakee from selecting that cpu.
>
> This is despite the fact that whether or not the wakeup is sync
> determines whether or not we want to do an affine wakeup...

If the sync hint really did mean we ARE going to schedule RSN, waking
local would be a good thing.  It is all too often a big fat lie.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 5:41 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> > Currently sync wakeups from the wake_affine code cannot work as
> > designed, because the task doing the sync wakeup from the target
> > cpu will block its wakee from selecting that cpu.
> >
> > This is despite the fact that whether or not the wakeup is sync
> > determines whether or not we want to do an affine wakeup...
>
> If the sync hint really did mean we ARE going to schedule RSN, waking
> local would be a good thing.  It is all too often a big fat lie.

BTW, we used to have avg_overlap as a heuristic to improve the accuracy
of the sync wakeup hint, but I ripped it out because it was dead wrong
far too often.  You can try reviving it, but preemption among other
things breaks it.  You'll have better luck with a simple context switch
rate heuristic, but that's less than perfect too.

	-Mike
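(Sketch only, not kernel code: one shape such a context switch rate
heuristic could take.  p->nvcsw is a real task_struct counter of
voluntary switches; the csw_stamp/csw_mark sampling fields and the
one-switch-per-millisecond threshold are invented for illustration.)

    /* trust the sync hint only if the waker has been blocking often */
    static int waker_switches_fast(struct task_struct *waker, u64 now)
    {
        u64 delta = now - waker->csw_stamp;              /* hypothetical field */
        unsigned long nvcsw = waker->nvcsw - waker->csw_mark; /* hypothetical */

        waker->csw_stamp = now;
        waker->csw_mark = waker->nvcsw;

        /* roughly: at least one voluntary switch per millisecond */
        return nvcsw && delta / nvcsw < NSEC_PER_MSEC;
    }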
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 5:58 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> > Currently sync wakeups from the wake_affine code cannot work as
> > designed, because the task doing the sync wakeup from the target
> > cpu will block its wakee from selecting that cpu.
> >
> > This is despite the fact that whether or not the wakeup is sync
> > determines whether or not we want to do an affine wakeup...
>
> If the sync hint really did mean we ARE going to schedule RSN, waking
> local would be a good thing.  It is all too often a big fat lie.

One example of that is say pgbench.  The mother of all work (server
thread) for that load wakes with the sync hint.  Let the server wake
the first of a small herd CPU affine, and that first wakee then
preempts the server (mother of all work) that drives the entire load.

Byebye throughput.

When there's only one wakee, and there's really not enough overlap to
at least break even, waking CPU affine is a great idea.  Even when your
wakees only run for a short time, if you wake/get-preempted repeatedly,
the load will serialize.

	-Mike
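(A sketch of the 1:N shape described above, under the same caveats as
the earlier ping-pong sketch: one distributor sync-wakes a small herd
over pipes; if wakee #1 lands on, and preempts, the distributor's CPU,
wakees #2..N have to wait behind it.  NWORKERS, the spin loop, and the
round count are arbitrary, and inherited fds are not tidied up.)

    #include <stdio.h>
    #include <unistd.h>

    #define NWORKERS 4

    int main(void)
    {
        int fd[NWORKERS][2];
        char c = 'x';

        for (int i = 0; i < NWORKERS; i++) {
            if (pipe(fd[i]))
                return 1;
            if (fork() == 0) {  /* worker: consume, then burn a bit of CPU */
                while (read(fd[i][0], &c, 1) == 1)
                    for (volatile int n = 0; n < 100000; n++)
                        ;
                _exit(0);
            }
        }
        for (int round = 0; round < 10000; round++)   /* distributor */
            for (int i = 0; i < NWORKERS; i++)
                write(fd[i][1], &c, 1);
        return 0;
    }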
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 6:08 UTC
  To: Mike Galbraith
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On 05/02/2014 01:58 AM, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
>>> Currently sync wakeups from the wake_affine code cannot work as
>>> designed, because the task doing the sync wakeup from the target
>>> cpu will block its wakee from selecting that cpu.
>>>
>>> This is despite the fact that whether or not the wakeup is sync
>>> determines whether or not we want to do an affine wakeup...
>>
>> If the sync hint really did mean we ARE going to schedule RSN, waking
>> local would be a good thing.  It is all too often a big fat lie.
>
> One example of that is say pgbench.  The mother of all work (server
> thread) for that load wakes with the sync hint.  Let the server wake
> the first of a small herd CPU affine, and that first wakee then
> preempts the server (mother of all work) that drives the entire load.
>
> Byebye throughput.
>
> When there's only one wakee, and there's really not enough overlap to
> at least break even, waking CPU affine is a great idea.  Even when your
> wakees only run for a short time, if you wake/get-preempted repeatedly,
> the load will serialize.

I see a similar issue with specjbb2013, with 4 backend and 4 frontend
JVMs on a 4 node NUMA system.

The NUMA balancing code nicely places the memory of each JVM on one
NUMA node, but then the wake_affine code will happily run all of the
threads anywhere on the system, totally ruining memory locality.

The front end and back end only exchange a few hundred messages a
second, over loopback tcp, so the switching rate between threads is
quite low...

I wonder if it would make sense for wake_affine to be off by default,
and only switch on when the right conditions are detected, instead of
having it on by default like we have now?

I have some ideas on that, but I should probably catch some sleep
before trying to code them up :)

Meanwhile, the test patch that I posted may help us figure out whether
the "sync" option in the current wake_affine code does anything useful.

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 6:36 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 02:08 -0400, Rik van Riel wrote:
> On 05/02/2014 01:58 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
> >> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> >>> Currently sync wakeups from the wake_affine code cannot work as
> >>> designed, because the task doing the sync wakeup from the target
> >>> cpu will block its wakee from selecting that cpu.
> >>>
> >>> This is despite the fact that whether or not the wakeup is sync
> >>> determines whether or not we want to do an affine wakeup...
> >>
> >> If the sync hint really did mean we ARE going to schedule RSN, waking
> >> local would be a good thing.  It is all too often a big fat lie.
> >
> > One example of that is say pgbench.  The mother of all work (server
> > thread) for that load wakes with the sync hint.  Let the server wake
> > the first of a small herd CPU affine, and that first wakee then
> > preempts the server (mother of all work) that drives the entire load.
> >
> > Byebye throughput.
> >
> > When there's only one wakee, and there's really not enough overlap to
> > at least break even, waking CPU affine is a great idea.  Even when your
> > wakees only run for a short time, if you wake/get-preempted repeatedly,
> > the load will serialize.
>
> I see a similar issue with specjbb2013, with 4 backend and 4 frontend
> JVMs on a 4 node NUMA system.
>
> The NUMA balancing code nicely places the memory of each JVM on one
> NUMA node, but then the wake_affine code will happily run all of the
> threads anywhere on the system, totally ruining memory locality.

Hm, I thought numasched got excessive pull crap under control.

For steady hefty loads, you want to kill all but periodic load
balancing once the thing gets cranked up.  The less you move tasks, the
better the load will perform.  Bursty loads exist too though, damn the
bad luck.

> The front end and back end only exchange a few hundred messages a
> second, over loopback tcp, so the switching rate between threads is
> quite low...
>
> I wonder if it would make sense for wake_affine to be off by default,
> and only switch on when the right conditions are detected, instead of
> having it on by default like we have now?

Not IMHO, but I have seen situations where that was exactly what I
recommended to fix the throughput problem the user was having.

Reason why is that case was on a box where FAIR_SLEEPERS is disabled by
default, meaning there is no such thing as wakeup preemption.  Guess
what happens when you don't have shared LLC for a fast/light wakee to
escape to when the waker is a pig.  The worst thing possible in that
case is to wake affine.  Leave the poor thing wherever it was, else it
will take a latency hit that need not have been.

> I have some ideas on that, but I should probably catch some sleep
> before trying to code them up :)

Yeah, there are many aspects to ponder.

> Meanwhile, the test patch that I posted may help us figure out whether
> the "sync" option in the current wake_affine code does anything useful.

If I had a NAK stamp and digital ink pad, that patch wouldn't be
readable, much less applicable ;-)

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 6:51 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 08:36 +0200, Mike Galbraith wrote:
> Reason why is that case was on a box where FAIR_SLEEPERS is disabled by
> default, meaning there is no such thing as wakeup preemption.  Guess
> what happens when you don't have shared LLC for a fast/light wakee to
> escape to when the waker is a pig.  The worst thing possible in that
> case is to wake affine.  Leave the poor thing wherever it was, else it
> will take a latency hit that need not have been.

Oh yeah, and you'll see similar issues playing with kvm.  No escape
routes are available, as no llc domain exists.  Globally do a sync
wakeup CPU affine, and for some loads that will induce massive
wreckage, whereas if select_idle_sibling() had been there to save the
day, all would have been peachy.

Is it good, or is it evil... depends.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 6:13 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> Whether or not this is the right thing to do remains to be seen,
> but it does allow us to verify whether or not the wake_affine
> strategy of always doing affine wakeups and only disabling them
> in a specific circumstance is sound, or needs rethinking...

Yes, it needs rethinking.

I know why you want to try this, yes, select_idle_sibling() is very
much a two-faced little bitch.  NOHZ doesn't help, my kernels throttle
it.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 6:30 UTC
  To: Mike Galbraith
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
>
>> Whether or not this is the right thing to do remains to be seen,
>> but it does allow us to verify whether or not the wake_affine
>> strategy of always doing affine wakeups and only disabling them
>> in a specific circumstance is sound, or needs rethinking...
>
> Yes, it needs rethinking.
>
> I know why you want to try this, yes, select_idle_sibling() is very
> much a two-faced little bitch.

My biggest problem with select_idle_sibling and wake_affine in general
is that it will override NUMA placement, even when processes only wake
each other up infrequently...

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 7:37 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> >
> >> Whether or not this is the right thing to do remains to be seen,
> >> but it does allow us to verify whether or not the wake_affine
> >> strategy of always doing affine wakeups and only disabling them
> >> in a specific circumstance is sound, or needs rethinking...
> >
> > Yes, it needs rethinking.
> >
> > I know why you want to try this, yes, select_idle_sibling() is very
> > much a two-faced little bitch.
>
> My biggest problem with select_idle_sibling and wake_affine in general
> is that it will override NUMA placement, even when processes only wake
> each other up infrequently...

Hm, seems the thing to do would be to tell select_task_rq_fair() to
keep its mitts off of tasks that the numasched stuff has placed rather
than decapitating select_idle_sibling() or some other drastic measure.

	-Mike
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-02 10:56 UTC
  To: Mike Galbraith
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On 05/02/2014 03:37 AM, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
>> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
>>>
>>>> Whether or not this is the right thing to do remains to be seen,
>>>> but it does allow us to verify whether or not the wake_affine
>>>> strategy of always doing affine wakeups and only disabling them
>>>> in a specific circumstance is sound, or needs rethinking...
>>>
>>> Yes, it needs rethinking.
>>>
>>> I know why you want to try this, yes, select_idle_sibling() is very
>>> much a two-faced little bitch.
>>
>> My biggest problem with select_idle_sibling and wake_affine in general
>> is that it will override NUMA placement, even when processes only wake
>> each other up infrequently...
>
> Hm, seems the thing to do would be to tell select_task_rq_fair() to
> keep its mitts off of tasks that the numasched stuff has placed rather
> than decapitating select_idle_sibling() or some other drastic measure.

Thing is, if tasks are waking each other up frequently enough, we
probably DO want to place them near each other with
select_idle_sibling.

We just cannot afford to have it as the default behaviour for casual
wakeup activity, because it will mess up other things.

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 11:27 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 06:56 -0400, Rik van Riel wrote:
> On 05/02/2014 03:37 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
> >> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> >>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> >>>
> >>>> Whether or not this is the right thing to do remains to be seen,
> >>>> but it does allow us to verify whether or not the wake_affine
> >>>> strategy of always doing affine wakeups and only disabling them
> >>>> in a specific circumstance is sound, or needs rethinking...
> >>>
> >>> Yes, it needs rethinking.
> >>>
> >>> I know why you want to try this, yes, select_idle_sibling() is very
> >>> much a two-faced little bitch.
> >>
> >> My biggest problem with select_idle_sibling and wake_affine in general
> >> is that it will override NUMA placement, even when processes only wake
> >> each other up infrequently...
> >
> > Hm, seems the thing to do would be to tell select_task_rq_fair() to
> > keep its mitts off of tasks that the numasched stuff has placed rather
> > than decapitating select_idle_sibling() or some other drastic measure.
>
> Thing is, if tasks are waking each other up frequently enough, we
> probably DO want to place them near each other with
> select_idle_sibling.

Right.  I'm thinking you could perhaps create a sched feature like
NUMA_ME_HARDER or such so you can tell it to go away if you find that
your load performs best when movement is left entirely up to the NUMA
placement code.

> We just cannot afford to have it as the default behaviour for casual
> wakeup activity, because it will mess up other things.

I think it is generally good, but yes, it has its bad side, which is
why we have tweakables.

	-Mike
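(For reference, the knob sketched above would be a sched feature plus a
guard; a minimal sketch follows.  SCHED_FEAT()/sched_feat() are the
existing feature machinery, while the NUMA_ME_HARDER bit itself, the
guard, and its placement are hypothetical.)

    /* sketch: kernel/sched/features.h -- hypothetical feature bit */
    SCHED_FEAT(NUMA_ME_HARDER, false)

    /*
     * sketch: early in select_task_rq_fair(); the guard is hypothetical,
     * p->numa_preferred_nid is the real NUMA balancing preference.
     */
    if (sched_feat(NUMA_ME_HARDER) && p->numa_preferred_nid != -1 &&
        cpu_to_node(prev_cpu) == p->numa_preferred_nid)
        return prev_cpu;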
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Mike Galbraith
  Date: 2014-05-02 12:51 UTC
  To: Rik van Riel
  Cc: linux-kernel, morten.rasmussen, mingo, peterz, george.mccollister, ktkhai

On Fri, 2014-05-02 at 13:27 +0200, Mike Galbraith wrote:
> On Fri, 2014-05-02 at 06:56 -0400, Rik van Riel wrote:
> > On 05/02/2014 03:37 AM, Mike Galbraith wrote:
> > > On Fri, 2014-05-02 at 02:30 -0400, Rik van Riel wrote:
> > >> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
> > >>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> > >>>
> > >>>> Whether or not this is the right thing to do remains to be seen,
> > >>>> but it does allow us to verify whether or not the wake_affine
> > >>>> strategy of always doing affine wakeups and only disabling them
> > >>>> in a specific circumstance is sound, or needs rethinking...
> > >>>
> > >>> Yes, it needs rethinking.
> > >>>
> > >>> I know why you want to try this, yes, select_idle_sibling() is very
> > >>> much a two-faced little bitch.
> > >>
> > >> My biggest problem with select_idle_sibling and wake_affine in general
> > >> is that it will override NUMA placement, even when processes only wake
> > >> each other up infrequently...
> > >
> > > Hm, seems the thing to do would be to tell select_task_rq_fair() to
> > > keep its mitts off of tasks that the numasched stuff has placed rather
> > > than decapitating select_idle_sibling() or some other drastic measure.
> >
> > Thing is, if tasks are waking each other up frequently enough, we
> > probably DO want to place them near each other with
> > select_idle_sibling.
>
> Right.  I'm thinking you could perhaps create a sched feature like
> NUMA_ME_HARDER or such so you can tell it to go away if you find that
> your load performs best when movement is left entirely up to the NUMA
> placement code.
>
> > We just cannot afford to have it as the default behaviour for casual
> > wakeup activity, because it will mess up other things.
>
> I think it is generally good, but yes, it has its bad side, which is
> why we have tweakables.

Slightly garbled (multitasking).  To make that a little clearer:
generally, trying to bring waker and wakee together is a good thing.
I think changing the default to avoid doing the generally good thing
is wrong.  Box drivers should be given whatever knobs they need to
adjust as required, and use them, because there is no one correct
answer.

	-Mike
[parent not found: <5363B793.9010208@redhat.com>]
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Peter Zijlstra
  Date: 2014-05-06 11:54 UTC
  To: Rik van Riel
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
> of a NUMA system by default, so even the non-affine wakeup will end
> up looking for the lowest load NUMA node to start up on.

I can't find it being set on anything by default.
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-06 20:19 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
> On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
>> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
>> of a NUMA system by default, so even the non-affine wakeup will end
>> up looking for the lowest load NUMA node to start up on.
>
> I can't find it being set on anything by default.

	.flags = 1*SD_LOAD_BALANCE
	       | 1*SD_BALANCE_NEWIDLE
	       | 0*SD_BALANCE_EXEC
	       | 0*SD_BALANCE_FORK
	       | 0*SD_BALANCE_WAKE
	       | 0*SD_WAKE_AFFINE
	       | 0*SD_SHARE_CPUPOWER
	       | 0*SD_SHARE_PKG_RESOURCES
	       | 1*SD_SERIALIZE
	       | 0*SD_PREFER_SIBLING
	       | 1*SD_NUMA
	       | sd_local_flags(level)

static inline int sd_local_flags(int level)
{
	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
		return 0;

	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
}

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Peter Zijlstra
  Date: 2014-05-06 20:39 UTC
  To: Rik van Riel
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Tue, May 06, 2014 at 04:19:21PM -0400, Rik van Riel wrote:
> On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
> > On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
> >> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
                                     ^^^^^^^^^^^^^^^
> >> of a NUMA system by default, so even the non-affine wakeup will end
> >> up looking for the lowest load NUMA node to start up on.
> >
> > I can't find it being set on anything by default.
>
> 	.flags = 1*SD_LOAD_BALANCE
> 	       | 1*SD_BALANCE_NEWIDLE
> 	       | 0*SD_BALANCE_EXEC
> 	       | 0*SD_BALANCE_FORK
> 	       | 0*SD_BALANCE_WAKE
	           ^ last time I checked 0*x was still 0.
> 	       | 0*SD_WAKE_AFFINE
> 	       | 0*SD_SHARE_CPUPOWER
> 	       | 0*SD_SHARE_PKG_RESOURCES
> 	       | 1*SD_SERIALIZE
> 	       | 0*SD_PREFER_SIBLING
> 	       | 1*SD_NUMA
> 	       | sd_local_flags(level)
>
> static inline int sd_local_flags(int level)
> {
> 	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
> 		return 0;
>
> 	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
> }

No BALANCE_WAKE there.
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-06 23:46 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/06/2014 04:39 PM, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 04:19:21PM -0400, Rik van Riel wrote:
>> On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
>>> On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
>>>> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
>                                    ^^^^^^^^^^^^^^^
>> static inline int sd_local_flags(int level)
>> {
>> 	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
>> 		return 0;
>>
>> 	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
>> }
>
> No BALANCE_WAKE there.

Yeah you're right, I typoed in my earlier email and didn't pick up on
it later.  Doh.

-- 
All rights reversed
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
  From: Rik van Riel
  Date: 2014-05-09 2:20 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/06/2014 04:39 PM, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 04:19:21PM -0400, Rik van Riel wrote:
>> On 05/06/2014 07:54 AM, Peter Zijlstra wrote:
>>> On Fri, May 02, 2014 at 11:19:47AM -0400, Rik van Riel wrote:
>>>> As an aside, it also looks like SD_BALANCE_WAKE is set on all domains
>                                    ^^^^^^^^^^^^^^^
> No BALANCE_WAKE there.

Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
argument.

static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	...
	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
	...

static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
	cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
	...

static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
	...
	if (sd_flag & SD_BALANCE_WAKE) {
		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
			want_affine = 1;
		new_cpu = prev_cpu;
	}

Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?

-- 
All rights reversed
* [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 5:27 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
> Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
> passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
> argument.
>
> Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?

I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
"this is a wakeup of a previously existing task, please place it
properly".

However, it appears that the current code will fall back to the large
loop with find_idlest_group and friends, if prev_cpu and cpu are not
part of the same SD_WAKE_AFFINE sched domain.  That is a bug...

---8<---
Subject: sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu

If prev_cpu and cpu are not part of the same SD_WAKE_AFFINE domain,
the current code will fall back to the find_idlest_group CPU selection
mechanism, instead of trying to wake up the task on a CPU near its
previous CPU.

Fix that by always calling select_idle_sibling when doing an
SD_BALANCE_WAKE wakeup.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..5d33fb1b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4452,7 +4452,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	if (affine_sd) {
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
+	}
 
+	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
* [PATCH] sched: clean up select_task_rq_fair conditionals and indentation
  From: Rik van Riel
  Date: 2014-05-09 6:04 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

This cleanup goes on top of the previous patch.

We could also skip the whole domain iteration if !want_affine, but
that doesn't fit nicely in 80 columns and may have to be done in yet
another patch (breaking that bit of the code out into its own
function?)

---8<---
Subject: sched: clean up select_task_rq_fair conditionals and indentation

The whole top half of select_task_rq_fair is only run if
sd_flag & SD_BALANCE_WAKE.  The code should probably reflect that.

Also remove the useless new_cpu = prev_cpu assignment, as the value
of new_cpu is not actually used.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 47 ++++++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d33fb1b..844807b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4424,37 +4424,38 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
+	rcu_read_lock();
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
-		new_cpu = prev_cpu;
-	}
 
-	rcu_read_lock();
-	for_each_domain(cpu, tmp) {
-		if (!(tmp->flags & SD_LOAD_BALANCE))
-			continue;
+		for_each_domain(cpu, tmp) {
+			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+				break;
 
-		/*
-		 * If both cpu and prev_cpu are part of this domain,
-		 * cpu is a valid SD_WAKE_AFFINE target.
-		 */
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
-			affine_sd = tmp;
-			break;
-		}
+			if (!(tmp->flags & SD_LOAD_BALANCE))
+				continue;
 
-		if (tmp->flags & sd_flag)
-			sd = tmp;
-	}
+			/*
+			 * If both cpu and prev_cpu are part of this domain,
+			 * cpu is a valid SD_WAKE_AFFINE target.
+			 */
+			if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+			    cpumask_test_cpu(prev_cpu,
+					     sched_domain_span(tmp))) {
+				affine_sd = tmp;
+				break;
+			}
 
-	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
-	}
+			if (tmp->flags & sd_flag)
+				sd = tmp;
+		}
+
+		if (affine_sd) {
+			if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+				prev_cpu = cpu;
+		}
 
-	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-09 7:34 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
> On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
>
> > Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
> > passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
> > argument.
>
> > Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?
>
> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
> "this is a wakeup of a previously existing task, please place it
> properly".
>
> However, it appears that the current code will fall back to the large
> loop with find_idlest_group and friends, if prev_cpu and cpu are not
> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...

ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
we encounter a domain during traversal where Joe User has told us to do
(expensive) wake balancing before we hit a domain shared by
waker/wakee.

The user can turn SD_WAKE_AFFINE off beyond socket, and we'll not pull
cross node on wakeup.

Or, you could create an override button to say despite SD_WAKE_AFFINE
perhaps having been set during domain construction (because of some
pseudo-random numbers), don't do that if we have a preferred node, or
just make that automatically part of having numa scheduling enabled,
and don't bother wasting cycles if preferred && this != preferred.

	-Mike
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 14:22 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/09/2014 03:34 AM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
>> On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
>>
>>> Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
>>> passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
>>> argument.
>>
>>> Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?
>>
>> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
>> "this is a wakeup of a previously existing task, please place it
>> properly".
>>
>> However, it appears that the current code will fall back to the large
>> loop with find_idlest_group and friends, if prev_cpu and cpu are not
>> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...
>
> ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>
> We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
> we encounter a domain during traversal where Joe User has told us to do
> (expensive) wake balancing before we hit a domain shared by
> waker/wakee.
>
> The user can turn SD_WAKE_AFFINE off beyond socket, and we'll not pull
> cross node on wakeup.
>
> Or, you could create an override button to say despite SD_WAKE_AFFINE
> perhaps having been set during domain construction (because of some
> pseudo-random numbers), don't do that if we have a preferred node, or
> just make that automatically part of having numa scheduling enabled,
> and don't bother wasting cycles if preferred && this != preferred.

That's not the problem.

The problem is that if we do not do an affine wakeup, due to
SD_WAKE_AFFINE not being set on a top level domain, we will not try to
run p on prev_cpu, but we will fall through into the loop with
find_idlest_group, etc...

-- 
All rights reversed
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-09 15:24 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 10:22 -0400, Rik van Riel wrote:
> On 05/09/2014 03:34 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
> >> On Thu, 08 May 2014 22:20:25 -0400, Rik van Riel <riel@redhat.com> wrote:
> >>
> >>> Looks like SD_BALANCE_WAKE is not gotten from the sd flags at all, but
> >>> passed into select_task_rq by try_to_wake_up, as a hard coded sd_flags
> >>> argument.
> >>
> >>> Should we do that, if SD_BALANCE_WAKE is not set for any sched domain?
> >>
> >> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
> >> "this is a wakeup of a previously existing task, please place it
> >> properly".
> >>
> >> However, it appears that the current code will fall back to the large
> >> loop with find_idlest_group and friends, if prev_cpu and cpu are not
> >> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...
> >
> > ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> >
> > We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
> > we encounter a domain during traversal where Joe User has told us to do
> > (expensive) wake balancing before we hit a domain shared by
> > waker/wakee.
> >
> > The user can turn SD_WAKE_AFFINE off beyond socket, and we'll not pull
> > cross node on wakeup.
> >
> > Or, you could create an override button to say despite SD_WAKE_AFFINE
> > perhaps having been set during domain construction (because of some
> > pseudo-random numbers), don't do that if we have a preferred node, or
> > just make that automatically part of having numa scheduling enabled,
> > and don't bother wasting cycles if preferred && this != preferred.
>
> That's not the problem.
>
> The problem is that if we do not do an affine wakeup, due to
> SD_WAKE_AFFINE not being set on a top level domain, we will not try to
> run p on prev_cpu, but we will fall through into the loop with
> find_idlest_group, etc...

If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
remains NULL, we fall through to return prev_cpu.

	-Mike
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 15:24 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/09/2014 11:24 AM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 10:22 -0400, Rik van Riel wrote:
>> On 05/09/2014 03:34 AM, Mike Galbraith wrote:
>>> On Fri, 2014-05-09 at 01:27 -0400, Rik van Riel wrote:
>>>> I answered my own question.  The sd_flag SD_BALANCE_WAKE simply means
>>>> "this is a wakeup of a previously existing task, please place it
>>>> properly".
>>>>
>>>> However, it appears that the current code will fall back to the large
>>>> loop with find_idlest_group and friends, if prev_cpu and cpu are not
>>>> part of the same SD_WAKE_AFFINE sched domain.  That is a bug...
>>>
>>> ttwu():	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>>>
>>> We pass SD_BALANCE_WAKE for a normal wakeup, so sd will only be set if
>>> we encounter a domain during traversal where Joe User has told us to do
>>> (expensive) wake balancing before we hit a domain shared by
>>> waker/wakee.
>>
>> That's not the problem.
>>
>> The problem is that if we do not do an affine wakeup, due to
>> SD_WAKE_AFFINE not being set on a top level domain, we will not try to
>> run p on prev_cpu, but we will fall through into the loop with
>> find_idlest_group, etc...
>
> If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
> remains NULL, we fall through to return prev_cpu.

We do fall through, but into this loop:

	while (sd) {
		struct sched_group *group;
		int weight;

		if (!(sd->flags & sd_flag)) {
			sd = sd->child;
			continue;
		}

		group = find_idlest_group(sd, p, cpu, sd_flag);
		if (!group) {
			sd = sd->child;
			continue;
		}

		new_cpu = find_idlest_cpu(group, p, cpu);
		if (new_cpu == -1 || new_cpu == cpu) {
			/* Now try balancing at a lower domain level of cpu */
			sd = sd->child;
			continue;
		}

		/* Now try balancing at a lower domain level of new_cpu */
		cpu = new_cpu;
		weight = sd->span_weight;
		sd = NULL;
		for_each_domain(cpu, tmp) {
			if (weight <= tmp->span_weight)
				break;
			if (tmp->flags & sd_flag)
				sd = tmp;
		}
		/* while loop will break here if sd == NULL */
	}
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-09 17:55 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 11:24 -0400, Rik van Riel wrote:
> On 05/09/2014 11:24 AM, Mike Galbraith wrote:

> > If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
> > remains NULL, we fall through to return prev_cpu.
                                            ^^^^^^^^^^^^

> We do fall through, but into this loop:

> 	while (sd) {
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-09 18:16 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On 05/09/2014 01:55 PM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 11:24 -0400, Rik van Riel wrote:
>> On 05/09/2014 11:24 AM, Mike Galbraith wrote:
>
>>> If no ->flags & SD_BALANCE_WAKE is encountered during traversal, sd
>>> remains NULL, we fall through to return prev_cpu.
>                                           ^^^^^^^^^^^^
>
>> We do fall through, but into this loop:
>
>> 	while (sd) {

You are right.  That code is a little hard to follow...

That leaves the big question: do we want to fall back to prev_cpu if it
is not idle, even when it has an idle sibling, or would it be better to
find an idle sibling of prev_cpu when we wake up a task?
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-10 3:54 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai

On Fri, 2014-05-09 at 14:16 -0400, Rik van Riel wrote:
> That leaves the big question: do we want to fall back to prev_cpu if it
> is not idle, even when it has an idle sibling, or would it be better to
> find an idle sibling of prev_cpu when we wake up a task?

Yes.  If there was A correct answer, this stuff would be a lot easier.

	-Mike
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Rik van Riel
  Date: 2014-05-13 14:08 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu", Suresh Siddha

On 05/09/2014 11:54 PM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 14:16 -0400, Rik van Riel wrote:
>
>> That leaves the big question: do we want to fall back to prev_cpu if it
>> is not idle, even when it has an idle sibling, or would it be better to
>> find an idle sibling of prev_cpu when we wake up a task?
>
> Yes.  If there was A correct answer, this stuff would be a lot easier.

OK, after doing some other NUMA stuff, and then looking at the
scheduler again with a fresh mind, I have drawn some more conclusions
about what the scheduler does, and how it breaks NUMA locality :)

1) If the node_distance between nodes on a NUMA system is
   <= RECLAIM_DISTANCE, we will call select_idle_sibling for
   a wakeup of a previously existing task (SD_BALANCE_WAKE)

2) If the node distance exceeds RECLAIM_DISTANCE, we will
   wake up a task on prev_cpu, even if it is not currently
   idle

   This behaviour only happens on certain large NUMA systems,
   and is different from the behaviour on small systems.
   I suspect we will want to call select_idle_sibling with
   prev_cpu in case target and prev_cpu are not in the same
   SD_WAKE_AFFINE domain.

3) If wake_wide is false, we call select_idle_sibling with
   the CPU number of the task that is doing the wakeup

4) If wake_wide is true, we call select_idle_sibling with
   the CPU number the task was previously running on (prev_cpu)

   In effect, the "wake task on waking task's CPU" behaviour
   is the default, regardless of how frequently a task wakes up
   its wakee, and regardless of impact on NUMA locality.

   This may need to be changed.

5) select_idle_sibling will place the task on (3) or (4) only
   if the CPU is actually idle.  If task A communicates with
   task B through a pipe or a socket, and does a sync wakeup,
   task B will never be placed on task A's CPU (not idle yet),
   and it will only be placed on its own previous CPU if it is
   currently idle.

6) If neither CPU is idle, select_idle_sibling will walk all
   the CPUs in the SD_SHARE_PKG_RESOURCES SD of the target.
   This looks correct to me, though it could result in more
   work by the load balancing code later on, since it does not
   take load into account at all.  It is unclear if this needs
   any changes.

Am I overlooking anything?

What benchmarks should I run to test any changes I make?
Are there particular system types people want me to run tests with?

-- 
All rights reversed
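(For points 5 and 6, a condensed sketch of the select_idle_sibling()
flow in this era; abbreviated from kernel/sched/fair.c with the group
walk body elided, so treat it as a sketch rather than a verbatim
quote.)

    static int select_idle_sibling(struct task_struct *p, int target)
    {
        struct sched_domain *sd;
        struct sched_group *sg;
        int i = task_cpu(p);

        /* point 5: the wakeup target, only if it is actually idle */
        if (idle_cpu(target))
            return target;

        /* point 5: prev_cpu, only if cache affine with target and idle */
        if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
            return i;

        /* point 6: walk the LLC (SD_SHARE_PKG_RESOURCES) domain of target */
        sd = rcu_dereference(per_cpu(sd_llc, target));
        for_each_lower_domain(sd) {
            sg = sd->groups;
            do {
                /* ... take the first allowed CPU of a fully idle group ... */
                sg = sg->next;
            } while (sg != sd->groups);
        }
        return target;
    }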
* Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu
  From: Mike Galbraith
  Date: 2014-05-14 4:08 UTC
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu", Suresh Siddha

On Tue, 2014-05-13 at 10:08 -0400, Rik van Riel wrote:
> OK, after doing some other NUMA stuff, and then looking at the
> scheduler again with a fresh mind, I have drawn some more conclusions
> about what the scheduler does, and how it breaks NUMA locality :)
>
> 1) If the node_distance between nodes on a NUMA system is
>    <= RECLAIM_DISTANCE, we will call select_idle_sibling for
>    a wakeup of a previously existing task (SD_BALANCE_WAKE)
>
> 2) If the node distance exceeds RECLAIM_DISTANCE, we will
>    wake up a task on prev_cpu, even if it is not currently
>    idle
>
>    This behaviour only happens on certain large NUMA systems,
>    and is different from the behaviour on small systems.
>    I suspect we will want to call select_idle_sibling with
>    prev_cpu in case target and prev_cpu are not in the same
>    SD_WAKE_AFFINE domain.

Sometimes.  It's the same can of worms remote as it is local...
latency gain may or may not outweigh cache miss pain.

> 3) If wake_wide is false, we call select_idle_sibling with
>    the CPU number of the task that is doing the wakeup
>
> 4) If wake_wide is true, we call select_idle_sibling with
>    the CPU number the task was previously running on (prev_cpu)
>
>    In effect, the "wake task on waking task's CPU" behaviour
>    is the default, regardless of how frequently a task wakes up
>    its wakee, and regardless of impact on NUMA locality.
>
>    This may need to be changed.

That behavior also improves the odds of communicating tasks sharing a
cache though.

> Am I overlooking anything?

No, I think you're seeing where the worms live.

> What benchmarks should I run to test any changes I make?

Mixed bag; it'll affect all kinds: bursty, static, ramp up/down.

	-Mike
* [PATCH] sched: call select_idle_sibling when not affine_sd
  From: Rik van Riel
  Date: 2014-05-14 15:40 UTC
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu"

On Wed, 14 May 2014 06:08:09 +0200, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> On Tue, 2014-05-13 at 10:08 -0400, Rik van Riel wrote:
> > 1) If the node_distance between nodes on a NUMA system is
> >    <= RECLAIM_DISTANCE, we will call select_idle_sibling for
> >    a wakeup of a previously existing task (SD_BALANCE_WAKE)
> >
> > 2) If the node distance exceeds RECLAIM_DISTANCE, we will
> >    wake up a task on prev_cpu, even if it is not currently
> >    idle
> >
> >    This behaviour only happens on certain large NUMA systems,
> >    and is different from the behaviour on small systems.
> >    I suspect we will want to call select_idle_sibling with
> >    prev_cpu in case target and prev_cpu are not in the same
> >    SD_WAKE_AFFINE domain.
>
> Sometimes.  It's the same can of worms remote as it is local...
> latency gain may or may not outweigh cache miss pain.

Ahh, but it is a DIFFERENT can of worms.  If the distance between cpu
and prev_cpu exceeds RECLAIM_DISTANCE, we will not look for an idle
sibling in the same LLC domain as prev_cpu.  If the distance is
smaller, and we decide not to do an affine wakeup, then we do look for
an idle sibling of prev_cpu.

This patch makes sure that both types of systems have the same can of
worms :)

---8<---
Subject: sched: call select_idle_sibling when not affine_sd

On smaller systems, the top level sched domain will be an affine
domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
wakeup.  This seems to be working well.

On larger systems, with the node distance between far away NUMA nodes
being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.

This patch leaves in place the policy of not pulling the task across
nodes on such systems, while fixing the issue that select_idle_sibling
is not called at all in certain circumstances.

The code will look for an idle CPU in the same CPU package as the CPU
where the task ran previously.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39b63d0..1e58159 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4423,10 +4423,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		sd = tmp;
 	}
 
-	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
+	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		prev_cpu = cpu;
 
+	if (sd_flag & SD_WAKE_AFFINE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
* Re: [PATCH] sched: call select_idle_sibling when not affine_sd
  From: Peter Zijlstra
  Date: 2014-05-14 15:45 UTC
  To: Rik van Riel
  Cc: Mike Galbraith, linux-kernel, morten.rasmussen, mingo, george.mccollister, ktkhai, Mel Gorman, "Vinod, Chegu"

On Wed, May 14, 2014 at 11:40:37AM -0400, Rik van Riel wrote:
> Subject: sched: call select_idle_sibling when not affine_sd
>
> On smaller systems, the top level sched domain will be an affine
> domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
> wakeup.  This seems to be working well.
>
> On larger systems, with the node distance between far away NUMA nodes
> being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
> waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.
>
> This patch leaves in place the policy of not pulling the task across
> nodes on such systems, while fixing the issue that select_idle_sibling
> is not called at all in certain circumstances.
>
> The code will look for an idle CPU in the same CPU package as the CPU
> where the task ran previously.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/fair.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 39b63d0..1e58159 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4423,10 +4423,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		sd = tmp;
>  	}
>  
> -	if (affine_sd) {
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> -			prev_cpu = cpu;
> +	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		prev_cpu = cpu;
>  
> +	if (sd_flag & SD_WAKE_AFFINE) {

I think you meant SD_BALANCE_WAKE?

>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}
* [tip:sched/core] sched: call select_idle_sibling when not affine_sd 2014-05-14 15:40 ` [PATCH] sched: call select_idle_sibling when not affine_sd Rik van Riel 2014-05-14 15:45 ` Peter Zijlstra @ 2014-05-19 13:08 ` tip-bot for Rik van Riel 2014-05-22 12:27 ` [tip:sched/core] sched: Call select_idle_sibling() " tip-bot for Rik van Riel 2 siblings, 0 replies; 46+ messages in thread From: tip-bot for Rik van Riel @ 2014-05-19 13:08 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, peterz, umgwanakikbuti, riel, mgorman, tglx Commit-ID: b45cf72cf7e1dd3b4a95947f85659cfdc01dbdad Gitweb: http://git.kernel.org/tip/b45cf72cf7e1dd3b4a95947f85659cfdc01dbdad Author: Rik van Riel <riel@redhat.com> AuthorDate: Wed, 14 May 2014 11:40:37 -0400 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Mon, 19 May 2014 22:02:40 +0900 sched: call select_idle_sibling when not affine_sd On smaller systems, the top level sched domain will be an affine domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE wakeup. This seems to be working well. On larger systems, with the node distance between far away NUMA nodes being > RECLAIM_DISTANCE, select_idle_sibling is only called if the waker and the wakee are on nodes less than RECLAIM_DISTANCE apart. This patch leaves in place the policy of not pulling the task across nodes on such systems, while fixing the issue that select_idle_sibling is not called at all in certain circumstances. The code will look for an idle CPU in the same CPU package as the CPU where the task ran previously. Cc: morten.rasmussen@arm.com Cc: mingo@kernel.org Cc: george.mccollister@gmail.com Cc: ktkhai@parallels.com Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dd3fa14..429164d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4473,10 +4473,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f sd = tmp; } - if (affine_sd) { - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + prev_cpu = cpu; + if (sd_flag & SD_BALANCE_WAKE) { new_cpu = select_idle_sibling(p, prev_cpu); goto unlock; } ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [tip:sched/core] sched: Call select_idle_sibling() when not affine_sd 2014-05-14 15:40 ` [PATCH] sched: call select_idle_sibling when not affine_sd Rik van Riel 2014-05-14 15:45 ` Peter Zijlstra 2014-05-19 13:08 ` [tip:sched/core] " tip-bot for Rik van Riel @ 2014-05-22 12:27 ` tip-bot for Rik van Riel 2 siblings, 0 replies; 46+ messages in thread From: tip-bot for Rik van Riel @ 2014-05-22 12:27 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, peterz, umgwanakikbuti, riel, mgorman, tglx Commit-ID: 8bf21433f38b020c3d8a3805d1d7fb73d7b40c01 Gitweb: http://git.kernel.org/tip/8bf21433f38b020c3d8a3805d1d7fb73d7b40c01 Author: Rik van Riel <riel@redhat.com> AuthorDate: Wed, 14 May 2014 11:40:37 -0400 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Thu, 22 May 2014 11:16:28 +0200 sched: Call select_idle_sibling() when not affine_sd On smaller systems, the top level sched domain will be an affine domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE wakeup. This seems to be working well. On larger systems, with the node distance between far away NUMA nodes being > RECLAIM_DISTANCE, select_idle_sibling is only called if the waker and the wakee are on nodes less than RECLAIM_DISTANCE apart. This patch leaves in place the policy of not pulling the task across nodes on such systems, while fixing the issue that select_idle_sibling is not called at all in certain circumstances. The code will look for an idle CPU in the same CPU package as the CPU where the task ran previously. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: morten.rasmussen@arm.com Cc: george.mccollister@gmail.com Cc: ktkhai@parallels.com Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dd3fa14..429164d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4473,10 +4473,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f sd = tmp; } - if (affine_sd) { - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + prev_cpu = cpu; + if (sd_flag & SD_BALANCE_WAKE) { new_cpu = select_idle_sibling(p, prev_cpu); goto unlock; } ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-02 6:30 ` Rik van Riel 2014-05-02 7:37 ` Mike Galbraith @ 2014-05-04 11:44 ` Preeti Murthy 2014-05-04 12:04 ` Mike Galbraith ` (2 more replies) 1 sibling, 3 replies; 46+ messages in thread From: Preeti Murthy @ 2014-05-04 11:44 UTC (permalink / raw) To: Rik van Riel, umgwanakikbuti Cc: LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai, Preeti U Murthy Hi Rik, Mike On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: > On 05/02/2014 02:13 AM, Mike Galbraith wrote: >> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >> >>> Whether or not this is the right thing to do remains to be seen, >>> but it does allow us to verify whether or not the wake_affine >>> strategy of always doing affine wakeups and only disabling them >>> in a specific circumstance is sound, or needs rethinking... >> >> Yes, it needs rethinking. >> >> I know why you want to try this, yes, select_idle_sibling() is very much >> a two faced little bitch. > > My biggest problem with select_idle_sibling and wake_affine in > general is that it will override NUMA placement, even when > processes only wake each other up infrequently... As far as my understanding goes, the logic in select_task_rq_fair() calls wake_affine() or select_idle_sibling() only at those levels of sched domains where the flag SD_WAKE_AFFINE is set. This flag is not set at the numa domain and hence they will not be balancing across numa nodes. So I don't understand how *these functions* are affecting NUMA placements. The wake_affine() and select_idle_sibling() will shuttle tasks within a NUMA node as far as I can see, i.e. if the cpu that the task previously ran on and the waker cpu belong to the same node. Else they are not called. If the prev_cpu and the waker cpu are on different NUMA nodes then naturally the tasks will get shuttled across NUMA nodes, but the culprits are the find_idlest* functions. They do a top-down search for the idlest group and cpu, starting at the NUMA domain *attached to the waker and not the prev_cpu*. This means that the task will end up on a different NUMA node. It looks to me like the problem lies here and not in wake_affine() and select_idle_sibling(). Regards Preeti U Murthy > > -- > All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
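Preeti's two cases map onto the structure of select_task_rq_fair(); the fragment below is a paraphrased skeleton of that function from this era, with the two paths compressed into comments. It is a simplification for orientation, not verbatim kernel code.

	for_each_domain(cpu, tmp) {
		if (!(tmp->flags & SD_LOAD_BALANCE))
			continue;

		/*
		 * Remember the lowest SD_WAKE_AFFINE domain that spans
		 * both the waker's cpu and the wakee's prev_cpu.
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			affine_sd = tmp;
			break;
		}

		if (tmp->flags & sd_flag)
			sd = tmp;
	}

	if (affine_sd) {
		/* Fast path: wake_affine() plus select_idle_sibling(). */
	} else {
		/*
		 * Slow path: find_idlest_group()/find_idlest_cpu(), walking
		 * top-down from @sd, which hangs off the *waker's* cpu and
		 * not off prev_cpu -- the behaviour Preeti points at.
		 */
	}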
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy @ 2014-05-04 12:04 ` Mike Galbraith 2014-05-05 4:38 ` Preeti U Murthy 2014-05-04 12:41 ` Rik van Riel 2014-05-06 11:56 ` Peter Zijlstra 2 siblings, 1 reply; 46+ messages in thread From: Mike Galbraith @ 2014-05-04 12:04 UTC (permalink / raw) To: Preeti Murthy Cc: Rik van Riel, LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai, Preeti U Murthy On Sun, 2014-05-04 at 17:14 +0530, Preeti Murthy wrote: > Hi Rik, Mike > > On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: > > On 05/02/2014 02:13 AM, Mike Galbraith wrote: > >> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: > >> > >>> Whether or not this is the right thing to do remains to be seen, > >>> but it does allow us to verify whether or not the wake_affine > >>> strategy of always doing affine wakeups and only disabling them > >>> in a specific circumstance is sound, or needs rethinking... > >> > >> Yes, it needs rethinking. > >> > >> I know why you want to try this, yes, select_idle_sibling() is very much > >> a two faced little bitch. > > > > My biggest problem with select_idle_sibling and wake_affine in > > general is that it will override NUMA placement, even when > > processes only wake each other up infrequently... > > As far as my understanding goes, the logic in select_task_rq_fair() > does wake_affine() or calls select_idle_sibling() only at those > levels of sched domains where the flag SD_WAKE_AFFINE is set. > This flag is not set at the numa domain and hence they will not be > balancing across numa nodes. So I don't understand how > *these functions* are affecting NUMA placements. Depends on how far away node yonder is I suppose. static inline int sd_local_flags(int level) { if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE) return 0; return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; } ^ permalink raw reply [flat|nested] 46+ messages in thread
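The effect of the threshold Mike quotes is easy to demonstrate with a toy user-space program. Everything below is illustrative: the distances are invented, SD_WAKE_AFFINE's numeric value only matters as "non-zero", and sd_local_flags() is simplified to take a distance directly rather than a domain level.

#include <stdio.h>

#define RECLAIM_DISTANCE 30	/* generic default of this era */
#define SD_WAKE_AFFINE   0x20	/* illustrative; only non-zero-ness matters */

static int sd_local_flags(int distance)
{
	if (distance > RECLAIM_DISTANCE)
		return 0;
	return SD_WAKE_AFFINE;
}

int main(void)
{
	/* Hypothetical SLIT distances for successively remote NUMA levels. */
	int distance[] = { 10, 21, 31, 41 };

	for (int i = 0; i < 4; i++)
		printf("distance %2d -> wake affine %s\n", distance[i],
		       sd_local_flags(distance[i]) ? "allowed" : "disabled");
	return 0;
}

With the default of 30, the 10 and 21 levels keep SD_WAKE_AFFINE while 31 and 41 lose it, which is exactly the small-versus-large system split being discussed.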
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 12:04 ` Mike Galbraith @ 2014-05-05 4:38 ` Preeti U Murthy 0 siblings, 0 replies; 46+ messages in thread From: Preeti U Murthy @ 2014-05-05 4:38 UTC (permalink / raw) To: Mike Galbraith Cc: Preeti Murthy, Rik van Riel, LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai On 05/04/2014 05:34 PM, Mike Galbraith wrote: > On Sun, 2014-05-04 at 17:14 +0530, Preeti Murthy wrote: >> Hi Rik, Mike >> >> On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >>> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>>> >>>>> Whether or not this is the right thing to do remains to be seen, >>>>> but it does allow us to verify whether or not the wake_affine >>>>> strategy of always doing affine wakeups and only disabling them >>>>> in a specific circumstance is sound, or needs rethinking... >>>> >>>> Yes, it needs rethinking. >>>> >>>> I know why you want to try this, yes, select_idle_sibling() is very much >>>> a two faced little bitch. >>> >>> My biggest problem with select_idle_sibling and wake_affine in >>> general is that it will override NUMA placement, even when >>> processes only wake each other up infrequently... >> >> As far as my understanding goes, the logic in select_task_rq_fair() >> does wake_affine() or calls select_idle_sibling() only at those >> levels of sched domains where the flag SD_WAKE_AFFINE is set. >> This flag is not set at the numa domain and hence they will not be >> balancing across numa nodes. So I don't understand how >> *these functions* are affecting NUMA placements. > > Depends on how far away node yonder is I suppose. > > static inline int sd_local_flags(int level) > { > if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE) > return 0; > > return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; > } > > Hmm thanks Mike, I totally missed this! Regards Preeti U Murthy ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy 2014-05-04 12:04 ` Mike Galbraith @ 2014-05-04 12:41 ` Rik van Riel 2014-05-05 4:50 ` Preeti U Murthy 2014-05-06 13:25 ` Peter Zijlstra 2014-05-06 11:56 ` Peter Zijlstra 2 siblings, 2 replies; 46+ messages in thread From: Rik van Riel @ 2014-05-04 12:41 UTC (permalink / raw) To: Preeti Murthy, umgwanakikbuti Cc: LKML, Morten Rasmussen, Ingo Molnar, Peter Zijlstra, george.mccollister, ktkhai, Preeti U Murthy On 05/04/2014 07:44 AM, Preeti Murthy wrote: > Hi Rik, Mike > > On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>> >>>> Whether or not this is the right thing to do remains to be seen, >>>> but it does allow us to verify whether or not the wake_affine >>>> strategy of always doing affine wakeups and only disabling them >>>> in a specific circumstance is sound, or needs rethinking... >>> >>> Yes, it needs rethinking. >>> >>> I know why you want to try this, yes, select_idle_sibling() is very much >>> a two faced little bitch. >> >> My biggest problem with select_idle_sibling and wake_affine in >> general is that it will override NUMA placement, even when >> processes only wake each other up infrequently... > > As far as my understanding goes, the logic in select_task_rq_fair() > does wake_affine() or calls select_idle_sibling() only at those > levels of sched domains where the flag SD_WAKE_AFFINE is set. > This flag is not set at the numa domain and hence they will not be > balancing across numa nodes. So I don't understand how > *these functions* are affecting NUMA placements. Even on 8-node DL980 systems, the NUMA distance in the SLIT table is less than RECLAIM_DISTANCE, and we will do wake_affine across the entire system. > The wake_affine() and select_idle_sibling() will shuttle tasks > within a NUMA node as far as I can see.i.e. if the cpu that the task > previously ran on and the waker cpu belong to the same node. > Else they are not called. That is what I first hoped, too. I was wrong. > If the prev_cpu and the waker cpu are on different NUMA nodes > then naturally the tasks will get shuttled across NUMA nodes but > the culprits are the find_idlest* functions. > They do a top-down search for the idlest group and cpu, starting > at the NUMA domain *attached to the waker and not the prev_cpu*. > This means that the task will end up on a different NUMA node. > Looks to me that the problem lies here and not in the wake_affine() > and select_idle_siblings(). I have a patch for find_idlest_group that takes the NUMA distance between each group and the task's preferred node into account. However, as long as the wake_affine stuff still gets to override it, that does not make much difference :) -- All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
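Rik's find_idlest_group() patch is not shown in this thread; purely to illustrate the idea he describes -- this is hypothetical code, not his patch -- a distance-aware comparison could weight each candidate group's load by how far the group sits from the task's preferred node, so that remote groups must be proportionally idler to win:

/*
 * Hypothetical sketch only: scale a candidate group's average load by
 * the SLIT distance between that group and the task's preferred node.
 * A group twice as far away then has to look twice as idle before
 * find_idlest_group() would pick it.
 */
static unsigned long distance_weighted_load(struct sched_group *sg,
					    int preferred_nid,
					    unsigned long avg_load)
{
	int nid = cpu_to_node(cpumask_first(sched_group_cpus(sg)));

	return avg_load * node_distance(preferred_nid, nid);
}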
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 12:41 ` Rik van Riel @ 2014-05-05 4:50 ` Preeti U Murthy 2014-05-05 6:43 ` Preeti U Murthy ` (2 more replies) 1 sibling, 3 replies; 46+ messages in thread From: Preeti U Murthy @ 2014-05-05 4:50 UTC (permalink / raw) To: Rik van Riel, umgwanakikbuti, Peter Zijlstra Cc: Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai On 05/04/2014 06:11 PM, Rik van Riel wrote: > On 05/04/2014 07:44 AM, Preeti Murthy wrote: >> Hi Rik, Mike >> >> On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >>> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>>> >>>>> Whether or not this is the right thing to do remains to be seen, >>>>> but it does allow us to verify whether or not the wake_affine >>>>> strategy of always doing affine wakeups and only disabling them >>>>> in a specific circumstance is sound, or needs rethinking... >>>> >>>> Yes, it needs rethinking. >>>> >>>> I know why you want to try this, yes, select_idle_sibling() is very much >>>> a two faced little bitch. >>> >>> My biggest problem with select_idle_sibling and wake_affine in >>> general is that it will override NUMA placement, even when >>> processes only wake each other up infrequently... >> >> As far as my understanding goes, the logic in select_task_rq_fair() >> does wake_affine() or calls select_idle_sibling() only at those >> levels of sched domains where the flag SD_WAKE_AFFINE is set. >> This flag is not set at the numa domain and hence they will not be >> balancing across numa nodes. So I don't understand how >> *these functions* are affecting NUMA placements. > > Even on 8-node DL980 systems, the NUMA distance in the > SLIT table is less than RECLAIM_DISTANCE, and we will > do wake_affine across the entire system. > >> The wake_affine() and select_idle_sibling() will shuttle tasks >> within a NUMA node as far as I can see.i.e. if the cpu that the task >> previously ran on and the waker cpu belong to the same node. >> Else they are not called. > > That is what I first hoped, too. I was wrong. > >> If the prev_cpu and the waker cpu are on different NUMA nodes >> then naturally the tasks will get shuttled across NUMA nodes but >> the culprits are the find_idlest* functions. >> They do a top-down search for the idlest group and cpu, starting >> at the NUMA domain *attached to the waker and not the prev_cpu*. >> This means that the task will end up on a different NUMA node. >> Looks to me that the problem lies here and not in the wake_affine() >> and select_idle_siblings(). > > I have a patch for find_idlest_group that takes the NUMA > distance between each group and the task's preferred node > into account. > > However, as long as the wake_affine stuff still gets to > override it, that does not make much difference :) > Yeah, now I see it. But I still feel wake_affine() and select_idle_sibling() are not primarily at fault, because when they were introduced I don't think it was foreseen that the cpu topology would grow to the extent it has now. select_idle_sibling(), for instance, scans the cpus within the purview of the last level cache of a cpu, and this was a small set. Hence there was no overhead. Now with many cpus sharing the L3 cache, we see an overhead. wake_affine() probably did not expect the NUMA nodes to come under its governance as well, and hence it sees no harm in waking up tasks close to the waker because it still believes that it will be within a node. What has changed, I feel, is the code around these two functions. Take this problem for instance. We ourselves are saying in sd_local_flags() that this specific domain is fit for wake affine balance. So naturally the logic in wake_affine() and select_idle_sibling() will follow. My point is that the peripheral code is seeing the negative effect of these two functions because they have been pushed under its ambit. Don't you think we should go conservative on the value of RECLAIM_DISTANCE in arch-specific code at least? On powerpc we set it to 10. Besides, the git log does not tell us the basis on which this value was set to a default of 30. Maybe this needs a rethink? Regards Preeti U Murthy ^ permalink raw reply [flat|nested] 46+ messages in thread
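For reference, the two definitions Preeti mentions looked roughly like this at the time:

/* include/linux/topology.h -- the generic default: */
#ifndef RECLAIM_DISTANCE
#define RECLAIM_DISTANCE 30
#endif

/* arch/powerpc/include/asm/topology.h -- the powerpc override: */
#define RECLAIM_DISTANCE 10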
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-05 4:50 ` Preeti U Murthy @ 2014-05-05 6:43 ` Preeti U Murthy 2014-05-05 11:28 ` Rik van Riel 2014-05-06 13:26 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Preeti U Murthy @ 2014-05-05 6:43 UTC (permalink / raw) To: Rik van Riel, umgwanakikbuti, Peter Zijlstra Cc: Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai On 05/05/2014 10:20 AM, Preeti U Murthy wrote: > On 05/04/2014 06:11 PM, Rik van Riel wrote: >> On 05/04/2014 07:44 AM, Preeti Murthy wrote: >>> Hi Rik, Mike >>> >>> On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote: >>>> On 05/02/2014 02:13 AM, Mike Galbraith wrote: >>>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote: >>>>> >>>>>> Whether or not this is the right thing to do remains to be seen, >>>>>> but it does allow us to verify whether or not the wake_affine >>>>>> strategy of always doing affine wakeups and only disabling them >>>>>> in a specific circumstance is sound, or needs rethinking... >>>>> >>>>> Yes, it needs rethinking. >>>>> >>>>> I know why you want to try this, yes, select_idle_sibling() is very much >>>>> a two faced little bitch. >>>> >>>> My biggest problem with select_idle_sibling and wake_affine in >>>> general is that it will override NUMA placement, even when >>>> processes only wake each other up infrequently... >>> >>> As far as my understanding goes, the logic in select_task_rq_fair() >>> does wake_affine() or calls select_idle_sibling() only at those >>> levels of sched domains where the flag SD_WAKE_AFFINE is set. >>> This flag is not set at the numa domain and hence they will not be >>> balancing across numa nodes. So I don't understand how >>> *these functions* are affecting NUMA placements. >> >> Even on 8-node DL980 systems, the NUMA distance in the >> SLIT table is less than RECLAIM_DISTANCE, and we will >> do wake_affine across the entire system. >> >>> The wake_affine() and select_idle_sibling() will shuttle tasks >>> within a NUMA node as far as I can see.i.e. if the cpu that the task >>> previously ran on and the waker cpu belong to the same node. >>> Else they are not called. >> >> That is what I first hoped, too. I was wrong. >> >>> If the prev_cpu and the waker cpu are on different NUMA nodes >>> then naturally the tasks will get shuttled across NUMA nodes but >>> the culprits are the find_idlest* functions. >>> They do a top-down search for the idlest group and cpu, starting >>> at the NUMA domain *attached to the waker and not the prev_cpu*. >>> This means that the task will end up on a different NUMA node. >>> Looks to me that the problem lies here and not in the wake_affine() >>> and select_idle_siblings(). >> >> I have a patch for find_idlest_group that takes the NUMA >> distance between each group and the task's preferred node >> into account. >> >> However, as long as the wake_affine stuff still gets to >> override it, that does not make much difference :) >> > > Yeah now I see it. But I still feel wake_affine() and > select_idle_sibling() are not at fault primarily because when they were > introduced, I don't think it was foreseen that the cpu topology would > grow to the extent it is now. > > select_idle_sibling() for instance scans the cpus within the purview of > the last level cache of a cpu and this was a small set. Hence there was > no overhead. Now with many cpus sharing the L3 cache, we see an > overhead. wake_affine() probably did not expect the NUMA nodes to come > under its governance as well and hence it sees no harm in waking up > tasks close to the waker because it still believes that it will be > within a node. > > What has changed is the code around these two functions I feel. Take > this problem for instance. We ourselves are saying in sd_local_flags() > that this specific domain is fit for wake affine balance. So naturally > the logic in wake_affine and select_idle_sibling() will follow. > My point is the peripheral code is seeing the negative affect of these > two functions because they pushed themselves under its ambit. > > Don't you think we should go conservative on the value of > RECLAIM_DISTANCE in arch specific code at-least? On powerpc we set it to > 10. Besides, the git log does not tell us the basis on which this value > was set to a default of 30. Maybe this needs re-thought? Sorry, I overlooked this. Commit 32e45ff43eaf5c17f5a increased the value to 30, and the reason is clearly mentioned there: the value was arbitrarily chosen. I don't know if this will help the discussion, but I thought I would point it out. Thanks Regards Preeti U Murthy > > Regards > Preeti U Murthy > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-05 4:50 ` Preeti U Murthy 2014-05-05 6:43 ` Preeti U Murthy @ 2014-05-05 11:28 ` Rik van Riel 2014-05-06 13:26 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2014-05-05 11:28 UTC (permalink / raw) To: Preeti U Murthy, umgwanakikbuti, Peter Zijlstra Cc: Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai On 05/05/2014 12:50 AM, Preeti U Murthy wrote: > Yeah now I see it. But I still feel wake_affine() and > select_idle_sibling() are not at fault primarily because when they were > introduced, I don't think it was foreseen that the cpu topology would > grow to the extent it is now. It's not about "fault", it is about the fact that on current large NUMA systems they are broken, and could stand some improvement :) > select_idle_sibling() for instance scans the cpus within the purview of > the last level cache of a cpu and this was a small set. Hence there was > no overhead. Now with many cpus sharing the L3 cache, we see an > overhead. wake_affine() probably did not expect the NUMA nodes to come > under its governance as well and hence it sees no harm in waking up > tasks close to the waker because it still believes that it will be > within a node. If two tasks truly are related to each other, I think we will want to have the wake_affine logic pull them towards each other, all the way across a giant NUMA system if needs be. The problem is that the current wake_affine logic starts in the ON position, and only switches off in a few very specific scenarios. I suspect we would be better off with the reverse, starting with wake_affine in the off position, and switching it on when we detect it makes sense to do so. -- All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
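Purely as an illustration of that inversion -- the fields and the threshold below are invented, this is not a proposed patch -- an opt-in gate might demand positive evidence of wakeup coupling before the affine path is considered at all:

/*
 * Illustrative sketch only: wake_affine starts "off", and a waker/wakee
 * pair earns affine treatment by demonstrably waking each other often.
 * last_waker and waker_wakeups are invented fields, and
 * WAKE_COUPLING_THRESHOLD an invented tunable.
 */
static int want_affine_wakeup(struct task_struct *waker,
			      struct task_struct *wakee)
{
	return wakee->last_waker == waker &&
	       wakee->waker_wakeups >= WAKE_COUPLING_THRESHOLD;
}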
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-05 4:50 ` Preeti U Murthy 2014-05-05 6:43 ` Preeti U Murthy 2014-05-05 11:28 ` Rik van Riel @ 2014-05-06 13:26 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 13:26 UTC (permalink / raw) To: Preeti U Murthy Cc: Rik van Riel, umgwanakikbuti, Preeti Murthy, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai [-- Attachment #1: Type: text/plain, Size: 394 bytes --] On Mon, May 05, 2014 at 10:20:20AM +0530, Preeti U Murthy wrote: > Don't you think we should go conservative on the value of > RECLAIM_DISTANCE in arch specific code at-least? On powerpc we set it to > 10. Besides, the git log does not tell us the basis on which this value > was set to a default of 30. Maybe this needs re-thought? See my last email on how all this is entirely pointless :-( [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 12:41 ` Rik van Riel 2014-05-05 4:50 ` Preeti U Murthy @ 2014-05-06 13:25 ` Peter Zijlstra 2014-05-06 20:20 ` Rik van Riel 1 sibling, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 13:25 UTC (permalink / raw) To: Rik van Riel Cc: Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy [-- Attachment #1: Type: text/plain, Size: 690 bytes --] On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: > Even on 8-node DL980 systems, the NUMA distance in the > SLIT table is less than RECLAIM_DISTANCE, and we will > do wake_affine across the entire system. Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a metric for the SLIT distance. This (insofar as BIOS people would care to stick to specs anyhow) has led to the 'fun' situation where BIOS engineers tweak SLIT table values to make OSes behave as they think they should. So if the BIOS engineer finds that this system should have < RECLAIM_DISTANCE, they will simply make the table such that the max SLIT value is below that. And yes, I've seen this :-( [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-06 13:25 ` Peter Zijlstra @ 2014-05-06 20:20 ` Rik van Riel 2014-05-06 20:41 ` Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2014-05-06 20:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy On 05/06/2014 09:25 AM, Peter Zijlstra wrote: > On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: >> Even on 8-node DL980 systems, the NUMA distance in the >> SLIT table is less than RECLAIM_DISTANCE, and we will >> do wake_affine across the entire system. > > Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a > metric for the SLIT distance. This (in as far as BIOS people would care > to stick to specs anyhow) has lead to the 'fun' situation where BIOS > engineers tweak SLIT table values to make OSes behave as they thing it > should. > > So if the BIOS engineer finds that this system should have < > RECLAIM_DISTANCE it will simply make the table such that the max SLIT > value is below that. > > And yes, I've seen this :-( It appears to be the case on the vast majority of the NUMA systems that are actually in use. To me, this suggests that we should probably deal with it. -- All rights reversed ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-06 20:20 ` Rik van Riel @ 2014-05-06 20:41 ` Peter Zijlstra 2014-05-07 12:17 ` Ingo Molnar 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 20:41 UTC (permalink / raw) To: Rik van Riel Cc: Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy On Tue, May 06, 2014 at 04:20:59PM -0400, Rik van Riel wrote: > On 05/06/2014 09:25 AM, Peter Zijlstra wrote: > > On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: > >> Even on 8-node DL980 systems, the NUMA distance in the > >> SLIT table is less than RECLAIM_DISTANCE, and we will > >> do wake_affine across the entire system. > > > > Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a > > metric for the SLIT distance. This (in as far as BIOS people would care > > to stick to specs anyhow) has lead to the 'fun' situation where BIOS > > engineers tweak SLIT table values to make OSes behave as they thing it > > should. > > > > So if the BIOS engineer finds that this system should have < > > RECLAIM_DISTANCE it will simply make the table such that the max SLIT > > value is below that. > > > > And yes, I've seen this :-( > > It appears to be the case on the vast majority of the NUMA > systems that are actually in use. > > To me, this suggests that we should probably deal with it. What we could do is redefine this distance in hops; that'll force them to lie more blatantly and actually misrepresent the topology. ^ permalink raw reply [flat|nested] 46+ messages in thread
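To illustrate what "distance in hops" could mean in practice, here is a small stand-alone sketch (the SLIT values are invented, for a hypothetical 4-node ring): node pairs at the minimum remote distance are treated as adjacent, and hop counts fall out of an all-pairs shortest path. A firmware table can claim any distance values it likes, but hop counts derived from adjacency can only be fudged by misrepresenting which nodes are directly linked, which is Peter's point.

#include <stdio.h>

#define N 4

int main(void)
{
	/* Invented SLIT for a hypothetical 4-node ring (0-1-2-3-0). */
	int slit[N][N] = {
		{10, 20, 30, 20},
		{20, 10, 20, 30},
		{30, 20, 10, 20},
		{20, 30, 20, 10},
	};
	int hops[N][N];
	int min_remote = 255;

	/* The smallest off-diagonal distance defines "adjacent". */
	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			if (i != j && slit[i][j] < min_remote)
				min_remote = slit[i][j];

	/* One hop between adjacent nodes, "infinity" (N) otherwise. */
	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			hops[i][j] = i == j ? 0 :
				     slit[i][j] == min_remote ? 1 : N;

	/* Floyd-Warshall all-pairs shortest path over hop counts. */
	for (int k = 0; k < N; k++)
		for (int i = 0; i < N; i++)
			for (int j = 0; j < N; j++)
				if (hops[i][k] + hops[k][j] < hops[i][j])
					hops[i][j] = hops[i][k] + hops[k][j];

	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			printf("%d%c", hops[i][j], j == N - 1 ? '\n' : ' ');
	return 0;
}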
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-06 20:41 ` Peter Zijlstra @ 2014-05-07 12:17 ` Ingo Molnar 0 siblings, 0 replies; 46+ messages in thread From: Ingo Molnar @ 2014-05-07 12:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Rik van Riel, Preeti Murthy, umgwanakikbuti, LKML, Morten Rasmussen, george.mccollister, ktkhai, Preeti U Murthy * Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 06, 2014 at 04:20:59PM -0400, Rik van Riel wrote: > > On 05/06/2014 09:25 AM, Peter Zijlstra wrote: > > > On Sun, May 04, 2014 at 08:41:09AM -0400, Rik van Riel wrote: > > >> Even on 8-node DL980 systems, the NUMA distance in the > > >> SLIT table is less than RECLAIM_DISTANCE, and we will > > >> do wake_affine across the entire system. > > > > > > Yeah, so the problem is that (AFAIK) ACPI doesn't actually specify a > > > metric for the SLIT distance. This (in as far as BIOS people would care > > > to stick to specs anyhow) has lead to the 'fun' situation where BIOS > > > engineers tweak SLIT table values to make OSes behave as they thing it > > > should. > > > > > > So if the BIOS engineer finds that this system should have < > > > RECLAIM_DISTANCE it will simply make the table such that the max SLIT > > > value is below that. > > > > > > And yes, I've seen this :-( > > > > It appears to be the case on the vast majority of the NUMA systems > > that are actually in use. > > > > To me, this suggests that we should probably deal with it. > > What we could do is redefine this distance in hops, that'll force > them to lie more blatantly and actually miss represent the topology. and we should make sure we reduce any graph they represent, so that they can lie only through very heavy misrepresentation of the topology (i.e. not just weight tweaks) which will bite them in other areas (like the mm). Thanks, Ingo ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH RFC/TEST] sched: make sync affine wakeups work 2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy 2014-05-04 12:04 ` Mike Galbraith 2014-05-04 12:41 ` Rik van Riel @ 2014-05-06 11:56 ` Peter Zijlstra 2 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2014-05-06 11:56 UTC (permalink / raw) To: Preeti Murthy Cc: Rik van Riel, umgwanakikbuti, LKML, Morten Rasmussen, Ingo Molnar, george.mccollister, ktkhai, Preeti U Murthy [-- Attachment #1: Type: text/plain, Size: 620 bytes --] On Sun, May 04, 2014 at 05:14:59PM +0530, Preeti Murthy wrote: > As far as my understanding goes, the logic in select_task_rq_fair() > does wake_affine() or calls select_idle_sibling() only at those > levels of sched domains where the flag SD_WAKE_AFFINE is set. > This flag is not set at the numa domain and hence they will not be > balancing across numa nodes. So I don't understand how > *these functions* are affecting NUMA placements. It _is_ set at NUMA domains, just not on those > RECLAIM_DISTANCE. That means you typically need a non-fully connected system and then the top numa tier will not get wake affine. [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
Thread overview: 46+ messages
2014-05-02 4:42 [PATCH RFC/TEST] sched: make sync affine wakeups work Rik van Riel
2014-05-02 5:32 ` Mike Galbraith
2014-05-02 5:41 ` Mike Galbraith
2014-05-02 5:58 ` Mike Galbraith
2014-05-02 6:08 ` Rik van Riel
2014-05-02 6:36 ` Mike Galbraith
2014-05-02 6:51 ` Mike Galbraith
2014-05-02 6:13 ` Mike Galbraith
2014-05-02 6:30 ` Rik van Riel
2014-05-02 7:37 ` Mike Galbraith
2014-05-02 10:56 ` Rik van Riel
2014-05-02 11:27 ` Mike Galbraith
2014-05-02 12:51 ` Mike Galbraith
[not found] ` <5363B793.9010208@redhat.com>
2014-05-06 11:54 ` Peter Zijlstra
2014-05-06 20:19 ` Rik van Riel
2014-05-06 20:39 ` Peter Zijlstra
2014-05-06 23:46 ` Rik van Riel
2014-05-09 2:20 ` Rik van Riel
2014-05-09 5:27 ` [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu Rik van Riel
2014-05-09 6:04 ` [PATCH] sched: clean up select_task_rq_fair conditionals and indentation Rik van Riel
2014-05-09 7:34 ` [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu Mike Galbraith
2014-05-09 14:22 ` Rik van Riel
2014-05-09 15:24 ` Mike Galbraith
2014-05-09 15:24 ` Rik van Riel
2014-05-09 17:55 ` Mike Galbraith
2014-05-09 18:16 ` Rik van Riel
2014-05-10 3:54 ` Mike Galbraith
2014-05-13 14:08 ` Rik van Riel
2014-05-14 4:08 ` Mike Galbraith
2014-05-14 15:40 ` [PATCH] sched: call select_idle_sibling when not affine_sd Rik van Riel
2014-05-14 15:45 ` Peter Zijlstra
2014-05-19 13:08 ` [tip:sched/core] " tip-bot for Rik van Riel
2014-05-22 12:27 ` [tip:sched/core] sched: Call select_idle_sibling() " tip-bot for Rik van Riel
2014-05-04 11:44 ` [PATCH RFC/TEST] sched: make sync affine wakeups work Preeti Murthy
2014-05-04 12:04 ` Mike Galbraith
2014-05-05 4:38 ` Preeti U Murthy
2014-05-04 12:41 ` Rik van Riel
2014-05-05 4:50 ` Preeti U Murthy
2014-05-05 6:43 ` Preeti U Murthy
2014-05-05 11:28 ` Rik van Riel
2014-05-06 13:26 ` Peter Zijlstra
2014-05-06 13:25 ` Peter Zijlstra
2014-05-06 20:20 ` Rik van Riel
2014-05-06 20:41 ` Peter Zijlstra
2014-05-07 12:17 ` Ingo Molnar
2014-05-06 11:56 ` Peter Zijlstra