public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
@ 2013-09-25  7:53 Peter Zijlstra
  2013-09-25  8:56 ` Mike Galbraith
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-09-25  7:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mike Galbraith, Paul Turner, Rik van Riel, linux-kernel

Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Sep 25 08:28:39 CEST 2013

When a task is the only running task and does a sync wakeup, avoid
going through select_idle_sibling(), as it doesn't know the current CPU
is going to be idle shortly.

Without this, two sync wakers will ping-pong between CPUs for no
reason.

Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3461,6 +3461,16 @@ select_task_rq_fair(struct task_struct *
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
+		/*
+		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
+		 * where we know the only running task will soon go away. Going
+		 * through select_idle_sibling will only lead to pointless ping-pong.
+		 */
+		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1) {
+			new_cpu = cpu;
+			goto unlock;
+		}
+
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-25  7:53 [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true) Peter Zijlstra
@ 2013-09-25  8:56 ` Mike Galbraith
  2013-09-26  2:50   ` Michael wang
  2013-09-26  9:58   ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Mike Galbraith @ 2013-09-25  8:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Paul Turner, Rik van Riel, linux-kernel

On Wed, 2013-09-25 at 09:53 +0200, Peter Zijlstra wrote: 
> Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Sep 25 08:28:39 CEST 2013
> 
> When a task is the only running task and does a sync wakeup; avoid
> going through select_idle_sibling() as it doesn't know the current CPU
> is going to be idle shortly.
> 
> Without this two sync wakers will ping-pong between CPUs for no
> reason.

That will make pipe-test go fugly -> pretty, and help very fast/light
localhost network, but eat heavier localhost overlap recovery.  We need
a working (and cheap) overlap detector scheme, so we can know when there
is enough to be worth going after.

(I sent you some lmbench numbers offline a while back showing the
two-faced little <b-word> in action, doing both good and evil)
> Suggested-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3461,6 +3461,16 @@ select_task_rq_fair(struct task_struct *
>  		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>  			prev_cpu = cpu;
>  
> +		/*
> +		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
> +		 * where we know the only running task will soon go away. Going
> +		 * through select_idle_sibling will only lead to pointless ping-pong.
> +		 */
> +		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1) {
> +			new_cpu = cpu;
> +			goto unlock;
> +		}
> +
>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}




* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-25  8:56 ` Mike Galbraith
@ 2013-09-26  2:50   ` Michael wang
  2013-09-26  3:41     ` Mike Galbraith
  2013-09-26  9:58   ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Michael wang @ 2013-09-26  2:50 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra
  Cc: Ingo Molnar, Paul Turner, Rik van Riel, linux-kernel

On 09/25/2013 04:56 PM, Mike Galbraith wrote:
> On Wed, 2013-09-25 at 09:53 +0200, Peter Zijlstra wrote: 
>> Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
>> From: Peter Zijlstra <peterz@infradead.org>
>> Date: Wed Sep 25 08:28:39 CEST 2013
>>
>> When a task is the only running task and does a sync wakeup; avoid
>> going through select_idle_sibling() as it doesn't know the current CPU
>> is going to be idle shortly.
>>
>> Without this two sync wakers will ping-pong between CPUs for no
>> reason.
> 
> That will make pipe-test go fugly -> pretty, and help very fast/light
> localhost network, but eat heavier localhost overlap recovery.  We need
> a working (and cheap) overlap detector scheme, so we can know when there
> is enough to be worth going after.
> 
> (I sent you some lmbench numbers offline a while back showing the
> two-faced little <b-word> in action, doing both good and evil)

It seems like a choice between the overhead and a small chance to
balance the load :)

Like the case when we have:

	core0 sg		core1 sg
	cpu0	cpu1		cpu2	cpu3
	waker	busy		idle	idle

If the sync wakeup was on cpu0, we can:

1. choose a cpu in the core1 sg like we usually do
   some overhead, but tends to make the load a little more balanced
	core0 sg		core1 sg
	cpu0	cpu1		cpu2	cpu3
	idle	busy		wakee	idle

2. choose cpu0 like the patch proposed
   no overhead, but tends to make the load a little more unbalanced
	core0 sg		core1 sg
	cpu0	cpu1		cpu2	cpu3
	wakee	busy		idle	idle

Maybe we should add a higher-scope load balance check in wake_affine(),
but that means higher overhead, which is just what the patch wants to
reduce...

What about some discount for the sync case inside select_idle_sibling()?
For example, we could consider the sync cpu as idle and prefer it over the others?

Regards,
Michael Wang


>> Suggested-by: Rik van Riel <riel@redhat.com>
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> ---
>>  kernel/sched/fair.c |   10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3461,6 +3461,16 @@ select_task_rq_fair(struct task_struct *
>>  		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>>  			prev_cpu = cpu;
>>  
>> +		/*
>> +		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
>> +		 * where we know the only running task will soon go away. Going
>> +		 * through select_idle_sibling will only lead to pointless ping-pong.
>> +		 */
>> +		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1) {
>> +			new_cpu = cpu;
>> +			goto unlock;
>> +		}
>> +
>>  		new_cpu = select_idle_sibling(p, prev_cpu);
>>  		goto unlock;
>>  	}
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  2:50   ` Michael wang
@ 2013-09-26  3:41     ` Mike Galbraith
  2013-09-26  5:12       ` Michael wang
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Galbraith @ 2013-09-26  3:41 UTC (permalink / raw)
  To: Michael wang
  Cc: Peter Zijlstra, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On Thu, 2013-09-26 at 10:50 +0800, Michael wang wrote: 
> On 09/25/2013 04:56 PM, Mike Galbraith wrote:
> > On Wed, 2013-09-25 at 09:53 +0200, Peter Zijlstra wrote: 
> >> Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
> >> From: Peter Zijlstra <peterz@infradead.org>
> >> Date: Wed Sep 25 08:28:39 CEST 2013
> >>
> >> When a task is the only running task and does a sync wakeup; avoid
> >> going through select_idle_sibling() as it doesn't know the current CPU
> >> is going to be idle shortly.
> >>
> >> Without this two sync wakers will ping-pong between CPUs for no
> >> reason.
> > 
> > That will make pipe-test go fugly -> pretty, and help very fast/light
> > localhost network, but eat heavier localhost overlap recovery.  We need
> > a working (and cheap) overlap detector scheme, so we can know when there
> > is enough to be worth going after.
> > 
> > (I sent you some lmbench numbers offline a while back showing the
> > two-faced little <b-word> in action, doing both good and evil)
> 
> It seems like the choice between the overhead and a little possibility
> to balance the load :)
> 
> Like the case when we have:
> 
> 	core0 sg		core1 sg
> 	cpu0	cpu1		cpu2	cpu3
> 	waker	busy		idle	idle
> 
> If the sync wakeup was on cpu0, we can:
> 
> 1. choose cpu in core1 sg like we did usually
>    some overhead but tend to make the load a little balance
> 	core0 sg		core1 sg
> 	cpu0	cpu1		cpu2	cpu3
> 	idle	busy		wakee	idle

Reducing latency and increasing throughput when the waker isn't really
really going to immediately schedule off as the hint implies.  Nice for
bursty loads and ramp.

The breakeven point is going up though.  If you don't have nohz
throttled, you eat tick start/stop overhead, and the menu governor
recently added yet more overhead, so maybe we should say hell with it.

> 2. choose cpu0 like the patch proposed
>    no overhead but tend to make the load a little more unbalance
> 	core0 sg		core1 sg
> 	cpu0	cpu1		cpu2	cpu3
> 	wakee	busy		idle	idle
> 
> May be we should add a higher scope load balance check in wake_affine(),
> but that means higher overhead which is just what the patch want to
> reduce...

Yeah, more overhead is the last thing we need.

> What about some discount for sync case inside select_idle_sibling()?
> For example we consider sync cpu as idle and prefer it more than the others?

That's what the sync hint does.  Problem is, it's a hint.  If it were
truth, there would be no point in calling select_idle_sibling().

-Mike



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  3:41     ` Mike Galbraith
@ 2013-09-26  5:12       ` Michael wang
  2013-09-26  5:34         ` Mike Galbraith
  0 siblings, 1 reply; 21+ messages in thread
From: Michael wang @ 2013-09-26  5:12 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On 09/26/2013 11:41 AM, Mike Galbraith wrote:
[snip]
>> Like the case when we have:
>>
>> 	core0 sg		core1 sg
>> 	cpu0	cpu1		cpu2	cpu3
>> 	waker	busy		idle	idle
>>
>> If the sync wakeup was on cpu0, we can:
>>
>> 1. choose cpu in core1 sg like we did usually
>>    some overhead but tend to make the load a little balance
>> 	core0 sg		core1 sg
>> 	cpu0	cpu1		cpu2	cpu3
>> 	idle	busy		wakee	idle
> 
> Reducing latency and increasing throughput when the waker isn't really
> really going to immediately schedule off as the hint implies.  Nice for
> bursty loads and ramp.
> 
> The breakeven point is going up though.  If you don't have nohz
> throttled, you eat tick start/stop overhead, and the menu governor
> recently added yet more overhead, so maybe we should say hell with it.

Exactly, there are more and more factors to be considered; we say things
get balanced, but actually it's not the best choice...

> 
>> 2. choose cpu0 like the patch proposed
>>    no overhead but tend to make the load a little more unbalance
>> 	core0 sg		core1 sg
>> 	cpu0	cpu1		cpu2	cpu3
>> 	wakee	busy		idle	idle
>>
>> May be we should add a higher scope load balance check in wake_affine(),
>> but that means higher overhead which is just what the patch want to
>> reduce...
> 
> Yeah, more overhead is the last thing we need.
> 
>> What about some discount for sync case inside select_idle_sibling()?
>> For example we consider sync cpu as idle and prefer it more than the others?
> 
> That's what the sync hint does.  Problem is, it's a hint.  If it were
> truth, there would be no point in calling select_idle_sibling().

Just wondering: if the hint is wrong most of the time, why don't we
remove it...

Otherwise I think we can still utilize it to make decisions that tend
to be correct, can't we?

Regards,
Michael Wang

> 
> -Mike
> 



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  5:12       ` Michael wang
@ 2013-09-26  5:34         ` Mike Galbraith
  2013-09-26  6:15           ` Mike Galbraith
  2013-09-26  6:32           ` Michael wang
  0 siblings, 2 replies; 21+ messages in thread
From: Mike Galbraith @ 2013-09-26  5:34 UTC (permalink / raw)
  To: Michael wang
  Cc: Peter Zijlstra, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote: 
> On 09/26/2013 11:41 AM, Mike Galbraith wrote:
> [snip]
> >> Like the case when we have:
> >>
> >> 	core0 sg		core1 sg
> >> 	cpu0	cpu1		cpu2	cpu3
> >> 	waker	busy		idle	idle
> >>
> >> If the sync wakeup was on cpu0, we can:
> >>
> >> 1. choose cpu in core1 sg like we did usually
> >>    some overhead but tend to make the load a little balance
> >> 	core0 sg		core1 sg
> >> 	cpu0	cpu1		cpu2	cpu3
> >> 	idle	busy		wakee	idle
> > 
> > Reducing latency and increasing throughput when the waker isn't really
> > really going to immediately schedule off as the hint implies.  Nice for
> > bursty loads and ramp.
> > 
> > The breakeven point is going up though.  If you don't have nohz
> > throttled, you eat tick start/stop overhead, and the menu governor
> > recently added yet more overhead, so maybe we should say hell with it.
> 
> Exactly, more and more factors to be considered, we say things get
> balanced but actually it's not the best choice...
> 
> > 
> >> 2. choose cpu0 like the patch proposed
> >>    no overhead but tend to make the load a little more unbalance
> >> 	core0 sg		core1 sg
> >> 	cpu0	cpu1		cpu2	cpu3
> >> 	wakee	busy		idle	idle
> >>
> >> May be we should add a higher scope load balance check in wake_affine(),
> >> but that means higher overhead which is just what the patch want to
> >> reduce...
> > 
> > Yeah, more overhead is the last thing we need.
> > 
> >> What about some discount for sync case inside select_idle_sibling()?
> >> For example we consider sync cpu as idle and prefer it more than the others?
> > 
> > That's what the sync hint does.  Problem is, it's a hint.  If it were
> > truth, there would be no point in calling select_idle_sibling().
> 
> Just wondering if the hint was wrong in most of the time, then why don't
> we remove it...

For very fast/light network ping-pong micro-benchmarks, it is right.
For pipe-test, it's absolutely right, jabbering parties are 100%
synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
but in the real world, it ain't necessarily so.

> Otherwise I think we can still utilize it to make some decision tends to
> be correct, don't we?

Sometimes :)

-Mike



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  5:34         ` Mike Galbraith
@ 2013-09-26  6:15           ` Mike Galbraith
  2013-09-26  6:32           ` Michael wang
  1 sibling, 0 replies; 21+ messages in thread
From: Mike Galbraith @ 2013-09-26  6:15 UTC (permalink / raw)
  To: Michael wang
  Cc: Peter Zijlstra, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On Thu, 2013-09-26 at 07:34 +0200, Mike Galbraith wrote: 
> On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote: 
> > On 09/26/2013 11:41 AM, Mike Galbraith wrote:
> > [snip]
> > >> Like the case when we have:
> > >>
> > >> 	core0 sg		core1 sg
> > >> 	cpu0	cpu1		cpu2	cpu3
> > >> 	waker	busy		idle	idle
> > >>
> > >> If the sync wakeup was on cpu0, we can:
> > >>
> > >> 1. choose cpu in core1 sg like we did usually
> > >>    some overhead but tend to make the load a little balance
> > >> 	core0 sg		core1 sg
> > >> 	cpu0	cpu1		cpu2	cpu3
> > >> 	idle	busy		wakee	idle
> > > 
> > > Reducing latency and increasing throughput when the waker isn't really
> > > really going to immediately schedule off as the hint implies.  Nice for
> > > bursty loads and ramp.
> > > 
> > > The breakeven point is going up though.  If you don't have nohz
> > > throttled, you eat tick start/stop overhead, and the menu governor
> > > recently added yet more overhead, so maybe we should say hell with it.
> > 
> > Exactly, more and more factors to be considered, we say things get
> > balanced but actually it's not the best choice...
> > 
> > > 
> > >> 2. choose cpu0 like the patch proposed
> > >>    no overhead but tend to make the load a little more unbalance
> > >> 	core0 sg		core1 sg
> > >> 	cpu0	cpu1		cpu2	cpu3
> > >> 	wakee	busy		idle	idle
> > >>
> > >> May be we should add a higher scope load balance check in wake_affine(),
> > >> but that means higher overhead which is just what the patch want to
> > >> reduce...
> > > 
> > > Yeah, more overhead is the last thing we need.
> > > 
> > >> What about some discount for sync case inside select_idle_sibling()?
> > >> For example we consider sync cpu as idle and prefer it more than the others?
> > > 
> > > That's what the sync hint does.  Problem is, it's a hint.  If it were
> > > truth, there would be no point in calling select_idle_sibling().
> > 
> > Just wondering if the hint was wrong in most of the time, then why don't
> > we remove it...
> 
> For very fast/light network ping-pong micro-benchmarks, it is right.
> For pipe-test, it's absolutely right, jabbering parties are 100%
> synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
> but in the real world, it ain't necessarily so.
> 
> > Otherwise I think we can still utilize it to make some decision tends to
> > be correct, don't we?
> 
> Sometimes :)

P.S.  while we're slapping select_idle_sibling()'s _evil_ face, let's
give it a pat on the head too.  It showed regressions in bright red.
Put pipe-test on one core, you only see scheduler weight.. but entering
and exiting idle is part of the fast path, whether you're exercising it
by doing something silly or not.

-Mike



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  5:34         ` Mike Galbraith
  2013-09-26  6:15           ` Mike Galbraith
@ 2013-09-26  6:32           ` Michael wang
  2013-09-26  7:09             ` Mike Galbraith
  1 sibling, 1 reply; 21+ messages in thread
From: Michael wang @ 2013-09-26  6:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On 09/26/2013 01:34 PM, Mike Galbraith wrote:
> On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote: 
>> On 09/26/2013 11:41 AM, Mike Galbraith wrote:
>> [snip]
>>>> Like the case when we have:
>>>>
>>>> 	core0 sg		core1 sg
>>>> 	cpu0	cpu1		cpu2	cpu3
>>>> 	waker	busy		idle	idle
>>>>
>>>> If the sync wakeup was on cpu0, we can:
>>>>
>>>> 1. choose cpu in core1 sg like we did usually
>>>>    some overhead but tend to make the load a little balance
>>>> 	core0 sg		core1 sg
>>>> 	cpu0	cpu1		cpu2	cpu3
>>>> 	idle	busy		wakee	idle
>>>
>>> Reducing latency and increasing throughput when the waker isn't really
>>> really going to immediately schedule off as the hint implies.  Nice for
>>> bursty loads and ramp.
>>>
>>> The breakeven point is going up though.  If you don't have nohz
>>> throttled, you eat tick start/stop overhead, and the menu governor
>>> recently added yet more overhead, so maybe we should say hell with it.
>>
>> Exactly, more and more factors to be considered, we say things get
>> balanced but actually it's not the best choice...
>>
>>>
>>>> 2. choose cpu0 like the patch proposed
>>>>    no overhead but tend to make the load a little more unbalance
>>>> 	core0 sg		core1 sg
>>>> 	cpu0	cpu1		cpu2	cpu3
>>>> 	wakee	busy		idle	idle
>>>>
>>>> May be we should add a higher scope load balance check in wake_affine(),
>>>> but that means higher overhead which is just what the patch want to
>>>> reduce...
>>>
>>> Yeah, more overhead is the last thing we need.
>>>
>>>> What about some discount for sync case inside select_idle_sibling()?
>>>> For example we consider sync cpu as idle and prefer it more than the others?
>>>
>>> That's what the sync hint does.  Problem is, it's a hint.  If it were
>>> truth, there would be no point in calling select_idle_sibling().
>>
>> Just wondering if the hint was wrong in most of the time, then why don't
>> we remove it...
> 
> For very fast/light network ping-pong micro-benchmarks, it is right.
> For pipe-test, it's absolutely right, jabbering parties are 100%
> synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
> but in the real world, it ain't necessarily so.
> 
>> Otherwise I think we can still utilize it to make some decision tends to
>> be correct, don't we?
> 
> Sometimes :)

Ok, a double-edged sword I see :)

Maybe we can wield it carefully here, giving the discount to a bigger
scope rather than just the sync cpu, for example:

	sg1				sg2
	cpu0	cpu1	cpu2	cpu3	cpu4	cpu5	cpu6	cpu7
	waker	idle	idle	idle	idle	idle	idle	idle

If it's a sync wakeup on cpu0 (the only waker), and the sg is wide enough,
which means one cpu is not so influential, then supposing cpu0 to be idle
should be safer, and preferring sg1 over sg2 is more likely to be right.

And we can still choose an idle cpu at the final step, like cpu1 in this
case, to avoid the risk that the waker doesn't schedule off as the hint
said.

The key point is to reduce the influence of sync: trust it a little, but
not totally ;-)

Regards,
Michael Wang

> 
> -Mike
> 



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  6:32           ` Michael wang
@ 2013-09-26  7:09             ` Mike Galbraith
  2013-09-26  7:26               ` Michael wang
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Galbraith @ 2013-09-26  7:09 UTC (permalink / raw)
  To: Michael wang
  Cc: Peter Zijlstra, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On Thu, 2013-09-26 at 14:32 +0800, Michael wang wrote: 
> On 09/26/2013 01:34 PM, Mike Galbraith wrote:
> > On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote: 
> >> On 09/26/2013 11:41 AM, Mike Galbraith wrote:
> >> [snip]
> >>>> Like the case when we have:
> >>>>
> >>>> 	core0 sg		core1 sg
> >>>> 	cpu0	cpu1		cpu2	cpu3
> >>>> 	waker	busy		idle	idle
> >>>>
> >>>> If the sync wakeup was on cpu0, we can:
> >>>>
> >>>> 1. choose cpu in core1 sg like we did usually
> >>>>    some overhead but tend to make the load a little balance
> >>>> 	core0 sg		core1 sg
> >>>> 	cpu0	cpu1		cpu2	cpu3
> >>>> 	idle	busy		wakee	idle
> >>>
> >>> Reducing latency and increasing throughput when the waker isn't really
> >>> really going to immediately schedule off as the hint implies.  Nice for
> >>> bursty loads and ramp.
> >>>
> >>> The breakeven point is going up though.  If you don't have nohz
> >>> throttled, you eat tick start/stop overhead, and the menu governor
> >>> recently added yet more overhead, so maybe we should say hell with it.
> >>
> >> Exactly, more and more factors to be considered, we say things get
> >> balanced but actually it's not the best choice...
> >>
> >>>
> >>>> 2. choose cpu0 like the patch proposed
> >>>>    no overhead but tend to make the load a little more unbalance
> >>>> 	core0 sg		core1 sg
> >>>> 	cpu0	cpu1		cpu2	cpu3
> >>>> 	wakee	busy		idle	idle
> >>>>
> >>>> May be we should add a higher scope load balance check in wake_affine(),
> >>>> but that means higher overhead which is just what the patch want to
> >>>> reduce...
> >>>
> >>> Yeah, more overhead is the last thing we need.
> >>>
> >>>> What about some discount for sync case inside select_idle_sibling()?
> >>>> For example we consider sync cpu as idle and prefer it more than the others?
> >>>
> >>> That's what the sync hint does.  Problem is, it's a hint.  If it were
> >>> truth, there would be no point in calling select_idle_sibling().
> >>
> >> Just wondering if the hint was wrong in most of the time, then why don't
> >> we remove it...
> > 
> > For very fast/light network ping-pong micro-benchmarks, it is right.
> > For pipe-test, it's absolutely right, jabbering parties are 100%
> > synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
> > but in the real world, it ain't necessarily so.
> > 
> >> Otherwise I think we can still utilize it to make some decision tends to
> >> be correct, don't we?
> > 
> > Sometimes :)
> 
> Ok, a double-edged sword I see :)
> 
> May be we can wave it carefully here, give the discount to a bigger
> scope not the sync cpu, for example:
> 
> 	sg1				sg2
> 	cpu0	cpu1	cpu2	cpu3	cpu4	cpu5	cpu6	cpu7
> 	waker	idle	idle	idle	idle	idle	idle	idle
> 
> If it's sync wakeup on cpu0 (only waker), and the sg is wide enough,
> which means one cpu is not so influencial, then suppose cpu0 to be idle
> could be more safe, also prefer sg1 than sg2 is more likely to be right.
> 
> And we can still choose idle-cpu at final step, like cpu1 in this case,
> to avoid the risk that waker don't get off as it said.
> 
> The key point is to reduce the influence of sync, trust a little but not
> totally ;-)

What we need is a dirt cheap way to fairly accurately predict overlap
potential (todo: write omniscience().. patent, buy planet).

-Mike



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  7:09             ` Mike Galbraith
@ 2013-09-26  7:26               ` Michael wang
  0 siblings, 0 replies; 21+ messages in thread
From: Michael wang @ 2013-09-26  7:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On 09/26/2013 03:09 PM, Mike Galbraith wrote:
[snip]
>>
>> Ok, a double-edged sword I see :)
>>
>> May be we can wave it carefully here, give the discount to a bigger
>> scope not the sync cpu, for example:
>>
>> 	sg1				sg2
>> 	cpu0	cpu1	cpu2	cpu3	cpu4	cpu5	cpu6	cpu7
>> 	waker	idle	idle	idle	idle	idle	idle	idle
>>
>> If it's sync wakeup on cpu0 (only waker), and the sg is wide enough,
>> which means one cpu is not so influencial, then suppose cpu0 to be idle
>> could be more safe, also prefer sg1 than sg2 is more likely to be right.
>>
>> And we can still choose idle-cpu at final step, like cpu1 in this case,
>> to avoid the risk that waker don't get off as it said.
>>
>> The key point is to reduce the influence of sync, trust a little but not
>> totally ;-)
> 
> What we need is a dirt cheap way to fairly accurately predict overlap
> potential (todo: write omniscience().. patent, buy planet).

Agreed, solutions for such cases are usually incredible ;-)

Regards,
Michael Wang

> 
> -Mike
> 



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-25  8:56 ` Mike Galbraith
  2013-09-26  2:50   ` Michael wang
@ 2013-09-26  9:58   ` Peter Zijlstra
  2013-09-26 10:05     ` Peter Zijlstra
                       ` (3 more replies)
  1 sibling, 4 replies; 21+ messages in thread
From: Peter Zijlstra @ 2013-09-26  9:58 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Paul Turner, Rik van Riel, linux-kernel

On Wed, Sep 25, 2013 at 10:56:17AM +0200, Mike Galbraith wrote:
> That will make pipe-test go fugly -> pretty, and help very fast/light
> localhost network, but eat heavier localhost overlap recovery.  We need
> a working (and cheap) overlap detector scheme, so we can know when there
> is enough to be worth going after.

We used to have an overlap detector thing.. It went away though.

But see if you can make something like the below work?

You could make it a general overlap thing and try it without the sync
too, I suppose..

---
 include/linux/sched.h |  3 +++
 kernel/sched/fair.c   | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b5344de..5428016 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -974,6 +974,9 @@ struct sched_entity {
 	u64			vruntime;
 	u64			prev_sum_exec_runtime;
 
+	u64			last_sync_wakeup;
+	u64			avg_overlap;
+
 	u64			nr_migrations;
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b89cd2..47b0d0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
 
+	if (se->last_sync_wakeup) {
+		u64 overlap;
+		s64 diff;
+	       
+		overlap = rq->clock - se->last_sync_wakeup;
+		se->last_sync_wakeup = 0;
+
+		diff = overlap - se->avg_overlap;
+		se->avg_overlap += diff >> 8;
+	}
+
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
@@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
+	if (sync)
+		p->se.last_sync_wakeup = sched_clock_cpu(cpu);
+
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
@@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
+		/*
+		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
+		 * where we know the only running task will soon go away. Going
+		 * through select_idle_sibling will only lead to pointless ping-pong.
+		 */
+		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
+		    current->se.avg_overlap < 10000) {
+			new_cpu = cpu;
+			goto unlock;
+		}
+
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}


* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  9:58   ` Peter Zijlstra
@ 2013-09-26 10:05     ` Peter Zijlstra
  2013-09-26 10:55     ` Paul Turner
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2013-09-26 10:05 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Paul Turner, Rik van Riel, linux-kernel

On Thu, Sep 26, 2013 at 11:58:12AM +0200, Peter Zijlstra wrote:
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> +	if (sync)
> +		p->se.last_sync_wakeup = sched_clock_cpu(cpu);
> +
>  	if (p->nr_cpus_allowed == 1)
>  		return prev_cpu;
>  

Oh, I suppose something like:

 if (sync && !p->se.last_sync_wakeup)
 	p->se.last_sync_wakeup = sched_clock_cpu(cpu);

is also a nice variation to try..


* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  9:58   ` Peter Zijlstra
  2013-09-26 10:05     ` Peter Zijlstra
@ 2013-09-26 10:55     ` Paul Turner
  2013-09-26 11:16       ` Peter Zijlstra
  2013-09-26 13:46     ` Mike Galbraith
  2013-09-26 15:09     ` Michael wang
  3 siblings, 1 reply; 21+ messages in thread
From: Paul Turner @ 2013-09-26 10:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Ingo Molnar, Rik van Riel, LKML

On Thu, Sep 26, 2013 at 2:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Sep 25, 2013 at 10:56:17AM +0200, Mike Galbraith wrote:
>> That will make pipe-test go fugly -> pretty, and help very fast/light
>> localhost network, but eat heavier localhost overlap recovery.  We need
>> a working (and cheap) overlap detector scheme, so we can know when there
>> is enough to be worth going after.
>
> We used to have an overlap detection thing.. It went away though.
>
> But see if you can make something like the below work?
>
> You could make it a general overlap thing and try without the sync too I
> suppose..
>
> ---
>  include/linux/sched.h |  3 +++
>  kernel/sched/fair.c   | 25 +++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b5344de..5428016 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -974,6 +974,9 @@ struct sched_entity {
>         u64                     vruntime;
>         u64                     prev_sum_exec_runtime;
>
> +       u64                     last_sync_wakeup;
> +       u64                     avg_overlap;
> +
>         u64                     nr_migrations;
>
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2b89cd2..47b0d0f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>         struct sched_entity *se = &p->se;
>         int task_sleep = flags & DEQUEUE_SLEEP;
>
> +       if (se->last_sync_wakeup) {
> +               u64 overlap;
> +               s64 diff;
> +
> +               overlap = rq->clock - se->last_sync_wakeup;
> +               se->last_sync_wakeup = 0;
> +
> +               diff = overlap - se->avg_overlap;
> +               se->avg_overlap += diff >> 8;
> +       }
> +
>         for_each_sched_entity(se) {
>                 cfs_rq = cfs_rq_of(se);
>                 dequeue_entity(cfs_rq, se, flags);
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>         int want_affine = 0;
>         int sync = wake_flags & WF_SYNC;
>
> +       if (sync)
> +               p->se.last_sync_wakeup = sched_clock_cpu(cpu);
> +
>         if (p->nr_cpus_allowed == 1)
>                 return prev_cpu;
>
> @@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>                 if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>                         prev_cpu = cpu;
>
> +               /*
> +                * Don't bother with select_idle_sibling() in the case of a sync wakeup
> +                * where we know the only running task will soon go away. Going
> +                * through select_idle_sibling will only lead to pointless ping-pong.
> +                */
> +               if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&

I've long thought of trying something like this.

I like the intent but I'd go a step further in that I think we want to
also implicitly extract WF_SYNC itself.  While pipe_test is a good
microbenchmark it's not entirely representative since, in reality,
overlap is most usefully applied to threads in the same process -- and
they rarely communicate using a pipe.

What we really then care about is predicting the overlap associated
with userspace synchronization objects, typically built on top of
futexes.  Unfortunately the existence/use of per-thread futexes
reduces how much state you could usefully associate with the futex.
One approach might be to hash (with some small saturating counter)
against rip.  But this gets more complicated quite quickly.

> +                   current->se.avg_overlap < 10000) {
> +                       new_cpu = cpu;
> +                       goto unlock;
> +               }
> +
>                 new_cpu = select_idle_sibling(p, prev_cpu);
>                 goto unlock;
>         }


* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26 10:55     ` Paul Turner
@ 2013-09-26 11:16       ` Peter Zijlstra
  2013-09-26 11:39         ` Paul Turner
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-09-26 11:16 UTC (permalink / raw)
  To: Paul Turner; +Cc: Mike Galbraith, Ingo Molnar, Rik van Riel, LKML

On Thu, Sep 26, 2013 at 03:55:55AM -0700, Paul Turner wrote:
> > +               /*
> > +                * Don't bother with select_idle_sibling() in the case of a sync wakeup
> > +                * where we know the only running task will soon go away. Going
> > +                * through select_idle_sibling will only lead to pointless ping-pong.
> > +                */
> > +               if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
> 
> I've long thought of trying something like this.
> 
> I like the intent but I'd go a step further in that I think we want to
> also implicitly extract WF_SYNC itself.

I have vague memories of actually trying something like that a good
number of years ago.. sadly that's all I remember about it.

> What we really then care about is predicting the overlap associated
> with userspace synchronization objects, typically built on top of
> futexes.  Unfortunately the existence/use of per-thread futexes
> reduces how much state you could usefully associate with the futex.
> One approach might be to hash (with some small saturating counter)
> against rip.  But this gets more complicated quite quickly.

Why would you need per-object storage? To further granulate the
predicted overlap? Instead of having one per task, you'd have one per
object?



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26 11:16       ` Peter Zijlstra
@ 2013-09-26 11:39         ` Paul Turner
  2013-09-26 14:35           ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Turner @ 2013-09-26 11:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Ingo Molnar, Rik van Riel, LKML

On Thu, Sep 26, 2013 at 4:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 26, 2013 at 03:55:55AM -0700, Paul Turner wrote:
>> > +               /*
>> > +                * Don't bother with select_idle_sibling() in the case of a sync wakeup
>> > +                * where we know the only running task will soon go away. Going
>> > +                * through select_idle_sibling will only lead to pointless ping-pong.
>> > +                */
>> > +               if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
>>
>> I've long thought of trying something like this.
>>
>> I like the intent but I'd go a step further in that I think we want to
>> also implicitly extract WF_SYNC itself.
>
> I have vague memories of actually trying something like that a good
> number of years ago.. sadly that's all I remember about it.
>
>> What we really then care about is predicting the overlap associated
>> with userspace synchronization objects, typically built on top of
>> futexes.  Unfortunately the existence/use of per-thread futexes
>> reduces how much state you could usefully associate with the futex.
>> One approach might be to hash (with some small saturating counter)
>> against rip.  But this gets more complicated quite quickly.
>
> Why would you need per-object storage? To further granulate the
> predicted overlap? Instead of having one per task, you'd have one per
> object?

It is my intuition that there are a few common objects with fairly
polarized behavior: i.e. for condition variables and producer-consumer
queues, a wakeup strongly predicts blocking, whereas locks protecting
objects, e.g. a mutex, would be expected to show the opposite
behavior.

For this hint to be beneficial you have to get it right frequently:
getting it wrong in the first case hurts cache and in the second hurts
parallelism.  Today we always err on the side of hurting locality
since the cost of getting it wrong is better bounded.  These are
sufficiently common, and likely to be interspersed, that I suspect
allowing them to interact on a thread-wide counter will basically give
a mush result (or even an anti-predictor, since it will strongly
favor the last observation) as an input.


* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  9:58   ` Peter Zijlstra
  2013-09-26 10:05     ` Peter Zijlstra
  2013-09-26 10:55     ` Paul Turner
@ 2013-09-26 13:46     ` Mike Galbraith
  2013-09-26 15:09     ` Michael wang
  3 siblings, 0 replies; 21+ messages in thread
From: Mike Galbraith @ 2013-09-26 13:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Paul Turner, Rik van Riel, linux-kernel

On Thu, 2013-09-26 at 11:58 +0200, Peter Zijlstra wrote: 
> On Wed, Sep 25, 2013 at 10:56:17AM +0200, Mike Galbraith wrote:
> > That will make pipe-test go fugly -> pretty, and help very fast/light
> > localhost network, but eat heavier localhost overlap recovery.  We need
> > a working (and cheap) overlap detector scheme, so we can know when there
> > is enough to be worth going after.
> 
> We used to have an overlap detection thing.. It went away though.

Yup.. guilty as charged your honor.

> But see if you can make something like the below work?

It'll have the same problems (preemption etc.) as the whacked code.
Could bring that back and have another go at making it actually work;
maybe it'll be good enough to cause more gain than pain.

> You could make it a general overlap thing and try without the sync too I
> suppose.. 
> 
> ---
>  include/linux/sched.h |  3 +++
>  kernel/sched/fair.c   | 25 +++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b5344de..5428016 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -974,6 +974,9 @@ struct sched_entity {
>  	u64			vruntime;
>  	u64			prev_sum_exec_runtime;
>  
> +	u64			last_sync_wakeup;
> +	u64			avg_overlap;
> +
>  	u64			nr_migrations;
>  
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2b89cd2..47b0d0f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct sched_entity *se = &p->se;
>  	int task_sleep = flags & DEQUEUE_SLEEP;
>  
> +	if (se->last_sync_wakeup) {
> +		u64 overlap;
> +		s64 diff;
> +	       
> +		overlap = rq->clock - se->last_sync_wakeup;
> +		se->last_sync_wakeup = 0;
> +
> +		diff = overlap - se->avg_overlap;
> +		se->avg_overlap += diff >> 8;
> +	}
> +
>  	for_each_sched_entity(se) {
>  		cfs_rq = cfs_rq_of(se);
>  		dequeue_entity(cfs_rq, se, flags);
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> +	if (sync)
> +		p->se.last_sync_wakeup = sched_clock_cpu(cpu);
> +
>  	if (p->nr_cpus_allowed == 1)
>  		return prev_cpu;
>  
> @@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>  			prev_cpu = cpu;
>  
> +		/*
> +		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
> +		 * where we know the only running task will soon go away. Going
> +		 * through select_idle_sibling will only lead to pointless ping-pong.
> +		 */
> +		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
> +		    current->se.avg_overlap < 10000) {
> +			new_cpu = cpu;
> +			goto unlock;
> +		}
> +
>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}




* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26 11:39         ` Paul Turner
@ 2013-09-26 14:35           ` Peter Zijlstra
  2013-09-26 15:43             ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-09-26 14:35 UTC (permalink / raw)
  To: Paul Turner; +Cc: Mike Galbraith, Ingo Molnar, Rik van Riel, LKML

On Thu, Sep 26, 2013 at 04:39:30AM -0700, Paul Turner wrote:
> It is my intuition that there are a few common objects with fairly
> polarized behavior:  I.e. For condition variables and producer
> consumer queues, a wakeup strongly predicts blocking.  Whereas for
> locks protecting objects, e.g. a Mutex, would be expected to have the
> opposite behavior.

Agreed; however none of those seem to have the property we're looking
for.

Even producer-consumer queues on their own don't generate the
alternating patterns we're looking for with the SYNC hint.

We need a 'guarantee' that the waker is going to stop until the wakee is
done.

What we're looking for is the typical synchronous request-reply like
pattern -- and that doesn't seem to correlate to any one locking object.

Rather, it is an inter-task relation, so task state does make sense for
finding them. We could for instance try to infer which task is
servicing requests; then we know that requesting tasks will sleep
until the reply.



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26  9:58   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2013-09-26 13:46     ` Mike Galbraith
@ 2013-09-26 15:09     ` Michael wang
  2013-09-26 15:44       ` Peter Zijlstra
  3 siblings, 1 reply; 21+ messages in thread
From: Michael wang @ 2013-09-26 15:09 UTC (permalink / raw)
  To: Peter Zijlstra, Mike Galbraith
  Cc: Ingo Molnar, Paul Turner, Rik van Riel, linux-kernel

Hi, Peter

On 09/26/2013 05:58 PM, Peter Zijlstra wrote:
[snip]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2b89cd2..47b0d0f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct sched_entity *se = &p->se;
>  	int task_sleep = flags & DEQUEUE_SLEEP;
> 
> +	if (se->last_sync_wakeup) {
> +		u64 overlap;
> +		s64 diff;
> +	       
> +		overlap = rq->clock - se->last_sync_wakeup;
> +		se->last_sync_wakeup = 0;
> +
> +		diff = overlap - se->avg_overlap;
> +		se->avg_overlap += diff >> 8;
> +	}
> +
>  	for_each_sched_entity(se) {
>  		cfs_rq = cfs_rq_of(se);
>  		dequeue_entity(cfs_rq, se, flags);
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
> 
> +	if (sync)
> +		p->se.last_sync_wakeup = sched_clock_cpu(cpu);

Forgive me, but I'm trying to understand it... why 'p' and not 'current'
here? Do we want how quickly the waker gets off the CPU, or the running
time of the wakee?

Regards,
Michael Wang

> +
>  	if (p->nr_cpus_allowed == 1)
>  		return prev_cpu;
> 
> @@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>  			prev_cpu = cpu;
> 
> +		/*
> +		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
> +		 * where we know the only running task will soon go away. Going
> +		 * through select_idle_sibling will only lead to pointless ping-pong.
> +		 */
> +		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
> +		    current->se.avg_overlap < 10000) {
> +			new_cpu = cpu;
> +			goto unlock;
> +		}
> +
>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}



* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26 14:35           ` Peter Zijlstra
@ 2013-09-26 15:43             ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2013-09-26 15:43 UTC (permalink / raw)
  To: Paul Turner; +Cc: Mike Galbraith, Ingo Molnar, Rik van Riel, LKML

On Thu, Sep 26, 2013 at 04:35:33PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 26, 2013 at 04:39:30AM -0700, Paul Turner wrote:
> > It is my intuition that there are a few common objects with fairly
> > polarized behavior: i.e. for condition variables and producer-consumer
> > queues, a wakeup strongly predicts blocking, whereas locks protecting
> > objects, e.g. a mutex, would be expected to show the opposite
> > behavior.
> 
> Agreed; however none of those seem to have the property we're looking
> for.
> 
> Even producer-consumer queues on their own don't generate the
> alternating patterns we're looking for with the SYNC hint.
> 
> We need a 'guarantee' that the waker is going to stop until the wakee is
> done.
> 
> What we're looking for is the typical synchronous request-reply like
> pattern -- and that doesn't seem to correlate to any one locking object.
> 
> Rather it is an inter-task relation; so task state does make sense in
> finding them. We could for instance try and infer which task is
> servicing requests; and then we know that requesting tasks will sleep
> until reply.
> 

Oh never mind, I see what you meant, the edges in that graph are the
locks.

Can't use RIPs for futexes though; you'd likely end up in the one
pthread_mutex_lock() implementation or such.


* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26 15:09     ` Michael wang
@ 2013-09-26 15:44       ` Peter Zijlstra
  2013-09-27  1:19         ` Michael wang
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-09-26 15:44 UTC (permalink / raw)
  To: Michael wang
  Cc: Mike Galbraith, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On Thu, Sep 26, 2013 at 11:09:46PM +0800, Michael wang wrote:
> > +	if (sync)
> > +		p->se.last_sync_wakeup = sched_clock_cpu(cpu);
> 
> Forgive me, but I'm trying to understand it... why 'p' and not 'current'
> here? Do we want how quickly the waker gets off the CPU, or the running
> time of the wakee?

Because I'm an idiot? ;-)


* Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
  2013-09-26 15:44       ` Peter Zijlstra
@ 2013-09-27  1:19         ` Michael wang
  0 siblings, 0 replies; 21+ messages in thread
From: Michael wang @ 2013-09-27  1:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, Paul Turner, Rik van Riel,
	linux-kernel

On 09/26/2013 11:44 PM, Peter Zijlstra wrote:
> On Thu, Sep 26, 2013 at 11:09:46PM +0800, Michael wang wrote:
>>> +	if (sync)
>>> +		p->se.last_sync_wakeup = sched_clock_cpu(cpu);
>>
>> Forgive me, but I'm trying to understand it... why 'p' and not 'current'
>> here? Do we want how quickly the waker gets off the CPU, or the running
>> time of the wakee?
> 
> Because I'm an idiot? ;-)

Well, I know you are full of good ideas ;-)

Regards,
Michael Wang




end of thread, other threads:[~2013-09-27  1:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-25  7:53 [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true) Peter Zijlstra
2013-09-25  8:56 ` Mike Galbraith
2013-09-26  2:50   ` Michael wang
2013-09-26  3:41     ` Mike Galbraith
2013-09-26  5:12       ` Michael wang
2013-09-26  5:34         ` Mike Galbraith
2013-09-26  6:15           ` Mike Galbraith
2013-09-26  6:32           ` Michael wang
2013-09-26  7:09             ` Mike Galbraith
2013-09-26  7:26               ` Michael wang
2013-09-26  9:58   ` Peter Zijlstra
2013-09-26 10:05     ` Peter Zijlstra
2013-09-26 10:55     ` Paul Turner
2013-09-26 11:16       ` Peter Zijlstra
2013-09-26 11:39         ` Paul Turner
2013-09-26 14:35           ` Peter Zijlstra
2013-09-26 15:43             ` Peter Zijlstra
2013-09-26 13:46     ` Mike Galbraith
2013-09-26 15:09     ` Michael wang
2013-09-26 15:44       ` Peter Zijlstra
2013-09-27  1:19         ` Michael wang
