* Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags @ 2005-07-28 23:08 Chen, Kenneth W 2005-07-28 23:34 ` Nick Piggin 2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar 0 siblings, 2 replies; 26+ messages in thread From: Chen, Kenneth W @ 2005-07-28 23:08 UTC (permalink / raw) To: Ingo Molnar, 'Nick Piggin'; +Cc: linux-kernel, linux-ia64 What sort of workload needs SD_WAKE_AFFINE and SD_WAKE_BALANCE? SD_WAKE_AFFINE is not useful in conjunction with interrupt binding. In fact, it does more harm than good, causing detrimental process migration, destroying process cache affinity, etc. Also, SD_WAKE_BALANCE is giving us performance grief with our industry-standard OLTP workload. To demonstrate the problem, we turned off these two flags in the cpu sd domain and measured a stunning 2.15% performance gain! And deleting all the code in try_to_wake_up() pertaining to load balancing gives us another 0.2% gain. The wake-up path should be made simple: just put the waking task on the runqueue of the cpu it last ran on. Simple and elegant. I'm proposing we either delete these two flags or make them run-time configurable. - Ken --- linux-2.6.12/include/linux/topology.h.orig 2005-07-28 15:54:05.007399685 -0700 +++ linux-2.6.12/include/linux/topology.h 2005-07-28 15:54:39.292555515 -0700 @@ -118,9 +118,7 @@ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ | SD_BALANCE_EXEC \ - | SD_WAKE_AFFINE \ - | SD_WAKE_IDLE \ - | SD_WAKE_BALANCE, \ + | SD_WAKE_IDLE, \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-28 23:08 Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Chen, Kenneth W @ 2005-07-28 23:34 ` Nick Piggin 2005-07-28 23:48 ` Chen, Kenneth W 2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar 1 sibling, 1 reply; 26+ messages in thread From: Nick Piggin @ 2005-07-28 23:34 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Chen, Kenneth W wrote: > What sort of workload needs SD_WAKE_AFFINE and SD_WAKE_BALANCE? > SD_WAKE_AFFINE is not useful in conjunction with interrupt binding. > In fact, it does more harm than good, causing detrimental > process migration, destroying process cache affinity, etc. Also, > SD_WAKE_BALANCE is giving us performance grief with our industry- > standard OLTP workload. > The periodic load balancer basically makes completely undirected, random choices when picking which tasks to move where. Wake balancing provides an opportunity to provide some input bias into the load balancer. For example, suppose you started 100 pairs of tasks which communicate through a pipe. On a 2 CPU system without wake balancing, probably half of the pairs will be on different CPUs. With wake balancing, it should be much better. I've also been told that it improves IO efficiency significantly - obviously that depends on the system and workload. > To demonstrate the problem, we turned off these two flags in the cpu > sd domain and measured a stunning 2.15% performance gain! And deleting > all the code in try_to_wake_up() pertaining to load balancing gives us > another 0.2% gain. > > The wake-up path should be made simple: just put the waking task on > the runqueue of the cpu it last ran on. Simple and elegant. > > I'm proposing we either delete these two flags or make them run-time > configurable. > There have been lots of changes since 2.6.12, including less aggressive wake balancing. 
I hear you might be having problems with recent 2.6.13 kernels? If so, it would be really good to have a look at that before 2.6.13 goes out the door. I appreciate all the effort you're putting into this! Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-28 23:34 ` Nick Piggin @ 2005-07-28 23:48 ` Chen, Kenneth W 2005-07-29 1:25 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Chen, Kenneth W @ 2005-07-28 23:48 UTC (permalink / raw) To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Nick Piggin wrote on Thursday, July 28, 2005 4:35 PM > Wake balancing provides an opportunity to provide some input bias > into the load balancer. > > For example, suppose you started 100 pairs of tasks which communicate > through a pipe. On a 2 CPU system without wake balancing, probably > half of the pairs will be on different CPUs. With wake balancing, > it should be much better. Shouldn't the pipe code use synchronous wakeup? > I hear you might be having problems with recent 2.6.13 kernels? If so, > it would be really good to have a look at that before 2.6.13 goes out the > door. Yes I do :-(. Apparently bumping up cache_hot_time won't give us the performance boost we used to see. - Ken ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-28 23:48 ` Chen, Kenneth W @ 2005-07-29 1:25 ` Nick Piggin 2005-07-29 1:39 ` Chen, Kenneth W 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2005-07-29 1:25 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Chen, Kenneth W wrote: >Nick Piggin wrote on Thursday, July 28, 2005 4:35 PM > >>Wake balancing provides an opportunity to provide some input bias >>into the load balancer. >> >>For example, suppose you started 100 pairs of tasks which communicate >>through a pipe. On a 2 CPU system without wake balancing, probably >>half of the pairs will be on different CPUs. With wake balancing, >>it should be much better. >> > >Shouldn't the pipe code use synchronous wakeup? > > Well, pipes are just an example. It could be any type of communication. What's more, even the synchronous wakeup uses the wake balancing path (although that could be modified to only do wake balancing for synch wakeups; I'd have to be convinced we should special-case pipes and not eg. semaphores or AF_UNIX sockets). > >>I hear you might be having problems with recent 2.6.13 kernels? If so, >>it would be really good to have a look at that before 2.6.13 goes out the >>door. >> > >Yes I do :-(. Apparently bumping up cache_hot_time won't give us the >performance boost we used to see. > > OK, there are probably a number of things we can explore depending on what the symptoms are (eg. excessive idle time, bad cache performance). Unfortunately it is kind of difficult to tune 2.6.13 on the basis of 2.6.12 results - although that's not to say it won't indicate a good avenue to investigate. ^ permalink raw reply [flat|nested] 26+ messages in thread
* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 1:25 ` Nick Piggin @ 2005-07-29 1:39 ` Chen, Kenneth W 2005-07-29 1:46 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Chen, Kenneth W @ 2005-07-29 1:39 UTC (permalink / raw) To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Nick Piggin wrote on Thursday, July 28, 2005 6:25 PM > Well, pipes are just an example. It could be any type of communication. > What's more, even the synchronous wakeup uses the wake balancing path > (although that could be modified to only do wake balancing for synch > wakeups; I'd have to be convinced we should special-case pipes and not > eg. semaphores or AF_UNIX sockets). Why is the normal load balance path not enough (or not able to do the right thing)? The rebalance_tick and idle_balance ought to be enough to take care of the imbalance. What makes load balancing in the wake-up path so special? Oh, I'd like to hear your opinion on what to do with these two flags: make them run-time configurable? (I'm of the opinion that we should delete them altogether.) - Ken ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 1:39 ` Chen, Kenneth W @ 2005-07-29 1:46 ` Nick Piggin 2005-07-29 1:53 ` Chen, Kenneth W 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2005-07-29 1:46 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Chen, Kenneth W wrote: >Nick Piggin wrote on Thursday, July 28, 2005 6:25 PM > >>Well, pipes are just an example. It could be any type of communication. >>What's more, even the synchronous wakeup uses the wake balancing path >>(although that could be modified to only do wake balancing for synch >>wakeups; I'd have to be convinced we should special-case pipes and not >>eg. semaphores or AF_UNIX sockets). >> > > >Why is the normal load balance path not enough (or not able to do the >right thing)? The rebalance_tick and idle_balance ought to be enough to take >care of the imbalance. What makes load balancing in the wake-up path so special? > > Well, the normal load balancing path treats all tasks the same, while the wake path knows if a CPU is waking a remote task and can attempt to maximise the number of local wakeups. >Oh, I'd like to hear your opinion on what to do with these two flags: make >them run-time configurable? (I'm of the opinion that we should delete them altogether.) > > I'd like to try making them less aggressive first if possible. ^ permalink raw reply [flat|nested] 26+ messages in thread
* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 1:46 ` Nick Piggin @ 2005-07-29 1:53 ` Chen, Kenneth W 2005-07-29 2:01 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Chen, Kenneth W @ 2005-07-29 1:53 UTC (permalink / raw) To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Nick Piggin wrote on Thursday, July 28, 2005 6:46 PM > I'd like to try making them less aggressive first if possible. Well, that's exactly what I'm trying to do: make them not aggressive at all by not performing any load balance :-) The workload gets maximum benefit with zero aggressiveness. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 1:53 ` Chen, Kenneth W @ 2005-07-29 2:01 ` Nick Piggin 2005-07-29 6:27 ` Chen, Kenneth W 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2005-07-29 2:01 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Chen, Kenneth W wrote: >Nick Piggin wrote on Thursday, July 28, 2005 6:46 PM > >>I'd like to try making them less aggressive first if possible. >> > >Well, that's exactly what I'm trying to do: make them not aggressive >at all by not performing any load balance :-) The workload gets maximum >benefit with zero aggressiveness. > > Unfortunately we can't forget about other workloads, and we're trying to stay away from runtime tunables in the scheduler. If we can get performance to within a couple of tenths of a percent of the zero balancing case, then that would be preferable I think. ^ permalink raw reply [flat|nested] 26+ messages in thread
* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 2:01 ` Nick Piggin @ 2005-07-29 6:27 ` Chen, Kenneth W 2005-07-29 8:48 ` Nick Piggin 2005-07-29 11:48 ` [patch] remove wake-balancing Ingo Molnar 0 siblings, 2 replies; 26+ messages in thread From: Chen, Kenneth W @ 2005-07-29 6:27 UTC (permalink / raw) To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Nick Piggin wrote on Thursday, July 28, 2005 7:01 PM > Chen, Kenneth W wrote: > >Well, that's exactly what I'm trying to do: make them not aggressive > >at all by not performing any load balance :-) The workload gets maximum > >benefit with zero aggressiveness. > > Unfortunately we can't forget about other workloads, and we're > trying to stay away from runtime tunables in the scheduler. This clearly outlines an issue with the implementation. Optimizing for one type of workload has a detrimental effect on another workload and vice versa. > If we can get performance to within a couple of tenths of a percent > of the zero balancing case, then that would be preferable I think. I won't try to compromise between the two. If you do so, we would end up with two half-baked raw turkeys. Making load balancing in the wake-up path less aggressive would probably reduce performance for the type of workload you quoted earlier, and for the db workload we don't want any of it at all, not even the code to determine whether it should be balanced or not. Do you have an example of the workload you mentioned earlier that depends on SD_WAKE_BALANCE? I would like to experiment with it so we can move this forward instead of paper talk. - Ken ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 6:27 ` Chen, Kenneth W @ 2005-07-29 8:48 ` Nick Piggin 2005-07-29 8:53 ` Ingo Molnar ` (2 more replies) 2005-07-29 11:48 ` [patch] remove wake-balancing Ingo Molnar 1 sibling, 3 replies; 26+ messages in thread From: Nick Piggin @ 2005-07-29 8:48 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64 Chen, Kenneth W wrote: > Nick Piggin wrote on Thursday, July 28, 2005 7:01 PM > This clearly outlines an issue with the implementation. Optimizing for one > type of workload has a detrimental effect on another workload and vice versa. > Yep. That comes up fairly regularly when tuning the scheduler :( > > I won't try to compromise between the two. If you do so, we would end up > with two half-baked raw turkeys. Making load balancing in the wake-up > path less aggressive would probably reduce performance for the type of workload you > quoted earlier, and for the db workload we don't want any of it at all, not > even the code to determine whether it should be balanced or not. > Well, that remains to be seen. If it can be made _smarter_, then you may not have to take such a big compromise. But either way, there will have to be some compromise made. At the very least you have to find some acceptable default. > Do you have an example of the workload you mentioned earlier that depends on > SD_WAKE_BALANCE? I would like to experiment with it so we can move this > forward instead of paper talk. > Well, you can easily see suboptimal scheduling decisions on many programs with lots of interprocess communication. For example, tbench on a dual Xeon: processes 1 2 3 4 2.6.13-rc4: 187, 183, 179 260, 259, 256 340, 320, 349 504, 496, 500 no wake-bal: 180, 180, 177 254, 254, 253 268, 270, 348 345, 290, 500 Numbers are MB/s, higher is better. Networking or other IO workloads where processes are tightly coupled to a specific adapter / interrupt source can also see pretty good gains. 
-- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 8:48 ` Nick Piggin @ 2005-07-29 8:53 ` Ingo Molnar 2005-07-29 8:59 ` Nick Piggin 2005-07-29 9:07 ` Ingo Molnar 2005-07-29 16:40 ` Ingo Molnar 2 siblings, 1 reply; 26+ messages in thread From: Ingo Molnar @ 2005-07-29 8:53 UTC (permalink / raw) To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64 * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Well, you can easily see suboptimal scheduling decisions on many > programs with lots of interprocess communication. For example, tbench > on a dual Xeon: > > processes 1 2 3 4 > > 2.6.13-rc4: 187, 183, 179 260, 259, 256 340, 320, 349 504, 496, 500 > no wake-bal: 180, 180, 177 254, 254, 253 268, 270, 348 345, 290, 500 > > Numbers are MB/s, higher is better. what type of network was used - localhost or a real one? Ingo ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 8:53 ` Ingo Molnar @ 2005-07-29 8:59 ` Nick Piggin 2005-07-29 9:01 ` Ingo Molnar 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2005-07-29 8:59 UTC (permalink / raw) To: Ingo Molnar; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64 Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>processes 1 2 3 4 >> >>2.6.13-rc4: 187, 183, 179 260, 259, 256 340, 320, 349 504, 496, 500 >>no wake-bal: 180, 180, 177 254, 254, 253 268, 270, 348 345, 290, 500 >> >>Numbers are MB/s, higher is better. > > > what type of network was used - localhost or a real one? > Localhost. Yeah, it isn't a real-world test, but it does show the erratic behaviour without wake affine. I don't have a setup with multiple fast network adapters, otherwise I would have run a similar test using a real network. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 8:59 ` Nick Piggin @ 2005-07-29 9:01 ` Ingo Molnar 0 siblings, 0 replies; 26+ messages in thread From: Ingo Molnar @ 2005-07-29 9:01 UTC (permalink / raw) To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64 * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>processes 1 2 3 4 > >> > >>2.6.13-rc4: 187, 183, 179 260, 259, 256 340, 320, 349 504, 496, 500 > >>no wake-bal: 180, 180, 177 254, 254, 253 268, 270, 348 345, 290, 500 > >> > >>Numbers are MB/s, higher is better. > > > > > >what type of network was used - localhost or a real one? > > > > Localhost. Yeah it isn't a real world test, but it does show the > erratic behaviour without wake affine. yeah - fine enough. (It's not representative for IO workloads, but it's representative for local IPC workloads, just wanted to know precisely which workload it is.) Ingo ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 8:48 ` Nick Piggin 2005-07-29 8:53 ` Ingo Molnar @ 2005-07-29 9:07 ` Ingo Molnar 2005-07-29 16:40 ` Ingo Molnar 2 siblings, 0 replies; 26+ messages in thread From: Ingo Molnar @ 2005-07-29 9:07 UTC (permalink / raw) To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64 * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Well, you can easily see suboptimal scheduling decisions on many > programs with lots of interprocess communication. For example, tbench > on a dual Xeon: > > processes 1 2 3 4 > > 2.6.13-rc4: 187, 183, 179 260, 259, 256 340, 320, 349 504, 496, 500 > no wake-bal: 180, 180, 177 254, 254, 253 268, 270, 348 345, 290, 500 > > Numbers are MB/s, higher is better. i cannot see any difference with/without wake-balancing in this workload, on a dual Xeon. Could you try the quick hack below and do: echo 1 > /proc/sys/kernel/panic # turn on wake-balancing echo 0 > /proc/sys/kernel/panic # turn off wake-balancing does the runtime switching show any effects on the throughput numbers tbench is showing? I'm using dbench-3.03. (i only checked the status numbers, didnt do full runs) (did you have SCHED_SMT enabled?) Ingo kernel/sched.c | 2 ++ 1 files changed, 2 insertions(+) Index: linux-prefetch-task/kernel/sched.c =================================================================== --- linux-prefetch-task.orig/kernel/sched.c +++ linux-prefetch-task/kernel/sched.c @@ -1155,6 +1155,8 @@ static int try_to_wake_up(task_t * p, un goto out_activate; new_cpu = cpu; + if (!panic_timeout) + goto out_set_cpu; schedstat_inc(rq, ttwu_cnt); if (cpu == this_cpu) { ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags 2005-07-29 8:48 ` Nick Piggin 2005-07-29 8:53 ` Ingo Molnar 2005-07-29 9:07 ` Ingo Molnar @ 2005-07-29 16:40 ` Ingo Molnar 2 siblings, 0 replies; 26+ messages in thread From: Ingo Molnar @ 2005-07-29 16:40 UTC (permalink / raw) To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64 * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Chen, Kenneth W wrote: > >Nick Piggin wrote on Thursday, July 28, 2005 7:01 PM > > >This clearly outlines an issue with the implementation. Optimize for one > >type of workload has detrimental effect on another workload and vice versa. > > > > Yep. That comes up fairly regularly when tuning the scheduler :( in this particular case we can clearly separate the two workloads though: CPU-overload (Ken's benchmark) vs. half-load (3-task tbench). So by checking for migration target/source idleness we can have a hard separator for wakeup balancing. (whether it works out for both types of workloads remains to be seen) Ingo ^ permalink raw reply [flat|nested] 26+ messages in thread
* [patch] remove wake-balancing 2005-07-29 6:27 ` Chen, Kenneth W 2005-07-29 8:48 ` Nick Piggin @ 2005-07-29 11:48 ` Ingo Molnar 2005-07-29 14:13 ` [sched, patch] better wake-balancing Ingo Molnar 1 sibling, 1 reply; 26+ messages in thread From: Ingo Molnar @ 2005-07-29 11:48 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote: > > If we can get performance to within a couple of tenths of a percent > > of the zero balancing case, then that would be preferable I think. > > I won't try to compromise between the two. If you do so, we would end > up with two half-baked raw turkeys. Making load balancing in the > wake-up path less aggressive would probably reduce performance for the > type of workload you quoted earlier, and for the db workload we don't want > any of it at all, not even the code to determine whether it should > be balanced or not. i think we could try to get rid of wakeup-time balancing altogether. these days pretty much the only times we can sensibly do 'fast' (as in immediate) migration are fork/clone and exec. Furthermore, the gained simplicity of wakeup is quite compelling too. (Originally, when i introduced the first variant of wakeup-time balancing eons ago, we didn't have anything like fork-time and exec-time balancing.) i think we could try the patch below in -mm; it removes (non-)affine wakeup and passive wakeup-balancing, but keeps SD_WAKE_IDLE, which is needed for efficient SMT scheduling. I test-booted the patch on x86, and it should work on all architectures. (I have tested various local-IPC and non-IPC workloads and only found performance improvements - but i'm sure regressions exist too, and need to be examined.) Ingo ------ remove wakeup-time balancing. It turns out exec-time and fork-time balancing combined with periodic rebalancing ticks does a good enough job. 
Signed-off-by: Ingo Molnar <mingo@elte.hu> include/asm-i386/topology.h | 3 - include/asm-ia64/topology.h | 6 -- include/asm-mips/mach-ip27/topology.h | 3 - include/asm-ppc64/topology.h | 3 - include/asm-x86_64/topology.h | 3 - include/linux/sched.h | 4 - include/linux/topology.h | 4 - kernel/sched.c | 89 +++------------------------------- 8 files changed, 16 insertions(+), 99 deletions(-) Index: linux-prefetch-task/include/asm-i386/topology.h =================================================================== --- linux-prefetch-task.orig/include/asm-i386/topology.h +++ linux-prefetch-task/include/asm-i386/topology.h @@ -81,8 +81,7 @@ static inline int node_to_first_cpu(int .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_EXEC \ - | SD_BALANCE_FORK \ - | SD_WAKE_BALANCE, \ + | SD_BALANCE_FORK, \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ Index: linux-prefetch-task/include/asm-ia64/topology.h =================================================================== --- linux-prefetch-task.orig/include/asm-ia64/topology.h +++ linux-prefetch-task/include/asm-ia64/topology.h @@ -65,8 +65,7 @@ void build_cpu_to_node_map(void); .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ - | SD_BALANCE_EXEC \ - | SD_WAKE_AFFINE, \ + | SD_BALANCE_EXEC, \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ @@ -91,8 +90,7 @@ void build_cpu_to_node_map(void); .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_EXEC \ - | SD_BALANCE_FORK \ - | SD_WAKE_BALANCE, \ + | SD_BALANCE_FORK, \ .last_balance = jiffies, \ .balance_interval = 64, \ .nr_balance_failed = 0, \ Index: linux-prefetch-task/include/asm-mips/mach-ip27/topology.h =================================================================== --- linux-prefetch-task.orig/include/asm-mips/mach-ip27/topology.h +++ linux-prefetch-task/include/asm-mips/mach-ip27/topology.h @@ -28,8 +28,7 @@ extern unsigned char __node_distances[MA 
.cache_nice_tries = 1, \ .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_EXEC \ - | SD_WAKE_BALANCE, \ + | SD_BALANCE_EXEC, \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ Index: linux-prefetch-task/include/asm-ppc64/topology.h =================================================================== --- linux-prefetch-task.orig/include/asm-ppc64/topology.h +++ linux-prefetch-task/include/asm-ppc64/topology.h @@ -52,8 +52,7 @@ static inline int node_to_first_cpu(int .flags = SD_LOAD_BALANCE \ | SD_BALANCE_EXEC \ | SD_BALANCE_NEWIDLE \ - | SD_WAKE_IDLE \ - | SD_WAKE_BALANCE, \ + | SD_WAKE_IDLE, \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ Index: linux-prefetch-task/include/asm-x86_64/topology.h =================================================================== --- linux-prefetch-task.orig/include/asm-x86_64/topology.h +++ linux-prefetch-task/include/asm-x86_64/topology.h @@ -48,8 +48,7 @@ extern int __node_distance(int, int); .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_FORK \ - | SD_BALANCE_EXEC \ - | SD_WAKE_BALANCE, \ + | SD_BALANCE_EXEC, \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ Index: linux-prefetch-task/include/linux/sched.h =================================================================== --- linux-prefetch-task.orig/include/linux/sched.h +++ linux-prefetch-task/include/linux/sched.h @@ -471,9 +471,7 @@ enum idle_type #define SD_BALANCE_EXEC 4 /* Balance on exec */ #define SD_BALANCE_FORK 8 /* Balance on fork, clone */ #define SD_WAKE_IDLE 16 /* Wake to idle CPU on task wakeup */ -#define SD_WAKE_AFFINE 32 /* Wake task to waking CPU */ -#define SD_WAKE_BALANCE 64 /* Perform balancing at task wakeup */ -#define SD_SHARE_CPUPOWER 128 /* Domain members share cpu power */ +#define SD_SHARE_CPUPOWER 32 /* Domain members share cpu power */ struct sched_group { struct sched_group *next; /* Must be a circular list */ Index: 
linux-prefetch-task/include/linux/topology.h =================================================================== --- linux-prefetch-task.orig/include/linux/topology.h +++ linux-prefetch-task/include/linux/topology.h @@ -97,7 +97,6 @@ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ | SD_BALANCE_EXEC \ - | SD_WAKE_AFFINE \ | SD_WAKE_IDLE \ | SD_SHARE_CPUPOWER, \ .last_balance = jiffies, \ @@ -127,8 +126,7 @@ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ - | SD_BALANCE_EXEC \ - | SD_WAKE_AFFINE, \ + | SD_BALANCE_EXEC, \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ Index: linux-prefetch-task/kernel/sched.c =================================================================== --- linux-prefetch-task.orig/kernel/sched.c +++ linux-prefetch-task/kernel/sched.c @@ -254,7 +254,6 @@ struct runqueue { /* try_to_wake_up() stats */ unsigned long ttwu_cnt; - unsigned long ttwu_local; #endif }; @@ -373,7 +372,7 @@ static inline void task_rq_unlock(runque * bump this up when changing the output format or the meaning of an existing * format, so that tools can adapt (or abort) */ -#define SCHEDSTAT_VERSION 12 +#define SCHEDSTAT_VERSION 13 static int show_schedstat(struct seq_file *seq, void *v) { @@ -390,11 +389,11 @@ static int show_schedstat(struct seq_fil /* runqueue-specific stats */ seq_printf(seq, - "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu", + "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu", cpu, rq->yld_both_empty, rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt, rq->sched_switch, rq->sched_cnt, rq->sched_goidle, - rq->ttwu_cnt, rq->ttwu_local, + rq->ttwu_cnt, rq->rq_sched_info.cpu_time, rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt); @@ -424,8 +423,7 @@ static int show_schedstat(struct seq_fil seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n", sd->alb_cnt, sd->alb_failed, sd->alb_pushed, sd->sbe_cnt, sd->sbe_balanced, sd->sbe_pushed, - sd->sbf_cnt, sd->sbf_balanced, 
sd->sbf_pushed, - sd->ttwu_wake_remote, sd->ttwu_move_affine, sd->ttwu_move_balance); + sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed); } preempt_enable(); #endif @@ -1134,8 +1132,6 @@ static int try_to_wake_up(task_t * p, un long old_state; runqueue_t *rq; #ifdef CONFIG_SMP - unsigned long load, this_load; - struct sched_domain *sd, *this_sd = NULL; int new_cpu; #endif @@ -1154,77 +1150,13 @@ static int try_to_wake_up(task_t * p, un if (unlikely(task_running(rq, p))) goto out_activate; - new_cpu = cpu; - schedstat_inc(rq, ttwu_cnt); - if (cpu == this_cpu) { - schedstat_inc(rq, ttwu_local); - goto out_set_cpu; - } - - for_each_domain(this_cpu, sd) { - if (cpu_isset(cpu, sd->span)) { - schedstat_inc(sd, ttwu_wake_remote); - this_sd = sd; - break; - } - } - - if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed))) - goto out_set_cpu; /* - * Check for affine wakeup and passive balancing possibilities. + * Wake to the CPU the task was last running on (or any + * nearby SMT-equivalent idle CPU): */ - if (this_sd) { - int idx = this_sd->wake_idx; - unsigned int imbalance; - - imbalance = 100 + (this_sd->imbalance_pct - 100) / 2; - - load = source_load(cpu, idx); - this_load = target_load(this_cpu, idx); - - new_cpu = this_cpu; /* Wake to this CPU if we can */ - - if (this_sd->flags & SD_WAKE_AFFINE) { - unsigned long tl = this_load; - /* - * If sync wakeup then subtract the (maximum possible) - * effect of the currently running task from the load - * of the current CPU: - */ - if (sync) - tl -= SCHED_LOAD_SCALE; - - if ((tl <= load && - tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) || - 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) { - /* - * This domain has SD_WAKE_AFFINE and - * p is cache cold in this domain, and - * there is no bad imbalance. - */ - schedstat_inc(this_sd, ttwu_move_affine); - goto out_set_cpu; - } - } - - /* - * Start passive balancing when half the imbalance_pct - * limit is reached. 
- */ - if (this_sd->flags & SD_WAKE_BALANCE) { - if (imbalance*this_load <= 100*load) { - schedstat_inc(this_sd, ttwu_move_balance); - goto out_set_cpu; - } - } - } - - new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */ -out_set_cpu: - new_cpu = wake_idle(new_cpu, p); + new_cpu = wake_idle(cpu, p); if (new_cpu != cpu) { set_task_cpu(p, new_cpu); task_rq_unlock(rq, &flags); @@ -4758,9 +4690,7 @@ static int sd_degenerate(struct sched_do } /* Following flags don't use groups */ - if (sd->flags & (SD_WAKE_IDLE | - SD_WAKE_AFFINE | - SD_WAKE_BALANCE)) + if (sd->flags & SD_WAKE_IDLE) return 0; return 1; @@ -4778,9 +4708,6 @@ static int sd_parent_degenerate(struct s return 0; /* Does parent contain flags not in child? */ - /* WAKE_BALANCE is a subset of WAKE_AFFINE */ - if (cflags & SD_WAKE_AFFINE) - pflags &= ~SD_WAKE_BALANCE; /* Flags needing groups don't count if only 1 group in parent */ if (parent->groups == parent->groups->next) { pflags &= ~(SD_LOAD_BALANCE | ^ permalink raw reply [flat|nested] 26+ messages in thread
* [sched, patch] better wake-balancing 2005-07-29 11:48 ` [patch] remove wake-balancing Ingo Molnar @ 2005-07-29 14:13 ` Ingo Molnar 2005-07-29 15:02 ` [sched, patch] better wake-balancing, #2 Ingo Molnar 0 siblings, 1 reply; 26+ messages in thread From: Ingo Molnar @ 2005-07-29 14:13 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton another approach would be the patch below, to do wakeup-balancing only if the wakeup CPU or the task CPU is idle. I've measured half-loaded tbench and, unlike with total wakeup-balancing removal, it does not degrade with this patch applied, while fully loaded tbench and other workloads clearly improve. Ken, could you give this one a try? (It's against the current scheduler queue in -mm, but also applies fine to current Linus trees.) Ingo --- do wakeup-balancing only if the wakeup-CPU or the task-CPU is idle. this prevents excessive wakeup-balancing while the system is highly loaded, but helps spread out the workload on partly idle systems. Signed-off-by: Ingo Molnar <mingo@elte.hu> kernel/sched.c | 6 ++++++ 1 files changed, 6 insertions(+) Index: linux-sched-curr/kernel/sched.c =================================================================== --- linux-sched-curr.orig/kernel/sched.c +++ linux-sched-curr/kernel/sched.c @@ -1252,7 +1252,13 @@ static int try_to_wake_up(task_t *p, uns if (unlikely(task_running(rq, p))) goto out_activate; + /* + * If neither this CPU, nor the previous CPU the task was + * running on is idle then skip wakeup-balancing: + */ new_cpu = cpu; + if (!idle_cpu(this_cpu) && !idle_cpu(cpu)) + goto out_set_cpu; schedstat_inc(rq, ttwu_cnt); if (cpu == this_cpu) { ^ permalink raw reply [flat|nested] 26+ messages in thread
* [sched, patch] better wake-balancing, #2
  2005-07-29 14:13               ` [sched, patch] better wake-balancing Ingo Molnar
@ 2005-07-29 15:02                 ` Ingo Molnar
  2005-07-29 16:21                   ` [sched, patch] better wake-balancing, #3 Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2005-07-29 15:02 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> another approach would be the patch below, to do wakeup-balancing only
> if the wakeup CPU or the task CPU is idle.

there's an even simpler way: only do wakeup-balancing if this_cpu is
idle. (tbench results are still OK, and other workloads improved.)

	Ingo

--------
do wakeup-balancing only if the wakeup-CPU is idle. this prevents
excessive wakeup-balancing while the system is highly loaded, but
helps spread out the workload on partly idle systems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c |    6 ++++++
 1 files changed, 6 insertions(+)

Index: linux-sched-curr/kernel/sched.c
===================================================================
--- linux-sched-curr.orig/kernel/sched.c
+++ linux-sched-curr/kernel/sched.c
@@ -1253,7 +1253,13 @@ static int try_to_wake_up(task_t *p, uns
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;

+	/*
+	 * Only do wakeup-balancing (== potentially migrate the task)
+	 * if this CPU is idle:
+	 */
 	new_cpu = cpu;
+	if (!idle_cpu(this_cpu))
+		goto out_set_cpu;

 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {
* [sched, patch] better wake-balancing, #3
  2005-07-29 15:02                 ` [sched, patch] better wake-balancing, #2 Ingo Molnar
@ 2005-07-29 16:21                   ` Ingo Molnar
  2005-07-30  0:08                     ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2005-07-29 16:21 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> there's an even simpler way: only do wakeup-balancing if this_cpu is
> idle. (tbench results are still OK, and other workloads improved.)

here's an updated patch. It handles one more detail: on SCHED_SMT we
should check the idleness of siblings too. Benchmark numbers still look
good.

	Ingo

----
do wakeup-balancing only if the wakeup-CPU (or any of its siblings) is
idle. this prevents excessive wakeup-balancing while the system is
highly loaded, but helps spread out the workload on partly idle
systems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c |    6 ++++++
 1 files changed, 6 insertions(+)

Index: linux-sched-curr/kernel/sched.c
===================================================================
--- linux-sched-curr.orig/kernel/sched.c
+++ linux-sched-curr/kernel/sched.c
@@ -1253,7 +1253,13 @@ static int try_to_wake_up(task_t *p, uns
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;

+	/*
+	 * Only do wakeup-balancing (== potentially migrate the task)
+	 * if this CPU (or any SMT sibling) is idle:
+	 */
 	new_cpu = cpu;
+	if (!idle_cpu(this_cpu) && this_cpu == wake_idle(this_cpu, p))
+		goto out_set_cpu;

 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {
* Re: [sched, patch] better wake-balancing, #3
  2005-07-29 16:21                   ` [sched, patch] better wake-balancing, #3 Ingo Molnar
@ 2005-07-30  0:08                     ` Nick Piggin
  2005-07-30  7:19                       ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2005-07-30 0:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64, Andrew Morton

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> there's an even simpler way: only do wakeup-balancing if this_cpu is
>> idle. (tbench results are still OK, and other workloads improved.)
>
> here's an updated patch. It handles one more detail: on SCHED_SMT we
> should check the idleness of siblings too. Benchmark numbers still look
> good.

Maybe. Ken hasn't measured the effect of wake balancing in 2.6.13,
which is quite a lot different to that found in 2.6.12.

I don't really like having a hard cutoff like that - wake balancing
can be important for IO workloads, though I haven't measured for a
long time.

In IPC workloads, the cache affinity of local wakeups becomes less
apparent when the runqueue gets lots of tasks on it, however the
benefits of IO affinity will generally remain. Especially on NUMA
systems.

fork/clone/exec/etc balancing really doesn't do anything to capture
this kind of relationship between tasks and between tasks and IRQ
sources. Without wake balancing we basically have a completely random
scattering of tasks.

-- 
SUSE Labs, Novell Inc.
* Re: [sched, patch] better wake-balancing, #3
  2005-07-30  0:08                     ` Nick Piggin
@ 2005-07-30  7:19                       ` Ingo Molnar
  2005-07-31  1:15                         ` Nick Piggin
  2005-08-08 23:18                         ` Chen, Kenneth W
  0 siblings, 2 replies; 26+ messages in thread
From: Ingo Molnar @ 2005-07-30 7:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chen, Kenneth W, linux-kernel, linux-ia64, Andrew Morton,
	John Hawkes, Martin J. Bligh, Paul Jackson

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > here's an updated patch. It handles one more detail: on SCHED_SMT we
> > should check the idleness of siblings too. Benchmark numbers still
> > look good.
>
> Maybe. Ken hasn't measured the effect of wake balancing in 2.6.13,
> which is quite a lot different to that found in 2.6.12.
>
> I don't really like having a hard cutoff like that - wake balancing can
> be important for IO workloads, though I haven't measured for a long
> time. [...]

well, i have measured it, and it was a win for just about everything
that is not idle, and even for an IPC (SysV semaphores) half-idle
workload i've measured a 3% gain. No performance loss in tbench either,
which is clearly the most sensitive to affine/passive balancing. But
i'd like to see what Ken's (and others') numbers are.

the hard cutoff also has the benefit that it allows us to potentially
make wakeup migration _more_ aggressive in the future. So instead of
having to think about weakening it due to the tradeoffs present in
e.g. Ken's workload, we can actually make it stronger.

> [...] In IPC workloads, the cache affinity of local wakeups becomes
> less apparent when the runqueue gets lots of tasks on it, however
> benefits of IO affinity will generally remain. Especially on NUMA
> systems.

especially on NUMA, if the migration-target CPU (this_cpu) is not at
least partially idle, i'd be quite uneasy to passive-balance from
another node. I suspect this needs numbers from Martin and John?

> fork/clone/exec/etc balancing really doesn't do anything to capture
> this kind of relationship between tasks and between tasks and IRQ
> sources. Without wake balancing we basically have a completely random
> scattering of tasks.

Ken's workload is a heavy IO one with lots of IRQ sources. And
precisely for such types of workload, usually the best tactic is to
leave the task alone and queue it wherever it last ran.

whenever there's a strong (and exclusive) relationship between tasks
and individual interrupt sources, explicit binding to CPUs/groups of
CPUs is the best method. In any case, more measurements are needed.

	Ingo
* Re: [sched, patch] better wake-balancing, #3
  2005-07-30  7:19                       ` Ingo Molnar
@ 2005-07-31  1:15                         ` Nick Piggin
  2005-08-01 17:13                           ` Siddha, Suresh B
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2005-07-31 1:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chen, Kenneth W, linux-kernel, linux-ia64, Andrew Morton,
	John Hawkes, Martin J. Bligh, Paul Jackson

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> I don't really like having a hard cutoff like that - wake balancing can
>> be important for IO workloads, though I haven't measured for a long
>> time. [...]
>
> well, i have measured it, and it was a win for just about everything

I meant: measured for IO workloads. I had one group tell me their IO
efficiency went up by several *times* on a 16-way NUMA system after
generalising the wake balancing to interrupts as well.

> that is not idle, and even for an IPC (SysV semaphores) half-idle
> workload i've measured a 3% gain. No performance loss in tbench either,
> which is clearly the most sensitive to affine/passive balancing. But
> i'd like to see what Ken's (and others') numbers are.
>
> the hard cutoff also has the benefit that it allows us to potentially
> make wakeup migration _more_ aggressive in the future. So instead of
> having to think about weakening it due to the tradeoffs present in
> e.g. Ken's workload, we can actually make it stronger.

That would make the behaviour change even more violent, which is what
I dislike. I would much prefer to have code that handles both
workloads without introducing sudden cutoff points in behaviour.

> especially on NUMA, if the migration-target CPU (this_cpu) is not at
> least partially idle, i'd be quite uneasy to passive-balance from
> another node. I suspect this needs numbers from Martin and John?

Passive balancing cuts in only when an imbalance is becoming apparent.
If the queue gets more imbalanced, periodic balancing will cut in, and
that is much worse than wake balancing.

>> fork/clone/exec/etc balancing really doesn't do anything to capture
>> this kind of relationship between tasks and between tasks and IRQ
>> sources. Without wake balancing we basically have a completely random
>> scattering of tasks.
>
> Ken's workload is a heavy IO one with lots of IRQ sources. And
> precisely for such types of workload, usually the best tactic is to
> leave the task alone and queue it wherever it last ran.

Yep, I agree the wake balancing code in 2.6.12 wasn't ideal. That's
why I changed it in 2.6.13 - precisely because it moved things around
too much. It probably still isn't ideal though.

> whenever there's a strong (and exclusive) relationship between tasks
> and individual interrupt sources, explicit binding to CPUs/groups of
> CPUs is the best method. In any case, more measurements are needed.

Well, I wouldn't say it is always the best method. Especially not when
there is a big variation in the CPU consumption of the groups of
tasks. But anyway, even in the cases where it definitely is the best
method, we really should try to handle them properly without binding
too.

I do agree that more measurements are needed :)

-- 
SUSE Labs, Novell Inc.
* Re: [sched, patch] better wake-balancing, #3
  2005-07-31  1:15                         ` Nick Piggin
@ 2005-08-01 17:13                           ` Siddha, Suresh B
  0 siblings, 0 replies; 26+ messages in thread
From: Siddha, Suresh B @ 2005-08-01 17:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Chen, Kenneth W, linux-kernel, linux-ia64,
	Andrew Morton, John Hawkes, Martin J. Bligh, Paul Jackson

On Sun, Jul 31, 2005 at 11:15:16AM +1000, Nick Piggin wrote:
> Ingo Molnar wrote:
> > especially on NUMA, if the migration-target CPU (this_cpu) is not at
> > least partially idle, i'd be quite uneasy to passive balance from
> > another node. I suspect this needs numbers from Martin and John?
>
> Passive balancing cuts in only when an imbalance is becoming apparent.
> If the queue gets more imbalanced, periodic balancing will cut in,
> and that is much worse than wake balancing.

Another point to note about the current wake balance: the imbalance
calculation does not take the complete load of the sched group into
account. I think there might be scenarios where the current wake
balance will actually introduce imbalances that are corrected later by
periodic balancing.

thanks,
suresh
* RE: [sched, patch] better wake-balancing, #3
  2005-07-30  7:19                       ` Ingo Molnar
  2005-07-31  1:15                         ` Nick Piggin
@ 2005-08-08 23:18                         ` Chen, Kenneth W
  1 sibling, 0 replies; 26+ messages in thread
From: Chen, Kenneth W @ 2005-08-08 23:18 UTC (permalink / raw)
  To: 'Ingo Molnar', Nick Piggin
  Cc: linux-kernel, linux-ia64, Andrew Morton, John Hawkes,
	Martin J. Bligh, Paul Jackson

Ingo Molnar wrote on Saturday, July 30, 2005 12:19 AM
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > > here's an updated patch. It handles one more detail: on SCHED_SMT we
> > > should check the idleness of siblings too. Benchmark numbers still
> > > look good.
> >
> > Maybe. Ken hasn't measured the effect of wake balancing in 2.6.13,
> > which is quite a lot different to that found in 2.6.12.
> >
> > I don't really like having a hard cutoff like that - wake balancing can
> > be important for IO workloads, though I haven't measured for a long
> > time. [...]
>
> well, i have measured it, and it was a win for just about everything
> that is not idle, and even for an IPC (SysV semaphores) half-idle
> workload i've measured a 3% gain. No performance loss in tbench either,
> which is clearly the most sensitive to affine/passive balancing. But i'd
> like to see what Ken's (and others') numbers are.
>
> the hard cutoff also has the benefit that it allows us to potentially
> make wakeup migration _more_ aggressive in the future. So instead of
> having to think about weakening it due to the tradeoffs present in e.g.
> Ken's workload, we can actually make it stronger.

Sorry it took us a while to get the experiment done on our large db
setup. This patch is as effective as turning off both SD_WAKE_BALANCE
and SD_WAKE_AFFINE (+2.2% on the db OLTP workload). We like it a lot.

- Ken
* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-28 23:08 Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Chen, Kenneth W
  2005-07-28 23:34 ` Nick Piggin
@ 2005-07-29 11:26 ` Ingo Molnar
  2005-07-29 17:30   ` Chen, Kenneth W
  1 sibling, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2005-07-29 11:26 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Nick Piggin', linux-kernel, linux-ia64

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> To demonstrate the problem, we turned off these two flags in the cpu
> sd domain and measured a stunning 2.15% performance gain! And
> deleting all the code in the try_to_wake_up() pertain to load
> balancing gives us another 0.2% gain.

another thing: do you have a HT-capable ia64 CPU, and do you have
CONFIG_SCHED_SMT turned on? If yes then could you try to turn off
SD_WAKE_IDLE too? i found it to bring further performance improvements
in certain workloads.

	Ingo
* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar
@ 2005-07-29 17:30   ` Chen, Kenneth W
  0 siblings, 0 replies; 26+ messages in thread
From: Chen, Kenneth W @ 2005-07-29 17:30 UTC (permalink / raw)
  To: 'Ingo Molnar'; +Cc: 'Nick Piggin', linux-kernel, linux-ia64

Ingo Molnar wrote on Friday, July 29, 2005 4:26 AM
> * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> > To demonstrate the problem, we turned off these two flags in the cpu
> > sd domain and measured a stunning 2.15% performance gain! And
> > deleting all the code in the try_to_wake_up() pertain to load
> > balancing gives us another 0.2% gain.
>
> another thing: do you have a HT-capable ia64 CPU, and do you have
> CONFIG_SCHED_SMT turned on? If yes then could you try to turn off
> SD_WAKE_IDLE too, i found it to bring further performance improvements
> in certain workloads.

The scheduler experiments done so far are on a non-SMT CPU (Madison
processor). We have another db setup with a multi-thread capable ia64
CPU (Montecito - to be precise, it is SoEMT capable). We are just
about to do scheduler experiments on that setup.
end of thread, other threads:[~2005-08-08 23:19 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-28 23:08 Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Chen, Kenneth W
2005-07-28 23:34 ` Nick Piggin
2005-07-28 23:48   ` Chen, Kenneth W
2005-07-29  1:25     ` Nick Piggin
2005-07-29  1:39       ` Chen, Kenneth W
2005-07-29  1:46         ` Nick Piggin
2005-07-29  1:53           ` Chen, Kenneth W
2005-07-29  2:01             ` Nick Piggin
2005-07-29  6:27               ` Chen, Kenneth W
2005-07-29  8:48                 ` Nick Piggin
2005-07-29  8:53             ` Ingo Molnar
2005-07-29  8:59               ` Nick Piggin
2005-07-29  9:01                 ` Ingo Molnar
2005-07-29  9:07                   ` Ingo Molnar
2005-07-29 16:40                     ` Ingo Molnar
2005-07-29 11:48             ` [patch] remove wake-balancing Ingo Molnar
2005-07-29 14:13               ` [sched, patch] better wake-balancing Ingo Molnar
2005-07-29 15:02                 ` [sched, patch] better wake-balancing, #2 Ingo Molnar
2005-07-29 16:21                   ` [sched, patch] better wake-balancing, #3 Ingo Molnar
2005-07-30  0:08                     ` Nick Piggin
2005-07-30  7:19                       ` Ingo Molnar
2005-07-31  1:15                         ` Nick Piggin
2005-08-01 17:13                           ` Siddha, Suresh B
2005-08-08 23:18                         ` Chen, Kenneth W
2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar
2005-07-29 17:30   ` Chen, Kenneth W