public inbox for linux-kernel@vger.kernel.org
* Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
@ 2005-07-28 23:08 Chen, Kenneth W
  2005-07-28 23:34 ` Nick Piggin
  2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar
  0 siblings, 2 replies; 30+ messages in thread
From: Chen, Kenneth W @ 2005-07-28 23:08 UTC (permalink / raw)
  To: Ingo Molnar, 'Nick Piggin'; +Cc: linux-kernel, linux-ia64

What sort of workload needs SD_WAKE_AFFINE and SD_WAKE_BALANCE?
SD_WAKE_AFFINE is not useful in conjunction with interrupt binding.
In fact, it does more harm than good, causing detrimental process
migration and destroying process cache affinity, etc.  Also,
SD_WAKE_BALANCE is giving us performance grief with our industry
standard OLTP workload.

To demonstrate the problem, we turned off these two flags in the cpu
sd domain and measured a stunning 2.15% performance gain!  Deleting
all the code in try_to_wake_up() pertaining to load balancing gives us
another 0.2% gain.

The wakeup path should be made simple: just put the woken task on
the runqueue of the CPU it previously ran on.  Simple and elegant.

I'm proposing we either delete these two flags or make them run time
configurable.

- Ken



--- linux-2.6.12/include/linux/topology.h.orig	2005-07-28 15:54:05.007399685 -0700
+++ linux-2.6.12/include/linux/topology.h	2005-07-28 15:54:39.292555515 -0700
@@ -118,9 +118,7 @@
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
-				| SD_WAKE_AFFINE	\
-				| SD_WAKE_IDLE		\
-				| SD_WAKE_BALANCE,	\
+				| SD_WAKE_IDLE,		\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-28 23:08 Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Chen, Kenneth W
@ 2005-07-28 23:34 ` Nick Piggin
  2005-07-28 23:48   ` Chen, Kenneth W
  2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar
  1 sibling, 1 reply; 30+ messages in thread
From: Nick Piggin @ 2005-07-28 23:34 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Chen, Kenneth W wrote:
> What sort of workload needs SD_WAKE_AFFINE and SD_WAKE_BALANCE?
> SD_WAKE_AFFINE is not useful in conjunction with interrupt binding.
> In fact, it does more harm than good, causing detrimental process
> migration and destroying process cache affinity, etc.  Also,
> SD_WAKE_BALANCE is giving us performance grief with our industry
> standard OLTP workload.
> 

The periodic load balancer basically makes completely undirected,
random choices when picking which tasks to move where.

Wake balancing provides an opportunity to provide some input bias
into the load balancer.

For example, if you started 100 pairs of tasks which communicate
through a pipe. On a 2 CPU system without wake balancing, probably
half of the pairs will be on different CPUs. With wake balancing,
it should be much better.

I've also been told that it improves IO efficiency significantly -
obviously that depends on the system and workload.

> To demonstrate the problem, we turned off these two flags in the cpu
> sd domain and measured a stunning 2.15% performance gain!  Deleting
> all the code in try_to_wake_up() pertaining to load balancing gives us
> another 0.2% gain.
> 
> The wakeup path should be made simple: just put the woken task on
> the runqueue of the CPU it previously ran on.  Simple and elegant.
> 
> I'm proposing we either delete these two flags or make them run time
> configurable.
> 

There have been lots of changes since 2.6.12. Including less aggressive
wake balancing.

I hear you might be having problems with recent 2.6.13 kernels? If so,
it would be really good to have a look at that before 2.6.13 goes out the
door.

I appreciate all the effort you're putting into this!

Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-28 23:34 ` Nick Piggin
@ 2005-07-28 23:48   ` Chen, Kenneth W
  2005-07-29  1:25     ` Nick Piggin
  0 siblings, 1 reply; 30+ messages in thread
From: Chen, Kenneth W @ 2005-07-28 23:48 UTC (permalink / raw)
  To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Nick Piggin wrote on Thursday, July 28, 2005 4:35 PM
> Wake balancing provides an opportunity to provide some input bias
> into the load balancer.
> 
> For example, if you started 100 pairs of tasks which communicate
> through a pipe. On a 2 CPU system without wake balancing, probably
> half of the pairs will be on different CPUs. With wake balancing,
> it should be much better.

Shouldn't the pipe code use synchronous wakeup?


> I hear you might be having problems with recent 2.6.13 kernels? If so,
> it would be really good to have a look at that before 2.6.13 goes out the
> door.

Yes I do :-(, apparently bumping up cache_hot_time won't give us the
performance boost we used to see.

- Ken


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-28 23:48   ` Chen, Kenneth W
@ 2005-07-29  1:25     ` Nick Piggin
  2005-07-29  1:39       ` Chen, Kenneth W
  0 siblings, 1 reply; 30+ messages in thread
From: Nick Piggin @ 2005-07-29  1:25 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Chen, Kenneth W wrote:

>Nick Piggin wrote on Thursday, July 28, 2005 4:35 PM
>
>>Wake balancing provides an opportunity to provide some input bias
>>into the load balancer.
>>
>>For example, if you started 100 pairs of tasks which communicate
>>through a pipe. On a 2 CPU system without wake balancing, probably
>>half of the pairs will be on different CPUs. With wake balancing,
>>it should be much better.
>>
>
>Shouldn't the pipe code use synchronous wakeup?
>
>

Well, pipes are just an example. It could be any type of communication.
What's more, even the synchronous wakeup uses the wake balancing path
(although that could be modified to only do wake balancing for synch
wakeups, I'd have to be convinced we should special case pipes and not
eg. semaphores or AF_UNIX sockets).

>
>>I hear you might be having problems with recent 2.6.13 kernels? If so,
>>it would be really good to have a look that before 2.6.13 goes out the
>>door.
>>
>
>Yes I do :-(, apparently bumping up cache_hot_time won't give us the
>performance boost we used to see.
>
>
OK there are probably a number of things we can explore depending on
what are the symptoms (eg. excessive idle time, bad cache performance).

Unfortunately it is kind of difficult to tune 2.6.13 on the basis of
2.6.12 results - although that's not to say it won't indicate a good
avenue to investigate.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  1:25     ` Nick Piggin
@ 2005-07-29  1:39       ` Chen, Kenneth W
  2005-07-29  1:46         ` Nick Piggin
  0 siblings, 1 reply; 30+ messages in thread
From: Chen, Kenneth W @ 2005-07-29  1:39 UTC (permalink / raw)
  To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Nick Piggin wrote on Thursday, July 28, 2005 6:25 PM
> Well pipes are just an example. It could be any type of communication.
> What's more, even the synchronous wakeup uses the wake balancing path
> (although that could be modified to only do wake balancing for synch
> wakeups, I'd have to be convinced we should special case pipes and not
> eg. semaphores or AF_UNIX sockets).


Why is the normal load balance path not enough (or not able to do the
right thing)?  rebalance_tick and idle_balance ought to be enough to take
care of the imbalance.  What makes load balancing in the wakeup path so special?

Oh, I'd like to hear your opinion on what to do with these two flags:
make them runtime configurable? (I'm of the opinion that we should
delete them altogether.)

- Ken


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  1:39       ` Chen, Kenneth W
@ 2005-07-29  1:46         ` Nick Piggin
  2005-07-29  1:53           ` Chen, Kenneth W
  0 siblings, 1 reply; 30+ messages in thread
From: Nick Piggin @ 2005-07-29  1:46 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Chen, Kenneth W wrote:

>Nick Piggin wrote on Thursday, July 28, 2005 6:25 PM
>
>>Well pipes are just an example. It could be any type of communication.
>>What's more, even the synchronous wakeup uses the wake balancing path
>>(although that could be modified to only do wake balancing for synch
>>wakeups, I'd have to be convinced we should special case pipes and not
>>eg. semaphores or AF_UNIX sockets).
>>
>
>
>Why is the normal load balance path not enough (or not able to do the
>right thing)?  rebalance_tick and idle_balance ought to be enough to take
>care of the imbalance.  What makes load balancing in the wakeup path so special?
>
>

Well the normal load balancing path treats all tasks the same, while
the wake path knows if a CPU is waking a remote task and can attempt
to maximise the number of local wakeups.

>Oh, I'd like to hear your opinion on what to do with these two flags:
>make them runtime configurable? (I'm of the opinion that we should
>delete them altogether.)
>
>

I'd like to try making them less aggressive first if possible.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  1:46         ` Nick Piggin
@ 2005-07-29  1:53           ` Chen, Kenneth W
  2005-07-29  2:01             ` Nick Piggin
  0 siblings, 1 reply; 30+ messages in thread
From: Chen, Kenneth W @ 2005-07-29  1:53 UTC (permalink / raw)
  To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Nick Piggin wrote on Thursday, July 28, 2005 6:46 PM
> I'd like to try making them less aggressive first if possible.

Well, that's exactly what I'm trying to do: make them not aggressive
at all by not performing any load balancing :-)  The workload gets maximum
benefit with zero aggressiveness.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  1:53           ` Chen, Kenneth W
@ 2005-07-29  2:01             ` Nick Piggin
  2005-07-29  6:27               ` Chen, Kenneth W
  0 siblings, 1 reply; 30+ messages in thread
From: Nick Piggin @ 2005-07-29  2:01 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Chen, Kenneth W wrote:

>Nick Piggin wrote on Thursday, July 28, 2005 6:46 PM
>
>>I'd like to try making them less aggressive first if possible.
>>
>
>Well, that's exactly what I'm trying to do: make them not aggressive
>at all by not performing any load balancing :-)  The workload gets maximum
>benefit with zero aggressiveness.
>
>

Unfortunately we can't forget about other workloads, and we're
trying to stay away from runtime tunables in the scheduler.

If we can get performance to within a couple of tenths of a percent
of the zero balancing case, then that would be preferable I think.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  2:01             ` Nick Piggin
@ 2005-07-29  6:27               ` Chen, Kenneth W
  2005-07-29  8:48                 ` Nick Piggin
  2005-07-29 11:48                 ` [patch] remove wake-balancing Ingo Molnar
  0 siblings, 2 replies; 30+ messages in thread
From: Chen, Kenneth W @ 2005-07-29  6:27 UTC (permalink / raw)
  To: 'Nick Piggin'; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Nick Piggin wrote on Thursday, July 28, 2005 7:01 PM
> Chen, Kenneth W wrote:
> >Well, that's exactly what I'm trying to do: make them not aggressive
> >at all by not performing any load balance :-)  The workload gets maximum
> >benefit with zero aggressiveness.
> 
> Unfortunately we can't forget about other workloads, and we're
> trying to stay away from runtime tunables in the scheduler.


This clearly outlines an issue with the implementation.  Optimizing for one
type of workload has a detrimental effect on another workload and vice versa.


> If we can get performance to within a couple of tenths of a percent
> of the zero balancing case, then that would be preferable I think.

I won't try to compromise between the two.  If you do so, we would end up
with two half-baked raw turkeys.  Making load balancing in the wakeup
path less aggressive would probably reduce performance for the type of
workload you quoted earlier, and for the db workload we don't want any of
it at all, not even the code to determine whether it should be balanced.

Do you have an example of the workload you mentioned earlier that depends
on SD_WAKE_BALANCE?  I would like to experiment with it so we can move
this forward instead of arguing on paper.

- Ken


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  6:27               ` Chen, Kenneth W
@ 2005-07-29  8:48                 ` Nick Piggin
  2005-07-29  8:53                   ` Ingo Molnar
                                     ` (2 more replies)
  2005-07-29 11:48                 ` [patch] remove wake-balancing Ingo Molnar
  1 sibling, 3 replies; 30+ messages in thread
From: Nick Piggin @ 2005-07-29  8:48 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: Ingo Molnar, linux-kernel, linux-ia64

Chen, Kenneth W wrote:
> Nick Piggin wrote on Thursday, July 28, 2005 7:01 PM

> This clearly outlines an issue with the implementation.  Optimizing for one
> type of workload has a detrimental effect on another workload and vice versa.
> 

Yep. That comes up fairly regularly when tuning the scheduler :(

> 
> I won't try to compromise between the two.  If you do so, we would end up
> with two half-baked raw turkeys.  Making load balancing in the wakeup
> path less aggressive would probably reduce performance for the type of
> workload you quoted earlier, and for the db workload we don't want any of
> it at all, not even the code to determine whether it should be balanced.
> 

Well, that remains to be seen. If it can be made _smarter_, then you
may not have to take such a big compromise.

But either way, there will have to be some compromise made. At the
very least you have to find some acceptable default.

> Do you have an example of the workload you mentioned earlier that depends
> on SD_WAKE_BALANCE?  I would like to experiment with it so we can move
> this forward instead of arguing on paper.
> 

Well, you can easily see suboptimal scheduling decisions on many
programs with lots of interprocess communication. For example, tbench
on a dual Xeon:

processes    1               2               3              4

2.6.13-rc4:  187, 183, 179   260, 259, 256   340, 320, 349  504, 496, 500
no wake-bal: 180, 180, 177   254, 254, 253   268, 270, 348  345, 290, 500

Numbers are MB/s, higher is better.

Networking or other IO workloads where processes are tightly coupled
to a specific adapter / interrupt source can also see pretty good
gains.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  8:48                 ` Nick Piggin
@ 2005-07-29  8:53                   ` Ingo Molnar
  2005-07-29  8:59                     ` Nick Piggin
  2005-07-29  9:07                   ` Ingo Molnar
  2005-07-29 16:40                   ` Ingo Molnar
  2 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29  8:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Well, you can easily see suboptimal scheduling decisions on many 
> programs with lots of interprocess communication. For example, tbench 
> on a dual Xeon:
> 
> processes    1               2               3              4
> 
> 2.6.13-rc4:  187, 183, 179   260, 259, 256   340, 320, 349  504, 496, 500
> no wake-bal: 180, 180, 177   254, 254, 253   268, 270, 348  345, 290, 500
> 
> Numbers are MB/s, higher is better.

what type of network was used - localhost or a real one?

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  8:53                   ` Ingo Molnar
@ 2005-07-29  8:59                     ` Nick Piggin
  2005-07-29  9:01                       ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Nick Piggin @ 2005-07-29  8:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>processes    1               2               3              4
>>
>>2.6.13-rc4:  187, 183, 179   260, 259, 256   340, 320, 349  504, 496, 500
>>no wake-bal: 180, 180, 177   254, 254, 253   268, 270, 348  345, 290, 500
>>
>>Numbers are MB/s, higher is better.
> 
> 
> what type of network was used - localhost or a real one?
> 

Localhost.  Yeah, it isn't a real-world test, but it does show the
erratic behaviour without wake affine.

I don't have a setup with multiple fast network adapters otherwise
I would have run a similar test using a real network.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  8:59                     ` Nick Piggin
@ 2005-07-29  9:01                       ` Ingo Molnar
  0 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29  9:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> >>processes    1               2               3              4
> >>
> >>2.6.13-rc4:  187, 183, 179   260, 259, 256   340, 320, 349  504, 496, 500
> >>no wake-bal: 180, 180, 177   254, 254, 253   268, 270, 348  345, 290, 500
> >>
> >>Numbers are MB/s, higher is better.
> >
> >
> >what type of network was used - localhost or a real one?
> >
> 
> Localhost.  Yeah, it isn't a real-world test, but it does show the 
> erratic behaviour without wake affine.

yeah - fine enough. (It's not representative of IO workloads, but it's 
representative of local IPC workloads; just wanted to know precisely 
which workload it is.)

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  8:48                 ` Nick Piggin
  2005-07-29  8:53                   ` Ingo Molnar
@ 2005-07-29  9:07                   ` Ingo Molnar
  2005-07-29 16:40                   ` Ingo Molnar
  2 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29  9:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Well, you can easily see suboptimal scheduling decisions on many 
> programs with lots of interprocess communication. For example, tbench 
> on a dual Xeon:
> 
> processes    1               2               3              4
> 
> 2.6.13-rc4:  187, 183, 179   260, 259, 256   340, 320, 349  504, 496, 500
> no wake-bal: 180, 180, 177   254, 254, 253   268, 270, 348  345, 290, 500
> 
> Numbers are MB/s, higher is better.

i cannot see any difference with/without wake-balancing in this 
workload, on a dual Xeon. Could you try the quick hack below and do:

	echo 1 > /proc/sys/kernel/panic # turn on wake-balancing
	echo 0 > /proc/sys/kernel/panic # turn off wake-balancing

does the runtime switching show any effects on the throughput numbers 
tbench is showing? I'm using dbench-3.03. (i only checked the status 
numbers, didn't do full runs)

(did you have SCHED_SMT enabled?)

	Ingo

 kernel/sched.c |    2 ++
 1 files changed, 2 insertions(+)

Index: linux-prefetch-task/kernel/sched.c
===================================================================
--- linux-prefetch-task.orig/kernel/sched.c
+++ linux-prefetch-task/kernel/sched.c
@@ -1155,6 +1155,8 @@ static int try_to_wake_up(task_t * p, un
 		goto out_activate;
 
 	new_cpu = cpu;
+	if (!panic_timeout)
+		goto out_set_cpu;
 
 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-28 23:08 Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Chen, Kenneth W
  2005-07-28 23:34 ` Nick Piggin
@ 2005-07-29 11:26 ` Ingo Molnar
  2005-07-29 17:30   ` Chen, Kenneth W
  1 sibling, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29 11:26 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Nick Piggin', linux-kernel, linux-ia64


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> To demonstrate the problem, we turned off these two flags in the cpu 
> sd domain and measured a stunning 2.15% performance gain!  And 
> deleting all the code in try_to_wake_up() pertaining to load 
> balancing gives us another 0.2% gain.

another thing: do you have an HT-capable ia64 CPU, and do you have 
CONFIG_SCHED_SMT turned on? If yes, could you try turning off 
SD_WAKE_IDLE too? i found it to bring further performance improvements 
in certain workloads.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch] remove wake-balancing
  2005-07-29  6:27               ` Chen, Kenneth W
  2005-07-29  8:48                 ` Nick Piggin
@ 2005-07-29 11:48                 ` Ingo Molnar
  2005-07-29 14:13                   ` [sched, patch] better wake-balancing Ingo Molnar
  1 sibling, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29 11:48 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> > If we can get performance to within a couple of tenths of a percent
> > of the zero balancing case, then that would be preferable I think.
> 
> I won't try to compromise between the two.  If you do so, we would end 
> up with two half-baked raw turkeys.  Making load balancing in the 
> wakeup path less aggressive would probably reduce performance for the 
> type of workload you quoted earlier, and for the db workload we don't 
> want any of it at all, not even the code to determine whether it 
> should be balanced.

i think we could try to get rid of wakeup-time balancing altogether.

these days pretty much the only times we can sensibly do 'fast' (as in 
immediate) migration are fork/clone and exec. Furthermore, the gained 
simplicity of the wakeup path is quite compelling too. (Originally, when 
i introduced the first variant of wakeup-time balancing eons ago, we 
didn't have anything like fork-time and exec-time balancing.)

i think we could try the patch below in -mm: it removes (non-)affine 
wakeup and passive wakeup-balancing, but keeps SD_WAKE_IDLE, which is 
needed for efficient SMT scheduling. I test-booted the patch on x86, and 
it should work on all architectures. (I have tested various local-IPC 
and non-IPC workloads and only found performance improvements - but i'm 
sure regressions exist too, and they need to be examined.)

	Ingo

------
remove wakeup-time balancing. It turns out exec-time and fork-time
balancing combined with periodic rebalancing ticks does a good enough
job.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 include/asm-i386/topology.h           |    3 -
 include/asm-ia64/topology.h           |    6 --
 include/asm-mips/mach-ip27/topology.h |    3 -
 include/asm-ppc64/topology.h          |    3 -
 include/asm-x86_64/topology.h         |    3 -
 include/linux/sched.h                 |    4 -
 include/linux/topology.h              |    4 -
 kernel/sched.c                        |   89 +++-------------------------------
 8 files changed, 16 insertions(+), 99 deletions(-)

Index: linux-prefetch-task/include/asm-i386/topology.h
===================================================================
--- linux-prefetch-task.orig/include/asm-i386/topology.h
+++ linux-prefetch-task/include/asm-i386/topology.h
@@ -81,8 +81,7 @@ static inline int node_to_first_cpu(int 
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_BALANCE,	\
+				| SD_BALANCE_FORK,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
Index: linux-prefetch-task/include/asm-ia64/topology.h
===================================================================
--- linux-prefetch-task.orig/include/asm-ia64/topology.h
+++ linux-prefetch-task/include/asm-ia64/topology.h
@@ -65,8 +65,7 @@ void build_cpu_to_node_map(void);
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_WAKE_AFFINE,	\
+				| SD_BALANCE_EXEC,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
@@ -91,8 +90,7 @@ void build_cpu_to_node_map(void);
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_BALANCE,	\
+				| SD_BALANCE_FORK,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 64,			\
 	.nr_balance_failed	= 0,			\
Index: linux-prefetch-task/include/asm-mips/mach-ip27/topology.h
===================================================================
--- linux-prefetch-task.orig/include/asm-mips/mach-ip27/topology.h
+++ linux-prefetch-task/include/asm-mips/mach-ip27/topology.h
@@ -28,8 +28,7 @@ extern unsigned char __node_distances[MA
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_EXEC	\
-				| SD_WAKE_BALANCE,	\
+				| SD_BALANCE_EXEC,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
Index: linux-prefetch-task/include/asm-ppc64/topology.h
===================================================================
--- linux-prefetch-task.orig/include/asm-ppc64/topology.h
+++ linux-prefetch-task/include/asm-ppc64/topology.h
@@ -52,8 +52,7 @@ static inline int node_to_first_cpu(int 
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
 				| SD_BALANCE_NEWIDLE	\
-				| SD_WAKE_IDLE		\
-				| SD_WAKE_BALANCE,	\
+				| SD_WAKE_IDLE,		\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
Index: linux-prefetch-task/include/asm-x86_64/topology.h
===================================================================
--- linux-prefetch-task.orig/include/asm-x86_64/topology.h
+++ linux-prefetch-task/include/asm-x86_64/topology.h
@@ -48,8 +48,7 @@ extern int __node_distance(int, int);
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_WAKE_BALANCE,	\
+				| SD_BALANCE_EXEC,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
Index: linux-prefetch-task/include/linux/sched.h
===================================================================
--- linux-prefetch-task.orig/include/linux/sched.h
+++ linux-prefetch-task/include/linux/sched.h
@@ -471,9 +471,7 @@ enum idle_type
 #define SD_BALANCE_EXEC		4	/* Balance on exec */
 #define SD_BALANCE_FORK		8	/* Balance on fork, clone */
 #define SD_WAKE_IDLE		16	/* Wake to idle CPU on task wakeup */
-#define SD_WAKE_AFFINE		32	/* Wake task to waking CPU */
-#define SD_WAKE_BALANCE		64	/* Perform balancing at task wakeup */
-#define SD_SHARE_CPUPOWER	128	/* Domain members share cpu power */
+#define SD_SHARE_CPUPOWER	32	/* Domain members share cpu power */
 
 struct sched_group {
 	struct sched_group *next;	/* Must be a circular list */
Index: linux-prefetch-task/include/linux/topology.h
===================================================================
--- linux-prefetch-task.orig/include/linux/topology.h
+++ linux-prefetch-task/include/linux/topology.h
@@ -97,7 +97,6 @@
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
-				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
 				| SD_SHARE_CPUPOWER,	\
 	.last_balance		= jiffies,		\
@@ -127,8 +126,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_WAKE_AFFINE,	\
+				| SD_BALANCE_EXEC,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
Index: linux-prefetch-task/kernel/sched.c
===================================================================
--- linux-prefetch-task.orig/kernel/sched.c
+++ linux-prefetch-task/kernel/sched.c
@@ -254,7 +254,6 @@ struct runqueue {
 
 	/* try_to_wake_up() stats */
 	unsigned long ttwu_cnt;
-	unsigned long ttwu_local;
 #endif
 };
 
@@ -373,7 +372,7 @@ static inline void task_rq_unlock(runque
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 12
+#define SCHEDSTAT_VERSION 13
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -390,11 +389,11 @@ static int show_schedstat(struct seq_fil
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+		    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
 		    cpu, rq->yld_both_empty,
 		    rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt,
 		    rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
-		    rq->ttwu_cnt, rq->ttwu_local,
+		    rq->ttwu_cnt,
 		    rq->rq_sched_info.cpu_time,
 		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
 
@@ -424,8 +423,7 @@ static int show_schedstat(struct seq_fil
 			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			    sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
 			    sd->sbe_cnt, sd->sbe_balanced, sd->sbe_pushed,
-			    sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed,
-			    sd->ttwu_wake_remote, sd->ttwu_move_affine, sd->ttwu_move_balance);
+			    sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed);
 		}
 		preempt_enable();
 #endif
@@ -1134,8 +1132,6 @@ static int try_to_wake_up(task_t * p, un
 	long old_state;
 	runqueue_t *rq;
 #ifdef CONFIG_SMP
-	unsigned long load, this_load;
-	struct sched_domain *sd, *this_sd = NULL;
 	int new_cpu;
 #endif
 
@@ -1154,77 +1150,13 @@ static int try_to_wake_up(task_t * p, un
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
-	new_cpu = cpu;
-
 	schedstat_inc(rq, ttwu_cnt);
-	if (cpu == this_cpu) {
-		schedstat_inc(rq, ttwu_local);
-		goto out_set_cpu;
-	}
-
-	for_each_domain(this_cpu, sd) {
-		if (cpu_isset(cpu, sd->span)) {
-			schedstat_inc(sd, ttwu_wake_remote);
-			this_sd = sd;
-			break;
-		}
-	}
-
-	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
-		goto out_set_cpu;
 
 	/*
-	 * Check for affine wakeup and passive balancing possibilities.
+	 * Wake to the CPU the task was last running on (or any
+	 * nearby SMT-equivalent idle CPU):
 	 */
-	if (this_sd) {
-		int idx = this_sd->wake_idx;
-		unsigned int imbalance;
-
-		imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
-
-		load = source_load(cpu, idx);
-		this_load = target_load(this_cpu, idx);
-
-		new_cpu = this_cpu; /* Wake to this CPU if we can */
-
-		if (this_sd->flags & SD_WAKE_AFFINE) {
-			unsigned long tl = this_load;
-			/*
-			 * If sync wakeup then subtract the (maximum possible)
-			 * effect of the currently running task from the load
-			 * of the current CPU:
-			 */
-			if (sync)
-				tl -= SCHED_LOAD_SCALE;
-
-			if ((tl <= load &&
-				tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-				100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
-				/*
-				 * This domain has SD_WAKE_AFFINE and
-				 * p is cache cold in this domain, and
-				 * there is no bad imbalance.
-				 */
-				schedstat_inc(this_sd, ttwu_move_affine);
-				goto out_set_cpu;
-			}
-		}
-
-		/*
-		 * Start passive balancing when half the imbalance_pct
-		 * limit is reached.
-		 */
-		if (this_sd->flags & SD_WAKE_BALANCE) {
-			if (imbalance*this_load <= 100*load) {
-				schedstat_inc(this_sd, ttwu_move_balance);
-				goto out_set_cpu;
-			}
-		}
-	}
-
-	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
-out_set_cpu:
-	new_cpu = wake_idle(new_cpu, p);
+	new_cpu = wake_idle(cpu, p);
 	if (new_cpu != cpu) {
 		set_task_cpu(p, new_cpu);
 		task_rq_unlock(rq, &flags);
@@ -4758,9 +4690,7 @@ static int sd_degenerate(struct sched_do
 	}
 
 	/* Following flags don't use groups */
-	if (sd->flags & (SD_WAKE_IDLE |
-			 SD_WAKE_AFFINE |
-			 SD_WAKE_BALANCE))
+	if (sd->flags & SD_WAKE_IDLE)
 		return 0;
 
 	return 1;
@@ -4778,9 +4708,6 @@ static int sd_parent_degenerate(struct s
 		return 0;
 
 	/* Does parent contain flags not in child? */
-	/* WAKE_BALANCE is a subset of WAKE_AFFINE */
-	if (cflags & SD_WAKE_AFFINE)
-		pflags &= ~SD_WAKE_BALANCE;
 	/* Flags needing groups don't count if only 1 group in parent */
 	if (parent->groups == parent->groups->next) {
 		pflags &= ~(SD_LOAD_BALANCE |

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [sched, patch] better wake-balancing
  2005-07-29 11:48                 ` [patch] remove wake-balancing Ingo Molnar
@ 2005-07-29 14:13                   ` Ingo Molnar
  2005-07-29 15:02                     ` [sched, patch] better wake-balancing, #2 Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29 14:13 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton


another approach would be the patch below, to do wakeup-balancing only 
if the wakeup CPU or the task CPU is idle.

I've measured half-loaded tbench: unlike with total removal of
wakeup-balancing, it does not degrade with this patch applied, while fully
loaded tbench and other workloads clearly improve.

Ken, could you give this one a try? (It's against the current scheduler 
queue in -mm, but also applies fine to current Linus trees.)

	Ingo

---

do wakeup-balancing only if the wakeup-CPU or the task-CPU is idle.

this prevents excessive wakeup-balancing while the system is highly
loaded, but helps spread out the workload on partly idle systems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c |    6 ++++++
 1 files changed, 6 insertions(+)

Index: linux-sched-curr/kernel/sched.c
===================================================================
--- linux-sched-curr.orig/kernel/sched.c
+++ linux-sched-curr/kernel/sched.c
@@ -1252,7 +1252,13 @@ static int try_to_wake_up(task_t *p, uns
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
+	/*
+	 * If neither this CPU, nor the previous CPU the task was
+	 * running on is idle then skip wakeup-balancing:
+	 */
 	new_cpu = cpu;
+	if (!idle_cpu(this_cpu) && !idle_cpu(cpu))
+		goto out_set_cpu;
 
 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {


* [sched, patch] better wake-balancing, #2
  2005-07-29 14:13                   ` [sched, patch] better wake-balancing Ingo Molnar
@ 2005-07-29 15:02                     ` Ingo Molnar
  2005-07-29 16:21                       ` [sched, patch] better wake-balancing, #3 Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29 15:02 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> another approach would be the patch below, to do wakeup-balancing only 
> if the wakeup CPU or the task CPU is idle.

there's an even simpler way: only do wakeup-balancing if this_cpu is 
idle. (tbench results are still OK, and other workloads improved.)

	Ingo

--------
do wakeup-balancing only if the wakeup-CPU is idle.

this prevents excessive wakeup-balancing while the system is highly
loaded, but helps spread out the workload on partly idle systems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c |    6 ++++++
 1 files changed, 6 insertions(+)

Index: linux-sched-curr/kernel/sched.c
===================================================================
--- linux-sched-curr.orig/kernel/sched.c
+++ linux-sched-curr/kernel/sched.c
@@ -1253,7 +1253,13 @@ static int try_to_wake_up(task_t *p, uns
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
+	/*
+	 * Only do wakeup-balancing (== potentially migrate the task)
+	 * if this CPU is idle:
+	 */
 	new_cpu = cpu;
+	if (!idle_cpu(this_cpu))
+		goto out_set_cpu;
 
 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {


* [sched, patch] better wake-balancing, #3
  2005-07-29 15:02                     ` [sched, patch] better wake-balancing, #2 Ingo Molnar
@ 2005-07-29 16:21                       ` Ingo Molnar
  2005-07-30  0:08                         ` Nick Piggin
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29 16:21 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Nick Piggin', linux-kernel, linux-ia64, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> there's an even simpler way: only do wakeup-balancing if this_cpu is 
> idle. (tbench results are still OK, and other workloads improved.)

here's an updated patch. It handles one more detail: on SCHED_SMT we 
should check the idleness of siblings too. Benchmark numbers still look 
good.

	Ingo

----
do wakeup-balancing only if the wakeup-CPU (or any of its siblings)
is idle.

this prevents excessive wakeup-balancing while the system is highly
loaded, but helps spread out the workload on partly idle systems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c |    6 ++++++
 1 files changed, 6 insertions(+)

Index: linux-sched-curr/kernel/sched.c
===================================================================
--- linux-sched-curr.orig/kernel/sched.c
+++ linux-sched-curr/kernel/sched.c
@@ -1253,7 +1253,13 @@ static int try_to_wake_up(task_t *p, uns
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
+	/*
+	 * Only do wakeup-balancing (== potentially migrate the task)
+	 * if this CPU (or any SMT sibling) is idle:
+	 */
 	new_cpu = cpu;
+	if (!idle_cpu(this_cpu) && this_cpu == wake_idle(this_cpu, p))
+		goto out_set_cpu;
 
 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {


* Re: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29  8:48                 ` Nick Piggin
  2005-07-29  8:53                   ` Ingo Molnar
  2005-07-29  9:07                   ` Ingo Molnar
@ 2005-07-29 16:40                   ` Ingo Molnar
  2 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2005-07-29 16:40 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Chen, Kenneth W wrote:
> >Nick Piggin wrote on Thursday, July 28, 2005 7:01 PM
> 
> >This clearly outlines an issue with the implementation.  Optimize for one
> >type of workload has detrimental effect on another workload and vice versa.
> >
> 
> Yep. That comes up fairly regularly when tuning the scheduler :(

in this particular case we can clearly separate the two workloads 
though: CPU-overload (Ken's benchmark) vs. half-load (3-task tbench). So 
by checking for migration target/source idleness we can have a hard 
separator for wakeup balancing. (whether it works out for both types of 
workloads remains to be seen)

	Ingo


* RE: Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags
  2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar
@ 2005-07-29 17:30   ` Chen, Kenneth W
  0 siblings, 0 replies; 30+ messages in thread
From: Chen, Kenneth W @ 2005-07-29 17:30 UTC (permalink / raw)
  To: 'Ingo Molnar'; +Cc: 'Nick Piggin', linux-kernel, linux-ia64

Ingo Molnar wrote on Friday, July 29, 2005 4:26 AM
> * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> > To demonstrate the problem, we turned off these two flags in the cpu 
> > sd domain and measured a stunning 2.15% performance gain!  And 
> > deleting all the code in the try_to_wake_up() pertain to load 
> > balancing gives us another 0.2% gain.
> 
> another thing: do you have a HT-capable ia64 CPU, and do you have 
> CONFIG_SCHED_SMT turned on? If yes then could you try to turn off 
> SD_WAKE_IDLE too, i found it to bring further performance improvements 
> in certain workloads.

The scheduler experiments done so far were on non-SMT CPUs (Madison processors).
We have another db setup with multi-thread capable ia64 CPUs (Montecito; to be
precise, it is SOEMT capable).  We are just about to do scheduler experiments
on that setup.




* Re: [sched, patch] better wake-balancing, #3
  2005-07-29 16:21                       ` [sched, patch] better wake-balancing, #3 Ingo Molnar
@ 2005-07-30  0:08                         ` Nick Piggin
  2005-07-30  7:19                           ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Nick Piggin @ 2005-07-30  0:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Chen, Kenneth W, linux-kernel, linux-ia64, Andrew Morton

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> 
>>there's an even simpler way: only do wakeup-balancing if this_cpu is 
>>idle. (tbench results are still OK, and other workloads improved.)
> 
> 
> here's an updated patch. It handles one more detail: on SCHED_SMT we 
> should check the idleness of siblings too. Benchmark numbers still look 
> good.
> 

Maybe. Ken hasn't measured the effect of wake balancing in
2.6.13, which is quite a lot different to that found in 2.6.12.

I don't really like having a hard cutoff like that - wake
balancing can be important for IO workloads, though I haven't
measured for a long time. In IPC workloads, the cache affinity
of local wakeups becomes less apparent when the runqueue has
lots of tasks on it, but the benefits of IO affinity will
generally remain, especially on NUMA systems.

fork/clone/exec/etc balancing really doesn't do anything to
capture this kind of relationship between tasks and between
tasks and IRQ sources. Without wake balancing we basically have
a completely random scattering of tasks.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: [sched, patch] better wake-balancing, #3
  2005-07-30  0:08                         ` Nick Piggin
@ 2005-07-30  7:19                           ` Ingo Molnar
  2005-07-31  1:15                             ` Nick Piggin
  2005-08-08 23:18                             ` Chen, Kenneth W
  0 siblings, 2 replies; 30+ messages in thread
From: Ingo Molnar @ 2005-07-30  7:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chen, Kenneth W, linux-kernel, linux-ia64, Andrew Morton,
	John Hawkes, Martin J. Bligh, Paul Jackson


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > here's an updated patch. It handles one more detail: on SCHED_SMT we 
> > should check the idleness of siblings too. Benchmark numbers still 
> > look good.
> 
> Maybe. Ken hasn't measured the effect of wake balancing in 2.6.13, 
> which is quite a lot different to that found in 2.6.12.
> 
> I don't really like having a hard cutoff like that -wake balancing can 
> be important for IO workloads, though I haven't measured for a long 
> time. [...]

well, i have measured it, and it was a win for just about everything 
that is not idle, and even for an IPC (SysV semaphores) half-idle 
workload i've measured a 3% gain. No performance loss in tbench either, 
which is clearly the most sensitive to affine/passive balancing. But i'd 
like to see what Ken's (and others') numbers are.

the hard cutoff also has the benefit that it allows us to potentially 
make wakeup migration _more_ agressive in the future. So instead of 
having to think about weakening it due to the tradeoffs present in e.g.  
Ken's workload, we can actually make it stronger.

> [...] In IPC workloads, the cache affinity of local wakeups becomes 
> less apparent when the runqueue gets lots of tasks on it, however 
> benefits of IO affinity will generally remain. Especially on NUMA 
> systems.

especially on NUMA, if the migration-target CPU (this_cpu) is not at 
least partially idle, i'd be quite uneasy to passive balance from 
another node. I suspect this needs numbers from Martin and John?

> fork/clone/exec/etc balancing really doesn't do anything to capture 
> this kind of relationship between tasks and between tasks and IRQ 
> sources. Without wake balancing we basically have a completely random 
> scattering of tasks.

Ken's workload is a heavy IO one with lots of IRQ sources. And precisely 
for such type of workloads usually the best tactic is to leave the task 
alone and queue it wherever it last ran.

whenever there's a strong (and exclusive) relationship between tasks and 
individual interrupt sources, explicit binding to CPUs/groups of CPUs is 
the best method. In any case, more measurements are needed.

	Ingo


* Re: [sched, patch] better wake-balancing, #2
@ 2005-07-30 23:26 Chuck Ebbert
  2005-07-31  4:35 ` Con Kolivas
  2005-07-31  6:29 ` Ingo Molnar
  0 siblings, 2 replies; 30+ messages in thread
From: Chuck Ebbert @ 2005-07-30 23:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chen, Kenneth W, Andrew Morton, Nick Piggin, linux-kernel,
	linux-ia64

On Fri, 29 Jul 2005 at 17:02:07 +0200, Ingo Molnar wrote:

> do wakeup-balancing only if the wakeup-CPU is idle.
>
> this prevents excessive wakeup-balancing while the system is highly
> loaded, but helps spread out the workload on partly idle systems.

I tested this with Volanomark on dual-processor PII Xeon -- the
results were very bad:

Before: 5863 messages per second

124169 schedule                                  64.1369
 64663 _spin_unlock_irqrestore                  4041.4375
  7949 tcp_clean_rtx_queue                        6.5370
  6787 net_rx_action                             24.9522
 
After: 5569 messages per second

139417 schedule                                  72.0129
 82169 _spin_unlock_irqrestore                  5135.5625
  9949 tcp_clean_rtx_queue                        8.1817
  7917 net_rx_action                             29.1066

__
Chuck


* Re: [sched, patch] better wake-balancing, #3
  2005-07-30  7:19                           ` Ingo Molnar
@ 2005-07-31  1:15                             ` Nick Piggin
  2005-08-01 17:13                               ` Siddha, Suresh B
  2005-08-08 23:18                             ` Chen, Kenneth W
  1 sibling, 1 reply; 30+ messages in thread
From: Nick Piggin @ 2005-07-31  1:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chen, Kenneth W, linux-kernel, linux-ia64, Andrew Morton,
	John Hawkes, Martin J. Bligh, Paul Jackson

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>>I don't really like having a hard cutoff like that -wake balancing can 
>>be important for IO workloads, though I haven't measured for a long 
>>time. [...]
> 
> 
> well, i have measured it, and it was a win for just about everything 

I meant: measured for IO workloads.

I had one group tell me their IO efficiency went up by several
*times* on a 16-way NUMA system after generalising the wake
balancing to interrupts as well.

> that is not idle, and even for an IPC (SysV semaphores) half-idle 
> workload i've measured a 3% gain. No performance loss in tbench either, 
> which is clearly the most sensitive to affine/passive balancing. But i'd 
> like to see what Ken's (and others') numbers are.
> 
> the hard cutoff also has the benefit that it allows us to potentially 
> make wakeup migration _more_ agressive in the future. So instead of 
> having to think about weakening it due to the tradeoffs present in e.g.  
> Ken's workload, we can actually make it stronger.
> 

That would make the behaviour change even more violent, which is
what I dislike. I would much prefer to have code that handles both
workloads without introducing sudden cutoff points in behaviour.

> 
> especially on NUMA, if the migration-target CPU (this_cpu) is not at 
> least partially idle, i'd be quite uneasy to passive balance from 
> another node. I suspect this needs numbers from Martin and John?
> 

Passive balancing cuts in only when an imbalance is becoming apparent.
If the queue gets more imbalanced, periodic balancing will cut in,
and that is much worse than wake balancing.

> 
>>fork/clone/exec/etc balancing really doesn't do anything to capture 
>>this kind of relationship between tasks and between tasks and IRQ 
>>sources. Without wake balancing we basically have a completely random 
>>scattering of tasks.
> 
> 
> Ken's workload is a heavy IO one with lots of IRQ sources. And precisely 
> for such type of workloads usually the best tactic is to leave the task 
> alone and queue it wherever it last ran.
> 

Yep, I agree the wake balancing code in 2.6.12 wasn't ideal. That's
why I changed it in 2.6.13 - precisely because it moved things around
too much. It probably still isn't ideal though.

> whenever there's a strong (and exclusive) relationship between tasks and 
> individual interrupt sources, explicit binding to CPUs/groups of CPUs is 
> the best method. In any case, more measurements are needed.
> 

Well, I wouldn't say it is always the best method. Especially not when
there is a big variation in the CPU consumption of the groups of tasks.
But anyway, even in the cases where it definitely is the best method,
we really should try to handle them properly without binding too.

I do agree that more measurements are needed :)

-- 
SUSE Labs, Novell Inc.



* Re: [sched, patch] better wake-balancing, #2
  2005-07-30 23:26 [sched, patch] better wake-balancing, #2 Chuck Ebbert
@ 2005-07-31  4:35 ` Con Kolivas
  2005-07-31  6:29 ` Ingo Molnar
  1 sibling, 0 replies; 30+ messages in thread
From: Con Kolivas @ 2005-07-31  4:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chuck Ebbert, Ingo Molnar, Chen, Kenneth W, Andrew Morton,
	Nick Piggin, linux-ia64

On Sun, 31 Jul 2005 09:26, Chuck Ebbert wrote:
> On Fri, 29 Jul 2005 at 17:02:07 +0200, Ingo Molnar wrote:
> > do wakeup-balancing only if the wakeup-CPU is idle.
> >
> > this prevents excessive wakeup-balancing while the system is highly
> > loaded, but helps spread out the workload on partly idle systems.
>
> I tested this with Volanomark on dual-processor PII Xeon -- the
> results were very bad:
>
> Before: 5863 messages per second

> After: 5569 messages per second

Can you check schedstats, or otherwise find out whether VolanoMark uses 
sched_yield()? When this benchmark last came up, it appeared that JVMs used 
no futexes and left locking to yielding. We really should find out whether 
that is the case before trying to optimise for this benchmark.

Cheers,
Con


* Re: [sched, patch] better wake-balancing, #2
  2005-07-30 23:26 [sched, patch] better wake-balancing, #2 Chuck Ebbert
  2005-07-31  4:35 ` Con Kolivas
@ 2005-07-31  6:29 ` Ingo Molnar
  1 sibling, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2005-07-31  6:29 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Chen, Kenneth W, Andrew Morton, Nick Piggin, linux-kernel,
	linux-ia64


* Chuck Ebbert <76306.1226@compuserve.com> wrote:

> On Fri, 29 Jul 2005 at 17:02:07 +0200, Ingo Molnar wrote:
> 
> > do wakeup-balancing only if the wakeup-CPU is idle.
> >
> > this prevents excessive wakeup-balancing while the system is highly
> > loaded, but helps spread out the workload on partly idle systems.
> 
> I tested this with Volanomark on dual-processor PII Xeon -- the 
> results were very bad:

which patch have you tested? The mail you replied to above is for patch
#2, while on SMT/HT boxes it's patch #3 that is the correct approach.

furthermore, which base kernel have you applied the patch to? Best would 
be to test the following kernels:

 2.6.13-rc4 + sched-rollup
 2.6.13-rc4 + sched-rollup + better-wake-balance-#3

the sched-rollup and the latest better-wake-balance patches can be found 
at:

  http://redhat.com/~mingo/scheduler-patches/

(sched-rollup is the current scheduler patch-queue in -mm. And if you 
have time, it would also be nice to have a 2.6.13-rc4 baseline for 
VolanoMark, and perhaps a 2.6.12 measurement too, so that we can see how 
things changed.)

	Ingo


* Re: [sched, patch] better wake-balancing, #2
@ 2005-07-31 13:35 Chuck Ebbert
  0 siblings, 0 replies; 30+ messages in thread
From: Chuck Ebbert @ 2005-07-31 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-ia64, linux-kernel, Nick Piggin, Andrew Morton,
	Chen, Kenneth W

On Sun, 31 Jul 2005 at 08:29:27 +0200, Ingo Molnar wrote:

> > I tested this with Volanomark on dual-processor PII Xeon -- the 
> > results were very bad:
>
> which patch have you tested? The mail you replied to above is for patch
> #2, while on SMT/HT boxes it's patch #3 that is the correct approach.

 Since my system is not HT, I used patch #2.

> furthermore, which base kernel have you applied the patch to?

 2.6.13-rc3

 Results for -rc4 follow, with the latest patchsets:

Volanomark results for 2.6.13-rc4
System: Dell Workstation 610 (i440GX)
2 x Pentium II Xeon 2MB cache 350MHz

[Sun Jul 31 06:58:31 EDT 2005] Test started.
Kernel: 2.6.13-rc4a #1 SMP
Patches: sched-rollup + better-wake-balance
test-1.log: Average throughput = 4905 messages per second
test-2.log: Average throughput = 5583 messages per second
test-3.log: Average throughput = 5624 messages per second
test-4.log: Average throughput = 5526 messages per second
test-1.log: Average throughput = 5584 messages per second
test-2.log: Average throughput = 5430 messages per second
test-3.log: Average throughput = 5263 messages per second
test-4.log: Average throughput = 5425 messages per second
[Sun Jul 31 07:17:02 EDT 2005] Test ended.
timestamp 76174
cpu0 0 0 6 6 78 17319 6612 5663 5236 6510 621 10707
domain0 3 144251 144099 138 3156 15 3 0 144099 84 82 2 18 0 0 0 82 6622 6546 66 631 10 1 0 6546 2 2 0 0 0 0 0 0 0 486 449 0
cpu1 0 0 0 0 50 7888 2914 1984 1498 2008 1576 4974
domain0 3 146154 146000 121 2226 33 17 0 146000 62 58 2 24 2 0 0 58 2926 2824 89 953 15 5 0 2824 0 0 0 0 0 0 0 0 0 427 407 0
version 12
timestamp 357112
cpu0 6903 32226 44787 3345829 56425 6652018 20466 14097 11629 162123 21377675 6631552
domain0 3 269777 267819 1353 194144 6129 602 0 267819 4041 2818 70 1305809 164558 75 0 2818 37869 15488 4978 14472687 1280712 1583 0 15488 2 2 0 0 0 0 0 0 0 2198 1290 0
cpu1 7764 33269 44559 3354402 57092 6697864 19910 12189 9991 155123 21297442 6677954
domain0 3 274148 272109 1433 180072 4066 441 0 272109 3981 2775 60 1259541 167938 99 0 2775 37372 14157 5752 14238933 1278568 1334 0 14157 0 0 0 0 0 0 0 0 0 2468 1438 0

[Sun Jul 31 07:33:09 EDT 2005] Test started.
Kernel: 2.6.13-rc4a #2 SMP
Patches: sched-rollup
test-1.log: Average throughput = 5112 messages per second
test-2.log: Average throughput = 5662 messages per second
test-3.log: Average throughput = 5809 messages per second
test-4.log: Average throughput = 5977 messages per second
test-1.log: Average throughput = 5976 messages per second
test-2.log: Average throughput = 6008 messages per second
test-3.log: Average throughput = 5855 messages per second
test-4.log: Average throughput = 6017 messages per second
[Sun Jul 31 07:51:00 EDT 2005] Test ended.
version 12
timestamp 4294911410
cpu0 0 0 0 0 56 6018 1969 3037 1846 4008 5751 4049
domain0 3 14739 14634 99 2000 8 4 0 14634 31 30 1 10 0 0 0 30 1971 1921 48 443 2 0 0 1921 2 1 1 0 0 0 0 0 0 1247 502 0
cpu1 0 0 0 0 40 5357 1792 2832 1583 1176 1568 3565
domain0 3 14867 14788 76 1520 3 1 0 14788 35 34 1 5 0 0 0 34 1797 1749 42 411 7 3 0 1749 0 0 0 0 0 0 0 0 0 1191 469 0
version 12
timestamp 212533
cpu0 10030 29290 30736 3372251 41704 6164156 19216 2636963 2026635 148591 23876540 6144940
domain0 3 138859 136778 1343 139015 3507 558 0 136778 3404 2644 49 704633 103491 32 0 2644 28623 15546 3670 4623415 467816 1395 0 15546 2 1 1 0 0 0 0 0 0 595792 264363 0
cpu1 4739 24137 31111 3387783 36850 6143087 12792 2610468 2014674 145416 24188585 6130295
domain0 3 139219 137155 1287 133527 3930 457 0 137155 3366 2714 46 569294 85214 46 0 2714 22259 8839 3952 4783355 487041 1084 0 8839 0 0 0 0 0 0 0 0 0 610328 262829 0

[Sun Jul 31 08:39:05 EDT 2005] Test started.
Kernel: 2.6.13-rc4a #3 SMP
Patches: none
test-1.log: Average throughput = 5243 messages per second
test-2.log: Average throughput = 5816 messages per second
test-3.log: Average throughput = 5886 messages per second
test-4.log: Average throughput = 6039 messages per second
test-1.log: Average throughput = 5911 messages per second
test-2.log: Average throughput = 5934 messages per second
test-3.log: Average throughput = 5928 messages per second
test-4.log: Average throughput = 6053 messages per second
[Sun Jul 31 08:56:52 EDT 2005] Test ended.
version 12
timestamp 4294911037
cpu0 0 0 0 0 44 5715 1877 2877 1656 1196 1427 3838
domain0 3 14886 14817 57 70 13 0 0 14817 29 29 0 0 0 0 0 29 1878 1866 11 12 1 0 0 1866 0 0 0 0 0 0 0 0 0 1168 551 0
cpu1 0 0 0 0 55 4498 1522 2269 1099 550 131 2976
domain0 3 15108 15066 37 42 5 0 0 15066 16 16 0 0 0 0 0 16 1523 1513 8 10 2 0 0 1513 2 2 0 0 0 0 0 0 0 1221 532 0
version 12
timestamp 211196
cpu0 1784 20283 27831 3378283 31894 6122681 18841 2586384 2080058 145291 24152181 6103840
domain0 3 138711 136342 402 22486 21019 25 0 136342 3689 3115 17 16077 15996 0 0 3115 21229 18330 511 21079 19823 19 0 18330 0 0 0 0 0 0 0 0 0 502182 219444 0
cpu1 10295 29015 28139 3391580 40743 6142466 18965 2589921 2087737 143278 24294054 6123501
domain0 3 140333 137972 378 22362 20974 11 0 137972 3623 3075 20 15786 15695 1 0 3075 21435 18517 447 19642 18459 48 0 18517 2 2 0 0 0 0 0 0 0 506326 221459 0

__
Chuck


* Re: [sched, patch] better wake-balancing, #3
  2005-07-31  1:15                             ` Nick Piggin
@ 2005-08-01 17:13                               ` Siddha, Suresh B
  0 siblings, 0 replies; 30+ messages in thread
From: Siddha, Suresh B @ 2005-08-01 17:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Chen, Kenneth W, linux-kernel, linux-ia64,
	Andrew Morton, John Hawkes, Martin J. Bligh, Paul Jackson

On Sun, Jul 31, 2005 at 11:15:16AM +1000, Nick Piggin wrote:
> Ingo Molnar wrote:
> > especially on NUMA, if the migration-target CPU (this_cpu) is not at 
> > least partially idle, i'd be quite uneasy to passive balance from 
> > another node. I suspect this needs numbers from Martin and John?
> 
> Passive balancing cuts in only when an imbalance is becoming apparent.
> If the queue gets more imbalanced, periodic balancing will cut in,
> and that is much worse than wake balancing.

Another point to note about the current wake balancing: the imbalance
calculation does not take the complete load of the sched group into account.
I think there might be scenarios where the current wake balancing actually
introduces imbalances that are then corrected by periodic balancing.

thanks,
suresh


* RE: [sched, patch] better wake-balancing, #3
  2005-07-30  7:19                           ` Ingo Molnar
  2005-07-31  1:15                             ` Nick Piggin
@ 2005-08-08 23:18                             ` Chen, Kenneth W
  1 sibling, 0 replies; 30+ messages in thread
From: Chen, Kenneth W @ 2005-08-08 23:18 UTC (permalink / raw)
  To: 'Ingo Molnar', Nick Piggin
  Cc: linux-kernel, linux-ia64, Andrew Morton, John Hawkes,
	Martin J. Bligh, Paul Jackson

Ingo Molnar wrote on Saturday, July 30, 2005 12:19 AM
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > > here's an updated patch. It handles one more detail: on SCHED_SMT we 
> > > should check the idleness of siblings too. Benchmark numbers still 
> > > look good.
> > 
> > Maybe. Ken hasn't measured the effect of wake balancing in 2.6.13, 
> > which is quite a lot different to that found in 2.6.12.
> > 
> > I don't really like having a hard cutoff like that -wake balancing can 
> > be important for IO workloads, though I haven't measured for a long 
> > time. [...]
> 
> well, i have measured it, and it was a win for just about everything 
> that is not idle, and even for an IPC (SysV semaphores) half-idle 
> workload i've measured a 3% gain. No performance loss in tbench either, 
> which is clearly the most sensitive to affine/passive balancing. But i'd 
> like to see what Ken's (and others') numbers are.
> 
> the hard cutoff also has the benefit that it allows us to potentially 
> make wakeup migration _more_ aggressive in the future. So instead of 
> having to think about weakening it due to the tradeoffs present in e.g.  
> Ken's workload, we can actually make it stronger.


Sorry it took us a while to get the experiment done on our large db setup.
This patch is as effective as turning off both SD_WAKE_BALANCE and
SD_WAKE_AFFINE (+2.2% on the db OLTP workload).  We like it a lot.

- Ken



end of thread, other threads:[~2005-08-08 23:19 UTC | newest]

Thread overview: 30+ messages
-- links below jump to the message on this page --
2005-07-28 23:08 Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Chen, Kenneth W
2005-07-28 23:34 ` Nick Piggin
2005-07-28 23:48   ` Chen, Kenneth W
2005-07-29  1:25     ` Nick Piggin
2005-07-29  1:39       ` Chen, Kenneth W
2005-07-29  1:46         ` Nick Piggin
2005-07-29  1:53           ` Chen, Kenneth W
2005-07-29  2:01             ` Nick Piggin
2005-07-29  6:27               ` Chen, Kenneth W
2005-07-29  8:48                 ` Nick Piggin
2005-07-29  8:53                   ` Ingo Molnar
2005-07-29  8:59                     ` Nick Piggin
2005-07-29  9:01                       ` Ingo Molnar
2005-07-29  9:07                   ` Ingo Molnar
2005-07-29 16:40                   ` Ingo Molnar
2005-07-29 11:48                 ` [patch] remove wake-balancing Ingo Molnar
2005-07-29 14:13                   ` [sched, patch] better wake-balancing Ingo Molnar
2005-07-29 15:02                     ` [sched, patch] better wake-balancing, #2 Ingo Molnar
2005-07-29 16:21                       ` [sched, patch] better wake-balancing, #3 Ingo Molnar
2005-07-30  0:08                         ` Nick Piggin
2005-07-30  7:19                           ` Ingo Molnar
2005-07-31  1:15                             ` Nick Piggin
2005-08-01 17:13                               ` Siddha, Suresh B
2005-08-08 23:18                             ` Chen, Kenneth W
2005-07-29 11:26 ` Delete scheduler SD_WAKE_AFFINE and SD_WAKE_BALANCE flags Ingo Molnar
2005-07-29 17:30   ` Chen, Kenneth W
  -- strict thread matches above, loose matches on Subject: below --
2005-07-30 23:26 [sched, patch] better wake-balancing, #2 Chuck Ebbert
2005-07-31  4:35 ` Con Kolivas
2005-07-31  6:29 ` Ingo Molnar
2005-07-31 13:35 Chuck Ebbert
