* better wake-balancing: respin
From: Chen, Kenneth W @ 2005-10-27  1:24 UTC
To: 'Ingo Molnar', Nick Piggin, Andrew Morton; +Cc: linux-kernel

Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2

It is neither in Linus's official tree, nor is it in -mm anymore.

I guess I missed the objection for dropping the patch, so I'm bringing
up this discussion again.  The wake-up path is a lot hotter on NUMA
systems running database benchmarks.  Even on a moderate 8P NUMA box,
__wake_up and try_to_wake_up show up as the #1 and #4 hottest kernel
functions, while on a comparable 4P SMP box these two functions are #5
and #9 respectively.

I think the situation will be worse in the wake-up path on a 32P NUMA
box.  I don't have any measurements on a 32P setup yet, because 8P NUMA
performance sucks at the moment and it is a blocker for us before we
proceed to any bigger setup.

Execution profile for the 8P NUMA box [1]:

     Symbol               Clockticks   Inst. Retired   L3 Misses
  #1 __wake_up               8.08%         1.88%         4.67%
  #2 finish_task_switch      7.53%        18.11%         5.82%
  #3 __make_request          6.87%         2.09%         4.35%
  #4 try_to_wake_up          5.57%         0.64%         3.10%

Execution profile for the 4P SMP box [2]:

     Symbol               Clockticks
  #5 __wake_up               3.57%
  #9 try_to_wake_up          2.38%

My question is: what was the reason this patch was dropped, and what
can we do to improve wake-up performance?  In my opinion, we should
simply put the task on the CPU it previously ran on and let
rebalance_tick and load_balance_newidle balance out the load.

- Ken

[1] 8 processors: 1.6 GHz Itanium 2, 9M L3, 256 GB memory
[2] 4 processors: 1.6 GHz Itanium 2, 9M L3, 128 GB memory
* Re: better wake-balancing: respin
From: Ingo Molnar @ 2005-10-28  7:40 UTC
To: Chen, Kenneth W; +Cc: Nick Piggin, Andrew Morton, linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
>
> It is neither in Linus's official tree, nor is it in -mm anymore.

it's sched-better-wake-balancing-3.patch, in 2.6.14-rc5-mm1.

	Ingo
* RE: better wake-balancing: respin
From: Chen, Kenneth W @ 2005-10-28 22:34 UTC
To: 'Ingo Molnar'; +Cc: Nick Piggin, Andrew Morton, linux-kernel

Ingo Molnar wrote on Friday, October 28, 2005 12:40 AM
> * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> > Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
> >
> > It is neither in Linus's official tree, nor is it in -mm anymore.
>
> it's sched-better-wake-balancing-3.patch, in 2.6.14-rc5-mm1.

Sorry for the noise.  I went through 2.6.14-rc5-mm1 a couple of times;
I can't believe how I could have missed it.  (I must have been
sleeping, or I don't know what I was doing at the time.)

- Ken
* Re: better wake-balancing: respin
From: Nick Piggin @ 2005-10-28  9:17 UTC
To: Chen, Kenneth W; +Cc: 'Ingo Molnar', Andrew Morton, linux-kernel

Chen, Kenneth W wrote:
> Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
>
> It is neither in Linus's official tree, nor is it in -mm anymore.
>
> I guess I missed the objection for dropping the patch. I'm bringing

My objection to the patch is that it seems to be designed just to
improve your TPC score - and I don't think we've seen results yet... or
did I miss that?

Also - by no means do I think improving TPC is wrong, but I think such
a patch may not be the right way to go.  It doesn't seem to solve your
problem well.

Now you may have one of two problems.  Well, it definitely looks like
you are taking a lot of cache misses in try_to_wake_up - however, this
won't be due to the load-balancing code, but rather from locking the
remote CPU's runqueue and touching its runqueue data, and cachelines in
the task_struct that had last been touched by the remote CPU.

In fact, if the balancing code in try_to_wake_up is working as it
should, it will result in fewer "remote wakeups", because tasks will be
moved to the same CPU that wakes them.  Schedstats can tell us a lot
about this, BTW.

The second problem you may have is that the balancing code is going
haywire and actually causing tasks to move around too much.  If this is
the case, then I really need to look at your workload (at least the
schedstats output) and try to get things working a bit better.
Knocking half its brains out with a hammer is just going to make it
perform poorly in more cases without fixing your underlying problem.
Well - you may have a third problem: that schedule and wake_up are
simply being called too often.  What's going on with your workload?
How many context switches?  What does the schedule profile look like?
(We should get a wake-up profile too.)  Basically, I'd like to see a
lot more information.

> up this discussion again.  The wake-up path is a lot hotter on NUMA
> systems running database benchmarks.  Even on a moderate 8P NUMA box,
> __wake_up and try_to_wake_up show up as the #1 and #4 hottest kernel
> functions, while on a comparable 4P SMP box these two functions are
> #5 and #9 respectively.

With all else being equal, an 8P box is going to see 133% more remote
wakeups than a 4P box, and each of those cacheline transfers is going
to have a higher latency.  The difference in numbers when moving from
4-way to 8-way isn't very surprising.

> I think the situation will be worse in the wake-up path on a 32P NUMA
> box.  I don't have any measurements on a 32P setup yet, because 8P
> NUMA performance sucks at the moment and it is a blocker for us
> before we proceed to any bigger setup.

That's a pity.  What do we suck in comparison to?  What's our ratio
between actual and expected performance?

Thanks,
Nick

--
SUSE Labs, Novell Inc.
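Nick's "133% more remote wakeups" figure can be sanity-checked with a toy model (my own illustration, not code from the thread): assume every CPU issues wakeups at the same rate and the woken task sits on a uniformly random CPU, so a wakeup is remote with probability (n - 1)/n and the system-wide remote-wakeup rate scales as n - 1.

```c
/*
 * Toy model for the "133% more remote wakeups" figure (illustrative
 * only, not code from the thread).  If each of n CPUs issues one
 * wakeup per tick and the woken task sits on a uniformly random CPU,
 * a wakeup is remote with probability (n - 1) / n, so the system-wide
 * remote-wakeup rate is n * (n - 1) / n = n - 1 per tick.
 */
static double remote_wakeup_rate(int ncpus)
{
    return (double)ncpus * (double)(ncpus - 1) / (double)ncpus;
}

/* Percentage increase in remote wakeups when growing the box:
 * from 4P to 8P this is (7 - 3) / 3, i.e. about 133%. */
static double remote_wakeup_increase_pct(int from_cpus, int to_cpus)
{
    double from = remote_wakeup_rate(from_cpus);
    double to = remote_wakeup_rate(to_cpus);

    return 100.0 * (to - from) / from;
}
```

Under these assumptions `remote_wakeup_increase_pct(4, 8)` comes out at roughly 133%, matching the figure quoted above.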
* Re: better wake-balancing: respin
From: Ingo Molnar @ 2005-10-28 10:08 UTC
To: Nick Piggin; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> My objection to the patch is that it seems to be designed just to
> improve your TPC score - and I don't think we've seen results yet...
> or did I miss that?
>
> Also - by no means do I think improving TPC is wrong, but I think
> such a patch may not be the right way to go.  It doesn't seem to
> solve your problem well.

Nick, the TPC workload is simple and has been described before: lots of
interrupts arriving on many CPUs, waking up tasks randomly, which do a
short amount of work and then go back to sleep again.  There is no
correlation between the CPU the interrupt arrives on and the CPU the
task gets woken up on.  There is no point in immediate balancing
either: the IRQs are well-balanced themselves, so there are no load
transients to take care of (except for idle CPUs, which my patch
handles), and the next wakeup for that task won't arrive on the same
CPU anyway.

in such a workload, my patch will clearly improve things, by not
bouncing tasks around wildly.

> Now you may have one of two problems.  Well, it definitely looks like
> you are taking a lot of cache misses in try_to_wake_up - however,
> this won't be due to the load-balancing code, but rather from locking
> the remote CPU's runqueue and touching its runqueue data, and
> cachelines in the task_struct that had last been touched by the
> remote CPU.
no, because you are not considering a fundamentally random workload
like TPC.  There is only a 1:8 chance of hitting the right CPU with the
interrupt, and there is no benefit from moving the task to the CPU it
got woken up from.  In fact, it hurts by doing pointless migrations.

my patch adds the rule that we only consider 'fast' migration when it
is provably beneficial: if the target CPU is idle.  Any other case will
have to go over the 'slow' migration paths.

> In fact, if the balancing code in try_to_wake_up is working as it
> should, it will result in fewer "remote wakeups", because tasks will
> be moved to the same CPU that wakes them.  Schedstats can tell us a
> lot about this, BTW.

wrong.  Even if the balancing code in try_to_wake_up is working as it
should, it can easily happen that moving a task is not worthwhile: if
there is little or no further relationship between the wakeup CPU and
the IRQ CPU, i.e. when the migration cost is larger than the
relationship win between the wakeup CPU and the IRQ CPU.

so for me the decision logic is simple: the balancing code is migrating
over-eagerly, and this simple and straightforward patch makes it less
eager for an important workload class.  You are welcome to suggest
other approaches, but simply saying "I don't like this" won't bring us
further, as the damage on TPC workloads is clearly demonstrated.  If
this patch hurts other workloads (and please demonstrate them instead
of calling my patch a hammer - the patch has been in -mm for many
months already), then simply provide the logic that will do the
balancing for those workloads only, without hurting this workload!

When we have to pick between two workloads (only one of which has been
identified at the moment!) that have to balance out against each other,
we will go towards the simpler solution (all other factors being
equal), i.e. in this case by not doing the balancing.  Migration is a
fundamentally intrusive act and should be done carefully.
If you can pull it off without hurting other workloads then fine, but
otherwise it needs refinement.  This rule is not a hard limit: we will
obviously make changes that hurt some rare workloads a bit while
helping other, more common workloads enormously.  That has not been
demonstrated for this case yet.

There is also a simplicity factor: not making a complex balancing
decision is obviously simpler, so we have a bias towards it - i.e. do
something complex and costly only if we can prove it is most likely a
win.

	Ingo
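Ingo's rule can be paraphrased as a two-way decision - the following is a sketch with made-up names, not the actual sched-better-wake-balancing patch: at wakeup time, pull the task to the waking CPU only when that CPU is idle and allowed; every other case leaves the task on its previous CPU and defers to the periodic balancer.

```c
/* Sketch of the wake-balancing rule described above (hypothetical
 * names; not the actual -mm patch).  "Fast" migration at wakeup time
 * happens only when the waking CPU is idle. */
enum wake_target { WAKE_PREV_CPU, WAKE_THIS_CPU };

static enum wake_target pick_wake_target(int this_cpu_nr_running,
                                         int this_cpu_allowed)
{
    /* Idle target: migration is provably cheap - no running task is
     * displaced and the woken task runs immediately. */
    if (this_cpu_nr_running == 0 && this_cpu_allowed)
        return WAKE_THIS_CPU;

    /* Busy target: keep the task where its cache state lives and let
     * rebalance_tick / load_balance_newidle fix any long-term skew. */
    return WAKE_PREV_CPU;
}
```

The design choice is that the only migration taken eagerly is the one whose benefit does not depend on guessing future wakeup patterns.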
* Re: better wake-balancing: respin
From: Nick Piggin @ 2005-10-28 11:03 UTC
To: Ingo Molnar; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> My objection to the patch is that it seems to be designed just to
>> improve your TPC score - and I don't think we've seen results
>> yet... or did I miss that?
>>
>> Also - by no means do I think improving TPC is wrong, but I think
>> such a patch may not be the right way to go.  It doesn't seem to
>> solve your problem well.
>
> Nick, the TPC workload is simple and has been described before: lots
> of interrupts arriving on many CPUs, waking up tasks randomly, which
> do a short amount of work and then go back to sleep again.  There is
> no correlation between the CPU the interrupt arrives on and the CPU
> the task gets woken up on.  There is no point in immediate balancing
> either: the IRQs are well-balanced themselves, so there are no load
> transients to take care of (except for idle CPUs, which my patch
> handles), and the next wakeup for that task won't arrive on the same
> CPU anyway.
>
> in such a workload, my patch will clearly improve things, by not
> bouncing tasks around wildly.

Ingo, I wasn't aware that tasks were bouncing around wildly; does your
patch improve things?  Then by definition it must penalise workloads
where the pairings are more predictable?

I would prefer to try fixing wake balancing before giving up and
turning it off for busy CPUs.

>> Now you may have one of two problems.
>> Well, it definitely looks like you are taking a lot of cache misses
>> in try_to_wake_up - however, this won't be due to the load-balancing
>> code, but rather from locking the remote CPU's runqueue and touching
>> its runqueue data, and cachelines in the task_struct that had last
>> been touched by the remote CPU.
>
> no, because you are not considering a fundamentally random workload
> like TPC.  There is only a 1:8 chance of hitting the right CPU with
> the interrupt, and there is no benefit from moving the task to the
> CPU it got woken up from.  In fact, it hurts by doing pointless
> migrations.

It doesn't always migrate, though.  That's the point of all the
heuristics.

> my patch adds the rule that we only consider 'fast' migration when
> it is provably beneficial: if the target CPU is idle.  Any other
> case will have to go over the 'slow' migration paths.

wrong.  There is no way you can "prove" that a migration is beneficial!

>> In fact, if the balancing code in try_to_wake_up is working as it
>> should, it will result in fewer "remote wakeups", because tasks
>> will be moved to the same CPU that wakes them.  Schedstats can tell
>> us a lot about this, BTW.
>
> wrong.  Even if the balancing code in try_to_wake_up is working as
> it should, it can easily happen that moving a task is not
> worthwhile: if there is little or no further relationship between
> the wakeup CPU and the IRQ CPU, i.e. when the migration cost is
> larger than the relationship win between the wakeup CPU and the IRQ
> CPU.
>
> so for me the decision logic is simple: the balancing code is
> migrating over-eagerly, and this simple and straightforward patch
> makes it less eager for an important workload class.  You are
> welcome to suggest other approaches, but simply saying "I don't like
> this" won't bring us further, as the damage on TPC workloads is
> clearly demonstrated.  If this patch hurts other workloads (and
> please

Ken mentioned it was worth 2%.
Not a bad improvement, but if our performance "sucks" then it sounds
like we need to look elsewhere.

> demonstrate them instead of calling my patch a hammer - the patch
> has been in -mm for many months already), then simply provide the
> logic that will do the balancing for those workloads only, without
> hurting this workload!

No doubt that if it is doing pointless migrations that your patch
prevents, then that will improve performance here.  However, I'd
rather try to fix the actual balancing code.

Without any form of wake balancing, a multiprocessor system will tend
to have a completely random distribution of tasks over CPUs over time.
I prefer to add a driver so it is not completely random for amenable
workloads.

--
SUSE Labs, Novell Inc.
* Re: better wake-balancing: respin
From: Ingo Molnar @ 2005-10-28 14:37 UTC
To: Nick Piggin; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Ingo, I wasn't aware that tasks were bouncing around wildly; does
> your patch improve things?  Then by definition it must penalise
> workloads where the pairings are more predictable?

for TPC, most of the non-to-idle migrations are 'wrong'.  So basically
any change that gets rid of extra migrations is a win.  This does not
mean that it is all bouncing madly.

> I would prefer to try fixing wake balancing before giving up and
> turning it off for busy CPUs.

agreed, and that was my suggestion: improve the heuristics so that they
do not hurt workloads where there is no natural pairing.

one possible way would be to do a task_hot() check in the passive
balancing code, and only migrate the task when it has been inactive for
a long time: that should be the case for most TPC wakeups.  (This
assumes an accurate cache-hot estimator, for which another patch
exists.)

> Without any form of wake balancing, a multiprocessor system will
> tend to have a completely random distribution of tasks over CPUs
> over time.  I prefer to add a driver so it is not completely random
> for amenable workloads.

but my patch does not do 'no form of wake balancing'.  It will do
non-load-related wake balancing if the target CPU is idle.  Arguably,
that can easily be 'never' under common workloads.

	Ingo
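The task_hot() gating Ingo suggests could look roughly like the following sketch. It assumes a working cache-hot estimator (the separate patch he mentions); the helper names and the nanosecond units are made up for illustration.

```c
/* Sketch of gating passive wake balancing on cache hotness (made-up
 * helper names; assumes the separate cache-hot-estimator patch).  A
 * task that has slept longer than the cache-hot threshold has little
 * warm state left on its previous CPU, so migrating it forfeits
 * almost nothing - which should cover most TPC wakeups. */
static int task_hot(unsigned long long now_ns,
                    unsigned long long last_ran_ns,
                    unsigned long long cache_hot_time_ns)
{
    return now_ns - last_ran_ns < cache_hot_time_ns;
}

static int may_passive_balance(unsigned long long now_ns,
                               unsigned long long last_ran_ns,
                               unsigned long long cache_hot_time_ns)
{
    /* Only migrate tasks that are cache-cold on their previous CPU. */
    return !task_hot(now_ns, last_ran_ns, cache_hot_time_ns);
}
```

With a 2.5 ms threshold, a task that slept 10 ms would be eligible for passive balancing, while one that slept 1 ms would stay put.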
* Re: better wake-balancing: respin
From: Nick Piggin @ 2005-10-28 15:04 UTC
To: Ingo Molnar; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2416 bytes --]

Ingo Molnar wrote:
> for TPC, most of the non-to-idle migrations are 'wrong'.

Sure, OK.

> So basically any change that gets rid of extra migrations is a win.
> This does not mean that it is all bouncing madly.

Yes... semantics aside, it is obviously something we want to eliminate
if possible.

> agreed, and that was my suggestion: improve the heuristics so that
> they do not hurt workloads where there is no natural pairing.
>
> one possible way would be to do a task_hot() check in the passive
> balancing code, and only migrate the task when it has been inactive
> for a long time: that should be the case for most TPC wakeups.
> (This assumes an accurate cache-hot estimator, for which another
> patch exists.)

That sounds like a reasonable avenue to investigate.  There was
something similar there before, but I found there was a sharp point
where runqueue lengths got just the right length for the performance
characteristics of some things to change.  That could have been made
worse by an inadequate cache-hot estimate, though.

One thing I will look into is whether we can use previous wakeup
patterns to check whether the pattern looks random, or affine to some
group of CPUs.
This could potentially allow us to have stronger affinity than we
currently do for the "good" workloads, while eliminating task movement
completely for the bad cases.

Something along the lines of the attached patch may be 'good enough',
though I'd like to try a few other things as well.

> but my patch does not do 'no form of wake balancing'.  It will do
> non-load-related wake balancing if the target CPU is idle.  Arguably,
> that can easily be 'never' under common workloads.

No, that's true - I didn't mean to imply that your patch would
completely eliminate this input.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

[-- Attachment #2: sched-less-affine.patch --]
[-- Type: text/plain, Size: 1723 bytes --]

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2005-10-29 00:52:31.000000000 +1000
+++ linux-2.6/include/linux/sched.h	2005-10-29 00:58:51.000000000 +1000
@@ -648,9 +648,12 @@ struct task_struct {
 
 	int lock_depth;		/* BKL lock depth */
 
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
+	int last_waker_cpu;	/* CPU that last woke this task up */
+#if defined(__ARCH_WANT_UNLOCKED_CTXSW)
 	int oncpu;
 #endif
+#endif
 	int prio, static_prio;
 	struct list_head run_list;
 	prio_array_t *array;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-10-29 00:52:34.000000000 +1000
+++ linux-2.6/kernel/sched.c	2005-10-29 01:00:53.000000000 +1000
@@ -1183,6 +1183,10 @@ static int try_to_wake_up(task_t *p, uns
 		}
 	}
 
+	if (p->last_waker_cpu != this_cpu)
+		goto out_set_cpu;
+
+
 	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
 		goto out_set_cpu;
@@ -1253,6 +1257,8 @@ out_set_cpu:
 		cpu = task_cpu(p);
 	}
 
+	p->last_waker_cpu = this_cpu;
+
 out_activate:
 #endif /* CONFIG_SMP */
 	if (old_state == TASK_UNINTERRUPTIBLE) {
@@ -1334,9 +1340,12 @@ void fastcall sched_fork(task_t *p, int
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->sched_info, 0, sizeof(p->sched_info));
 #endif
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
+	p->last_waker_cpu = cpu;
+#if defined(__ARCH_WANT_UNLOCKED_CTXSW)
 	p->oncpu = 0;
 #endif
+#endif
 #ifdef CONFIG_PREEMPT
 	/* Want to start with kernel preemption disabled. */
 	p->thread_info->preempt_count = 1;
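The heart of the attached patch can be restated in userspace terms - the following is a simplified sketch, not the patch itself: the affine-wakeup heuristics in try_to_wake_up are only consulted when the same CPU performs two wakeups of the task in a row, so a fully random wakeup pattern almost never triggers a pull toward the waker.

```c
/* Userspace restatement of the last_waker_cpu filter in the attached
 * patch (simplified sketch, not the kernel code).  The affinity
 * heuristics run only when the same CPU wakes the task twice in a
 * row; a random wakeup pattern therefore rarely passes the filter. */
struct task_sim {
    int last_waker_cpu;            /* initialised to the fork CPU */
};

static int consider_affine_wakeup(struct task_sim *p, int this_cpu)
{
    int repeated = (p->last_waker_cpu == this_cpu);

    p->last_waker_cpu = this_cpu;  /* remember for the next wakeup */
    return repeated;               /* 1: run the migration heuristics */
}
```

For a task woken by CPUs 2, 2, 3 in succession, only the second wakeup would consider an affine pull - exactly the "pattern looks affine" driver Nick describes, at the cost of one extra int per task.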