* better wake-balancing: respin
From: Chen, Kenneth W @ 2005-10-27  1:24 UTC
To: 'Ingo Molnar', Nick Piggin, Andrew Morton; +Cc: linux-kernel

Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2

It is neither in Linus's official tree, nor is it in -mm anymore.

I guess I missed the objection for dropping the patch, so I'm bringing
up this discussion again.  The wake-up path is a lot hotter on NUMA
systems running database benchmarks.  Even on a moderate 8P NUMA box,
__wake_up and try_to_wake_up show up as the #1 and #4 hottest kernel
functions, while on a comparable 4P SMP box these two functions are #5
and #9 respectively.

I think the situation will be worse in the wake-up path on a 32P NUMA
box.  I don't have any measurements on a 32P setup yet, because 8P NUMA
performance sucks at the moment and it is a blocker for us before we
proceed to any bigger setup.

Execution profile for the 8P NUMA box [1]:

     Symbol               Clockticks   Inst. Retired   L3 Misses
  #1 __wake_up               8.08%         1.88%         4.67%
  #2 finish_task_switch      7.53%        18.11%         5.82%
  #3 __make_request          6.87%         2.09%         4.35%
  #4 try_to_wake_up          5.57%         0.64%         3.10%

Execution profile for the 4P SMP box [2]:

     Symbol               Clockticks
  #5 __wake_up               3.57%
  #9 try_to_wake_up          2.38%

My question is: what was the reason this patch was dropped, and what
can we do to improve wake-up performance?  In my opinion, we should
simply put the task on the CPU it previously ran on and let
rebalance_tick and load_balance_newidle balance out the load.

- Ken

[1] 8 processors: 1.6 GHz Itanium 2, 9M L3, 256 GB memory
[2] 4 processors: 1.6 GHz Itanium 2, 9M L3, 128 GB memory
* Re: better wake-balancing: respin
From: Ingo Molnar @ 2005-10-28  7:40 UTC
To: Chen, Kenneth W; +Cc: Nick Piggin, Andrew Morton, linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
>
> It is neither in Linus's official tree, nor is it in -mm anymore.

it's sched-better-wake-balancing-3.patch, in 2.6.14-rc5-mm1.

	Ingo
* RE: better wake-balancing: respin
From: Chen, Kenneth W @ 2005-10-28 22:34 UTC
To: 'Ingo Molnar'; +Cc: Nick Piggin, Andrew Morton, linux-kernel

Ingo Molnar wrote on Friday, October 28, 2005 12:40 AM
> * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> > Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
> >
> > It is neither in Linus's official tree, nor is it in -mm anymore.
>
> it's sched-better-wake-balancing-3.patch, in 2.6.14-rc5-mm1.

Sorry for the noise.  I went through 2.6.14-rc5-mm1 a couple of times;
I can't believe how I could have missed it.  (I must have been
sleeping, or I don't know what I was doing at the time.)

- Ken
* Re: better wake-balancing: respin
From: Nick Piggin @ 2005-10-28  9:17 UTC
To: Chen, Kenneth W; +Cc: 'Ingo Molnar', Andrew Morton, linux-kernel

Chen, Kenneth W wrote:
> Once upon a time, this patch was in the -mm tree (2.6.13-mm1):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
>
> It is neither in Linus's official tree, nor is it in -mm anymore.
>
> I guess I missed the objection for dropping the patch. I'm bringing

My objection to the patch is that it seems to be designed just to
improve your TPC score - and I don't think we've seen results yet... or
did I miss that?

Also - by no means do I think improving TPC is wrong, but I think such
a patch may not be the right way to go.  It doesn't seem to solve your
problem well.

Now you may have one of two problems.  Well, it definitely looks like
you are taking a lot of cache misses in try_to_wake_up - however, this
won't be due to the load-balancing code, but rather from locking the
remote CPU's runqueue and touching its runqueue data, and cachelines in
the task_struct that had last been touched by the remote CPU.

In fact, if the balancing code in try_to_wake_up is working as it
should, it will result in fewer "remote wakeups", because tasks will be
moved to the same CPU that wakes them.  Schedstats can tell us a lot
about this, BTW.

The second problem you may have is that the balancing code is going
haywire and actually causing tasks to move around too much.  If this is
the case, then I really need to look at your workload (at least the
schedstats output) and try to get things working a bit better.
Knocking half its brains out with a hammer is just going to make it
perform poorly in more cases without fixing your underlying problem.
Well - you may have a third problem: that schedule and wake_up are
simply being called too often.  What's going on with your workload?
How many context switches?  What does the schedule profile look like?
(We should get a wake-up profile too.)  Basically, I'd like to see a
lot more information.

> up this discussion again.  The wake-up path is a lot hotter on NUMA
> systems running database benchmarks.  Even on a moderate 8P NUMA box,
> __wake_up and try_to_wake_up show up as the #1 and #4 hottest kernel
> functions, while on a comparable 4P SMP box these two functions are
> #5 and #9 respectively.

With all else being equal, an 8P box is going to see 133% more remote
wakeups than a 4P box, and each of those cacheline transfers is going
to have a higher latency.  The difference in numbers when moving from
4-way to 8-way isn't very surprising.

> I think the situation will be worse in the wake-up path on a 32P NUMA
> box.  I don't have any measurements on a 32P setup yet, because 8P
> NUMA performance sucks at the moment and it is a blocker for us
> before we proceed to any bigger setup.

That's a pity.  What do we suck in comparison to?  What's our ratio
between actual and expected performance?

Thanks,
Nick

--
SUSE Labs, Novell Inc.
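Nick's "133% more remote wakeups" figure can be sanity-checked with a toy model (my own illustration, not code from the thread): assume every CPU issues wakeups at the same rate and the woken task sits on a uniformly random CPU, so a wakeup is remote with probability (n - 1)/n and the system-wide remote-wakeup rate scales as n - 1.

```c
/*
 * Toy model for the "133% more remote wakeups" figure (illustrative
 * only, not code from the thread).  If each of n CPUs issues one
 * wakeup per tick and the woken task sits on a uniformly random CPU,
 * a wakeup is remote with probability (n - 1) / n, so the system-wide
 * remote-wakeup rate is n * (n - 1) / n = n - 1 per tick.
 */
static double remote_wakeup_rate(int ncpus)
{
    return (double)ncpus * (double)(ncpus - 1) / (double)ncpus;
}

/* Percentage increase in remote wakeups when growing the box:
 * from 4P to 8P this is (7 - 3) / 3, i.e. about 133%. */
static double remote_wakeup_increase_pct(int from_cpus, int to_cpus)
{
    double from = remote_wakeup_rate(from_cpus);
    double to = remote_wakeup_rate(to_cpus);

    return 100.0 * (to - from) / from;
}
```

Under these assumptions `remote_wakeup_increase_pct(4, 8)` comes out at roughly 133%, matching the figure quoted above.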
* Re: better wake-balancing: respin
From: Ingo Molnar @ 2005-10-28 10:08 UTC
To: Nick Piggin; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> My objection to the patch is that it seems to be designed just to
> improve your TPC score - and I don't think we've seen results yet...
> or did I miss that?
>
> Also - by no means do I think improving TPC is wrong, but I think
> such a patch may not be the right way to go.  It doesn't seem to
> solve your problem well.

Nick, the TPC workload is simple and has been described before: lots of
interrupts arriving on many CPUs, waking up tasks randomly, which do a
short amount of work and then go back to sleep again.  There is no
correlation between the CPU the interrupt arrives on and the CPU the
task gets woken up on.  There is no point in immediate balancing
either: the IRQs are well-balanced themselves, so there are no load
transients to take care of (except for idle CPUs, which my patch
handles), and the next wakeup for that task won't arrive on the same
CPU anyway.

in such a workload, my patch will clearly improve things, by not
bouncing tasks around wildly.

> Now you may have one of two problems.  Well, it definitely looks like
> you are taking a lot of cache misses in try_to_wake_up - however,
> this won't be due to the load-balancing code, but rather from locking
> the remote CPU's runqueue and touching its runqueue data, and
> cachelines in the task_struct that had last been touched by the
> remote CPU.
no, because you are not considering a fundamentally random workload
like TPC.  There is only a 1:8 chance of hitting the right CPU with the
interrupt, and there is no benefit from moving the task to the CPU it
got woken up from.  In fact, it hurts by doing pointless migrations.

my patch adds the rule that we only consider 'fast' migration when it
is provably beneficial: if the target CPU is idle.  Any other case will
have to go over the 'slow' migration paths.

> In fact, if the balancing code in try_to_wake_up is working as it
> should, it will result in fewer "remote wakeups", because tasks will
> be moved to the same CPU that wakes them.  Schedstats can tell us a
> lot about this, BTW.

wrong.  Even if the balancing code in try_to_wake_up is working as it
should, it can easily happen that moving a task is not worthwhile: if
there is little or no further relationship between the wakeup CPU and
the IRQ CPU, i.e. when the migration cost is larger than the
relationship win between the wakeup CPU and the IRQ CPU.

so for me the decision logic is simple: the balancing code is migrating
over-eagerly, and this simple and straightforward patch makes it less
eager for an important workload class.  You are welcome to suggest
other approaches, but simply saying "I don't like this" won't bring us
further, as the damage on TPC workloads is clearly demonstrated.  If
this patch hurts other workloads (and please demonstrate them instead
of calling my patch a hammer - the patch has been in -mm for many
months already), then simply provide the logic that will do the
balancing for those workloads only, without hurting this workload!

When we have to pick between two workloads (only one of which has been
identified at the moment!) that have to balance out against each other,
we will go towards the simpler solution (all other factors being
equal), i.e. in this case by not doing the balancing.  Migration is a
fundamentally intrusive act and should be done carefully.
If you can pull it off without hurting other workloads then fine, but
otherwise it needs refinement.  This rule is not a hard limit: we will
obviously make changes that hurt some rare workloads a bit while
helping other, more common workloads enormously.  That has not been
demonstrated for this case yet.

There is also a simplicity factor: not making a complex balancing
decision is obviously simpler, so we have a bias towards it - i.e. do
something complex and costly only if we can prove it is most likely a
win.

	Ingo
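Ingo's rule can be paraphrased as a two-way decision - the following is a sketch with made-up names, not the actual sched-better-wake-balancing patch: at wakeup time, pull the task to the waking CPU only when that CPU is idle and allowed; every other case leaves the task on its previous CPU and defers to the periodic balancer.

```c
/* Sketch of the wake-balancing rule described above (hypothetical
 * names; not the actual -mm patch).  "Fast" migration at wakeup time
 * happens only when the waking CPU is idle. */
enum wake_target { WAKE_PREV_CPU, WAKE_THIS_CPU };

static enum wake_target pick_wake_target(int this_cpu_nr_running,
                                         int this_cpu_allowed)
{
    /* Idle target: migration is provably cheap - no running task is
     * displaced and the woken task runs immediately. */
    if (this_cpu_nr_running == 0 && this_cpu_allowed)
        return WAKE_THIS_CPU;

    /* Busy target: keep the task where its cache state lives and let
     * rebalance_tick / load_balance_newidle fix any long-term skew. */
    return WAKE_PREV_CPU;
}
```

The design choice is that the only migration taken eagerly is the one whose benefit does not depend on guessing future wakeup patterns.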
* Re: better wake-balancing: respin
From: Nick Piggin @ 2005-10-28 11:03 UTC
To: Ingo Molnar; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> My objection to the patch is that it seems to be designed just to
>> improve your TPC score - and I don't think we've seen results
>> yet... or did I miss that?
>>
>> Also - by no means do I think improving TPC is wrong, but I think
>> such a patch may not be the right way to go.  It doesn't seem to
>> solve your problem well.
>
> Nick, the TPC workload is simple and has been described before: lots
> of interrupts arriving on many CPUs, waking up tasks randomly, which
> do a short amount of work and then go back to sleep again.  There is
> no correlation between the CPU the interrupt arrives on and the CPU
> the task gets woken up on.  There is no point in immediate balancing
> either: the IRQs are well-balanced themselves, so there are no load
> transients to take care of (except for idle CPUs, which my patch
> handles), and the next wakeup for that task won't arrive on the same
> CPU anyway.
>
> in such a workload, my patch will clearly improve things, by not
> bouncing tasks around wildly.

Ingo, I wasn't aware that tasks were bouncing around wildly; does your
patch improve things?  Then by definition it must penalise workloads
where the pairings are more predictable?

I would prefer to try fixing wake balancing before giving up and
turning it off for busy CPUs.

>> Now you may have one of two problems.
>> Well, it definitely looks like you are taking a lot of cache misses
>> in try_to_wake_up - however, this won't be due to the load-balancing
>> code, but rather from locking the remote CPU's runqueue and touching
>> its runqueue data, and cachelines in the task_struct that had last
>> been touched by the remote CPU.
>
> no, because you are not considering a fundamentally random workload
> like TPC.  There is only a 1:8 chance of hitting the right CPU with
> the interrupt, and there is no benefit from moving the task to the
> CPU it got woken up from.  In fact, it hurts by doing pointless
> migrations.

It doesn't always migrate, though.  That's the point of all the
heuristics.

> my patch adds the rule that we only consider 'fast' migration when
> it is provably beneficial: if the target CPU is idle.  Any other
> case will have to go over the 'slow' migration paths.

wrong.  There is no way you can "prove" that a migration is beneficial!

>> In fact, if the balancing code in try_to_wake_up is working as it
>> should, it will result in fewer "remote wakeups", because tasks
>> will be moved to the same CPU that wakes them.  Schedstats can tell
>> us a lot about this, BTW.
>
> wrong.  Even if the balancing code in try_to_wake_up is working as
> it should, it can easily happen that moving a task is not
> worthwhile: if there is little or no further relationship between
> the wakeup CPU and the IRQ CPU, i.e. when the migration cost is
> larger than the relationship win between the wakeup CPU and the IRQ
> CPU.
>
> so for me the decision logic is simple: the balancing code is
> migrating over-eagerly, and this simple and straightforward patch
> makes it less eager for an important workload class.  You are
> welcome to suggest other approaches, but simply saying "I don't like
> this" won't bring us further, as the damage on TPC workloads is
> clearly demonstrated.  If this patch hurts other workloads (and
> please

Ken mentioned it was worth 2%.
Not a bad improvement, but if our performance "sucks" then it sounds
like we need to look elsewhere.

> demonstrate them instead of calling my patch a hammer - the patch
> has been in -mm for many months already), then simply provide the
> logic that will do the balancing for those workloads only, without
> hurting this workload!

No doubt that if it is doing pointless migrations that your patch
prevents, then that will improve performance here.  However, I'd
rather try to fix the actual balancing code.

Without any form of wake balancing, a multiprocessor system will tend
to have a completely random distribution of tasks over CPUs over time.
I prefer to add a driver so it is not completely random for amenable
workloads.

--
SUSE Labs, Novell Inc.
* Re: better wake-balancing: respin
From: Ingo Molnar @ 2005-10-28 14:37 UTC
To: Nick Piggin; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Ingo, I wasn't aware that tasks were bouncing around wildly; does
> your patch improve things?  Then by definition it must penalise
> workloads where the pairings are more predictable?

for TPC, most of the non-to-idle migrations are 'wrong'.  So basically
any change that gets rid of extra migrations is a win.  This does not
mean that it is all bouncing madly.

> I would prefer to try fixing wake balancing before giving up and
> turning it off for busy CPUs.

agreed, and that was my suggestion: improve the heuristics so that they
do not hurt workloads where there is no natural pairing.

one possible way would be to do a task_hot() check in the passive
balancing code, and only migrate the task when it has been inactive for
a long time: that should be the case for most TPC wakeups.  (This
assumes an accurate cache-hot estimator, for which another patch
exists.)

> Without any form of wake balancing, a multiprocessor system will
> tend to have a completely random distribution of tasks over CPUs
> over time.  I prefer to add a driver so it is not completely random
> for amenable workloads.

but my patch does not do 'no form of wake balancing'.  It will do
non-load-related wake balancing if the target CPU is idle.  Arguably,
that can easily be 'never' under common workloads.

	Ingo
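The task_hot() gating Ingo suggests could look roughly like the following sketch. It assumes a working cache-hot estimator (the separate patch he mentions); the helper names and the nanosecond units are made up for illustration.

```c
/* Sketch of gating passive wake balancing on cache hotness (made-up
 * helper names; assumes the separate cache-hot-estimator patch).  A
 * task that has slept longer than the cache-hot threshold has little
 * warm state left on its previous CPU, so migrating it forfeits
 * almost nothing - which should cover most TPC wakeups. */
static int task_hot(unsigned long long now_ns,
                    unsigned long long last_ran_ns,
                    unsigned long long cache_hot_time_ns)
{
    return now_ns - last_ran_ns < cache_hot_time_ns;
}

static int may_passive_balance(unsigned long long now_ns,
                               unsigned long long last_ran_ns,
                               unsigned long long cache_hot_time_ns)
{
    /* Only migrate tasks that are cache-cold on their previous CPU. */
    return !task_hot(now_ns, last_ran_ns, cache_hot_time_ns);
}
```

With a 2.5 ms threshold, a task that slept 10 ms would be eligible for passive balancing, while one that slept 1 ms would stay put.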
* Re: better wake-balancing: respin
From: Nick Piggin @ 2005-10-28 15:04 UTC
To: Ingo Molnar; +Cc: Chen, Kenneth W, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2416 bytes --]

Ingo Molnar wrote:
> for TPC, most of the non-to-idle migrations are 'wrong'.

Sure, OK.

> So basically any change that gets rid of extra migrations is a win.
> This does not mean that it is all bouncing madly.

Yes... semantics aside, it is obviously something we want to eliminate
if possible.

> agreed, and that was my suggestion: improve the heuristics so that
> they do not hurt workloads where there is no natural pairing.
>
> one possible way would be to do a task_hot() check in the passive
> balancing code, and only migrate the task when it has been inactive
> for a long time: that should be the case for most TPC wakeups.
> (This assumes an accurate cache-hot estimator, for which another
> patch exists.)

That sounds like a reasonable avenue to investigate.  There was
something similar there before, but I found there was a sharp point
where runqueue lengths got just the right length for the performance
characteristics of some things to change.  That could have been made
worse by an inadequate cache-hot estimate, though.

One thing I will look into is whether we can use previous wakeup
patterns to check whether the pattern looks random, or affine to some
group of CPUs.
This could potentially allow us to have stronger affinity than we
currently do for the "good" workloads, while eliminating task movement
completely for the bad cases.

Something along the lines of the attached patch may be 'good enough',
though I'd like to try a few other things as well.

> but my patch does not do 'no form of wake balancing'.  It will do
> non-load-related wake balancing if the target CPU is idle.  Arguably,
> that can easily be 'never' under common workloads.

No, that's true - I didn't mean to imply that your patch would
completely eliminate this input.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

[-- Attachment #2: sched-less-affine.patch --]
[-- Type: text/plain, Size: 1723 bytes --]

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2005-10-29 00:52:31.000000000 +1000
+++ linux-2.6/include/linux/sched.h	2005-10-29 00:58:51.000000000 +1000
@@ -648,9 +648,12 @@ struct task_struct {
 
 	int lock_depth;		/* BKL lock depth */
 
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
+	int last_waker_cpu;	/* CPU that last woke this task up */
+#if defined(__ARCH_WANT_UNLOCKED_CTXSW)
 	int oncpu;
 #endif
+#endif
 	int prio, static_prio;
 	struct list_head run_list;
 	prio_array_t *array;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-10-29 00:52:34.000000000 +1000
+++ linux-2.6/kernel/sched.c	2005-10-29 01:00:53.000000000 +1000
@@ -1183,6 +1183,10 @@ static int try_to_wake_up(task_t *p, uns
 		}
 	}
 
+	if (p->last_waker_cpu != this_cpu)
+		goto out_set_cpu;
+
+
 	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
 		goto out_set_cpu;
@@ -1253,6 +1257,8 @@ out_set_cpu:
 		cpu = task_cpu(p);
 	}
 
+	p->last_waker_cpu = this_cpu;
+
 out_activate:
 #endif /* CONFIG_SMP */
 	if (old_state == TASK_UNINTERRUPTIBLE) {
@@ -1334,9 +1340,12 @@ void fastcall sched_fork(task_t *p, int
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->sched_info, 0, sizeof(p->sched_info));
 #endif
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
+	p->last_waker_cpu = cpu;
+#if defined(__ARCH_WANT_UNLOCKED_CTXSW)
 	p->oncpu = 0;
 #endif
+#endif
 #ifdef CONFIG_PREEMPT
 	/* Want to start with kernel preemption disabled. */
 	p->thread_info->preempt_count = 1;
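The heart of the attached patch can be restated in userspace terms - the following is a simplified sketch, not the patch itself: the affine-wakeup heuristics in try_to_wake_up are only consulted when the same CPU performs two wakeups of the task in a row, so a fully random wakeup pattern almost never triggers a pull toward the waker.

```c
/* Userspace restatement of the last_waker_cpu filter in the attached
 * patch (simplified sketch, not the kernel code).  The affinity
 * heuristics run only when the same CPU wakes the task twice in a
 * row; a random wakeup pattern therefore rarely passes the filter. */
struct task_sim {
    int last_waker_cpu;            /* initialised to the fork CPU */
};

static int consider_affine_wakeup(struct task_sim *p, int this_cpu)
{
    int repeated = (p->last_waker_cpu == this_cpu);

    p->last_waker_cpu = this_cpu;  /* remember for the next wakeup */
    return repeated;               /* 1: run the migration heuristics */
}
```

For a task woken by CPUs 2, 2, 3 in succession, only the second wakeup would consider an affine pull - exactly the "pattern looks affine" driver Nick describes, at the cost of one extra int per task.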