* [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
@ 2011-10-19 21:45 Tim Chen
2011-10-20 4:18 ` Eric Dumazet
2011-10-20 4:24 ` Andi Kleen
0 siblings, 2 replies; 15+ messages in thread
From: Tim Chen @ 2011-10-19 21:45 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, Andi Kleen, Suresh Siddha, Venki Pallipadi
Idle load balancing makes use of a global structure, nohz, to keep track
of the cpu doing the idle load balancing, the first and second busy
cpus, and the cpus that are idle. This leads to a scalability issue.
For workloads whose processes wake up and go to sleep often, the
load_balancer, first_pick_cpu, second_pick_cpu and idle_cpus_mask fields
in the nohz structure get updated very frequently. This causes heavy
cache-line bouncing and slows down the idle and wakeup paths on large
systems with many cores/sockets. This is evident from up to 41% of cpu
cycles being spent in select_nohz_load_balancer in a test workload I
ran. Putting these fields on their own cache lines mitigates the
problem.
The test workload has multiple pairs of processes. Within a pair, each
process receives and then sends a message back and forth to the other
process via a pipe connecting them. So at any one time, half of the
processes are active.
For 32 pairs of processes I measured a 37% increase in the rate of
context switching between the processes, and a 24% increase for 64
process pairs. The test was run on an 8-socket, 64-core NHM-EX system
with hyper-threading turned on.
Tim
Workload cpu cycle profile on vanilla kernel:
41.19% swapper [kernel.kallsyms] [k] select_nohz_load_balancer
- select_nohz_load_balancer
+ 54.91% tick_nohz_restart_sched_tick
+ 45.04% tick_nohz_stop_sched_tick
18.96% swapper [kernel.kallsyms] [k] mwait_idle_with_hints
3.50% swapper [kernel.kallsyms] [k] tick_nohz_restart_sched_tick
3.36% swapper [kernel.kallsyms] [k] tick_check_idle
2.96% swapper [kernel.kallsyms] [k] rcu_enter_nohz
2.40% swapper [kernel.kallsyms] [k] _raw_spin_lock
2.11% swapper [kernel.kallsyms] [k] tick_nohz_stop_sched_tick
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..26ea877 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3639,10 +3639,10 @@ static inline void init_sched_softirq_csd(struct call_single_data *csd)
* load balancing for all the idle CPUs.
*/
static struct {
- atomic_t load_balancer;
- atomic_t first_pick_cpu;
- atomic_t second_pick_cpu;
- cpumask_var_t idle_cpus_mask;
+ atomic_t load_balancer ____cacheline_aligned;
+ atomic_t first_pick_cpu ____cacheline_aligned;
+ atomic_t second_pick_cpu ____cacheline_aligned;
+ cpumask_var_t idle_cpus_mask ____cacheline_aligned;
cpumask_var_t grp_idle_mask;
unsigned long next_balance; /* in jiffy units */
} nohz ____cacheline_aligned;
^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-19 21:45 [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability Tim Chen
@ 2011-10-20  4:18 ` Eric Dumazet
  2011-10-20  5:57   ` Suresh Siddha
  2011-10-20 17:19   ` Tim Chen
  2011-10-20  4:24 ` Andi Kleen
  1 sibling, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-10-20  4:18 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Andi Kleen,
	Suresh Siddha, Venki Pallipadi

On Wednesday, 19 October 2011 at 14:45 -0700, Tim Chen wrote:
> Idle load balancing makes use of a global structure, nohz, to keep track
> of the cpu doing the idle load balancing, the first and second busy
> cpus, and the cpus that are idle. This leads to a scalability issue.
>
> For workloads whose processes wake up and go to sleep often, the
> load_balancer, first_pick_cpu, second_pick_cpu and idle_cpus_mask fields
> in the nohz structure get updated very frequently. This causes heavy
> cache-line bouncing and slows down the idle and wakeup paths on large
> systems with many cores/sockets. This is evident from up to 41% of cpu
> cycles being spent in select_nohz_load_balancer in a test workload I
> ran. Putting these fields on their own cache lines mitigates the
> problem.
>
> The test workload has multiple pairs of processes. Within a pair, each
> process receives and then sends a message back and forth to the other
> process via a pipe connecting them. So at any one time, half of the
> processes are active.
>
> For 32 pairs of processes I measured a 37% increase in the rate of
> context switching between the processes, and a 24% increase for 64
> process pairs. The test was run on an 8-socket, 64-core NHM-EX system
> with hyper-threading turned on.
>
> Tim
>
> Workload cpu cycle profile on vanilla kernel:
> 41.19% swapper [kernel.kallsyms] [k] select_nohz_load_balancer
>     - select_nohz_load_balancer
>       + 54.91% tick_nohz_restart_sched_tick
>       + 45.04% tick_nohz_stop_sched_tick
> 18.96% swapper [kernel.kallsyms] [k] mwait_idle_with_hints
>  3.50% swapper [kernel.kallsyms] [k] tick_nohz_restart_sched_tick
>  3.36% swapper [kernel.kallsyms] [k] tick_check_idle
>  2.96% swapper [kernel.kallsyms] [k] rcu_enter_nohz
>  2.40% swapper [kernel.kallsyms] [k] _raw_spin_lock
>  2.11% swapper [kernel.kallsyms] [k] tick_nohz_stop_sched_tick
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index bc8ee99..26ea877 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -3639,10 +3639,10 @@ static inline void init_sched_softirq_csd(struct call_single_data *csd)
>   * load balancing for all the idle CPUs.
>   */
>  static struct {
> -	atomic_t load_balancer;
> -	atomic_t first_pick_cpu;
> -	atomic_t second_pick_cpu;
> -	cpumask_var_t idle_cpus_mask;
> +	atomic_t load_balancer ____cacheline_aligned;
> +	atomic_t first_pick_cpu ____cacheline_aligned;
> +	atomic_t second_pick_cpu ____cacheline_aligned;
> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
>  	cpumask_var_t grp_idle_mask;
>  	unsigned long next_balance;	/* in jiffy units */
>  } nohz ____cacheline_aligned;

Don't you increase the cache footprint, say, on a uniprocessor machine
(CONFIG_SMP=n)?

____cacheline_aligned_in_smp seems more suitable in this case.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  4:18 ` Eric Dumazet
@ 2011-10-20  5:57   ` Suresh Siddha
  2011-10-20  6:43     ` Eric Dumazet
  2011-10-20 17:19   ` Tim Chen
  1 sibling, 1 reply; 15+ messages in thread
From: Suresh Siddha @ 2011-10-20  5:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tim Chen, Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org, Andi Kleen, Venki Pallipadi

On Wed, 2011-10-19 at 21:18 -0700, Eric Dumazet wrote:
> > +	atomic_t load_balancer ____cacheline_aligned;
> > +	atomic_t first_pick_cpu ____cacheline_aligned;
> > +	atomic_t second_pick_cpu ____cacheline_aligned;
> > +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> >  	cpumask_var_t grp_idle_mask;
> >  	unsigned long next_balance;	/* in jiffy units */
> >  } nohz ____cacheline_aligned;
> >
>
> Don't you increase the cache footprint, say, on a uniprocessor machine
> (CONFIG_SMP=n)?
>
> ____cacheline_aligned_in_smp seems more suitable in this case.

I believe this code is already under ifdef CONFIG_SMP.

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  5:57   ` Suresh Siddha
@ 2011-10-20  6:43     ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-10-20  6:43 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Tim Chen, Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org, Andi Kleen, Venki Pallipadi

On Wednesday, 19 October 2011 at 22:57 -0700, Suresh Siddha wrote:
> On Wed, 2011-10-19 at 21:18 -0700, Eric Dumazet wrote:
> > > +	atomic_t load_balancer ____cacheline_aligned;
> > > +	atomic_t first_pick_cpu ____cacheline_aligned;
> > > +	atomic_t second_pick_cpu ____cacheline_aligned;
> > > +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> > >  	cpumask_var_t grp_idle_mask;
> > >  	unsigned long next_balance;	/* in jiffy units */
> > >  } nohz ____cacheline_aligned;
> > >
> >
> > Don't you increase the cache footprint, say, on a uniprocessor machine
> > (CONFIG_SMP=n)?
> >
> > ____cacheline_aligned_in_smp seems more suitable in this case.
>
> I believe this code is already under ifdef CONFIG_SMP.

Right now, yes, but that might change in the future after some code
cleanup / consolidation. This kind of thing gets copy/pasted, and the
right semantics here are ____cacheline_aligned_in_smp. It also matches
the changelog better.

Note: the existing ____cacheline_aligned on the global "nohz" structure
was quite different, and in contradiction with what this patch wants to
do. Previously, we wanted to group all the fields on a single cache
line. We now know that was not a scalable strategy, thanks to Tim.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  4:18 ` Eric Dumazet
  2011-10-20  5:57   ` Suresh Siddha
@ 2011-10-20 17:19   ` Tim Chen
  1 sibling, 0 replies; 15+ messages in thread
From: Tim Chen @ 2011-10-20 17:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Andi Kleen,
	Suresh Siddha, Venki Pallipadi

On Thu, 2011-10-20 at 06:18 +0200, Eric Dumazet wrote:
> Don't you increase the cache footprint, say, on a uniprocessor machine
> (CONFIG_SMP=n)?
>
> ____cacheline_aligned_in_smp seems more suitable in this case.

Okay, using the SMP version of the cache-align attribute in this
updated patch.

Thanks.

------------

Idle load balancing makes use of a global structure, nohz, to keep track
of the cpu doing the idle load balancing, the first and second busy
cpus, and the cpus that are idle. This leads to a scalability issue.

For workloads whose processes wake up and go to sleep often, the
load_balancer, first_pick_cpu, second_pick_cpu and idle_cpus_mask fields
in the nohz structure get updated very frequently. This causes heavy
cache-line bouncing and slows down the idle and wakeup paths on large
systems with many cores/sockets. This is evident from up to 41% of cpu
cycles being spent in select_nohz_load_balancer in a test workload I
ran. Putting these fields on their own cache lines mitigates the
problem.

The test workload has multiple pairs of processes. Within a pair, each
process receives and then sends a message back and forth to the other
process via a pipe connecting them. So at any one time, half of the
processes are active.

For 32 pairs of processes I measured a 37% increase in the rate of
context switching between the processes, and a 24% increase for 64
process pairs. The test was run on an 8-socket, 64-core NHM-EX system
with hyper-threading turned on.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..4ae4b7d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3639,10 +3639,10 @@ static inline void init_sched_softirq_csd(struct call_single_data *csd)
  * load balancing for all the idle CPUs.
  */
 static struct {
-	atomic_t load_balancer;
-	atomic_t first_pick_cpu;
-	atomic_t second_pick_cpu;
-	cpumask_var_t idle_cpus_mask;
+	atomic_t load_balancer ____cacheline_aligned_in_smp;
+	atomic_t first_pick_cpu ____cacheline_aligned_in_smp;
+	atomic_t second_pick_cpu ____cacheline_aligned_in_smp;
+	cpumask_var_t idle_cpus_mask ____cacheline_aligned_in_smp;
 	cpumask_var_t grp_idle_mask;
 	unsigned long next_balance;	/* in jiffy units */
 } nohz ____cacheline_aligned;

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-19 21:45 [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability Tim Chen
  2011-10-20  4:18 ` Eric Dumazet
@ 2011-10-20  4:24 ` Andi Kleen
  2011-10-20 12:26   ` Venki Pallipadi
  1 sibling, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2011-10-20  4:24 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Suresh Siddha,
	Venki Pallipadi

Tim Chen <tim.c.chen@linux.intel.com> writes:
>  */
>  static struct {
> -	atomic_t load_balancer;
> -	atomic_t first_pick_cpu;
> -	atomic_t second_pick_cpu;
> -	cpumask_var_t idle_cpus_mask;
> +	atomic_t load_balancer ____cacheline_aligned;
> +	atomic_t first_pick_cpu ____cacheline_aligned;
> +	atomic_t second_pick_cpu ____cacheline_aligned;
> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;

On large configs idle_cpus_mask may be dynamically allocated. It may
need more changes to tell the allocator to cache align/pad that
allocation too?

-Andi

--
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  4:24 ` Andi Kleen
@ 2011-10-20 12:26   ` Venki Pallipadi
  2011-10-20 17:31     ` Suresh Siddha
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Venki Pallipadi @ 2011-10-20 12:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Tim Chen, Ingo Molnar, Peter Zijlstra, linux-kernel, Suresh Siddha

On Wed, Oct 19, 2011 at 9:24 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Tim Chen <tim.c.chen@linux.intel.com> writes:
>>  */
>>  static struct {
>> -	atomic_t load_balancer;
>> -	atomic_t first_pick_cpu;
>> -	atomic_t second_pick_cpu;
>> -	cpumask_var_t idle_cpus_mask;
>> +	atomic_t load_balancer ____cacheline_aligned;
>> +	atomic_t first_pick_cpu ____cacheline_aligned;
>> +	atomic_t second_pick_cpu ____cacheline_aligned;
>> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
>
> On large configs idle_cpus_mask may be dynamically allocated. It may
> need more changes to tell the allocator to cache align/pad that
> allocation too?

An alternate approach is to split this struct per node/socket and do
the nohz idle balancing logic at that level. That should be more
scalable in terms of nohz balancing (it ensures one CPU won't be doing
nohz balancing for a huge number of idle CPUs). I looked at that
approach a couple of years ago and couldn't measure much of a gain.
Maybe it is time to revisit it with increased core counts.

Thanks,
Venki

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20 12:26   ` Venki Pallipadi
@ 2011-10-20 17:31     ` Suresh Siddha
  0 siblings, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-10-20 17:31 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Andi Kleen, Tim Chen, Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org

On Thu, 2011-10-20 at 05:26 -0700, Venki Pallipadi wrote:
> On Wed, Oct 19, 2011 at 9:24 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Tim Chen <tim.c.chen@linux.intel.com> writes:
> >>  */
> >>  static struct {
> >> -	atomic_t load_balancer;
> >> -	atomic_t first_pick_cpu;
> >> -	atomic_t second_pick_cpu;
> >> -	cpumask_var_t idle_cpus_mask;
> >> +	atomic_t load_balancer ____cacheline_aligned;
> >> +	atomic_t first_pick_cpu ____cacheline_aligned;
> >> +	atomic_t second_pick_cpu ____cacheline_aligned;
> >> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> >
> > On large configs idle_cpus_mask may be dynamically allocated. It may
> > need more changes to tell the allocator to cache align/pad that
> > allocation too?
>
> An alternate approach is to split this struct per node/socket and do
> the nohz idle balancing logic at that level. That should be more
> scalable in terms of nohz balancing (it ensures one CPU won't be doing
> nohz balancing for a huge number of idle CPUs). I looked at that
> approach a couple of years ago and couldn't measure much of a gain.
> Maybe it is time to revisit it with increased core counts.

I am actually looking at it. This is a quick fix for the meantime.

Even with this cache alignment, the first_pick_cpu and second_pick_cpu
cacheline bouncing still causes an ~8% performance impact on this
synthetic lat_ctx-heavy workload. For this experiment, to measure the
potential scope we have here, we left the cacheline bouncing associated
with the nohz.idle_cpus_mask update in place.

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20 12:26   ` Venki Pallipadi
  2011-10-20 17:31     ` Suresh Siddha
@ 2011-10-20 17:38     ` Peter Zijlstra
  [not found]           ` <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2011-10-20 17:38 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Andi Kleen, Tim Chen, Ingo Molnar, linux-kernel, Suresh Siddha

On Thu, 2011-10-20 at 05:26 -0700, Venki Pallipadi wrote:
> On Wed, Oct 19, 2011 at 9:24 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Tim Chen <tim.c.chen@linux.intel.com> writes:
> >>  */
> >>  static struct {
> >> -	atomic_t load_balancer;
> >> -	atomic_t first_pick_cpu;
> >> -	atomic_t second_pick_cpu;
> >> -	cpumask_var_t idle_cpus_mask;
> >> +	atomic_t load_balancer ____cacheline_aligned;
> >> +	atomic_t first_pick_cpu ____cacheline_aligned;
> >> +	atomic_t second_pick_cpu ____cacheline_aligned;
> >> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> >
> > On large configs idle_cpus_mask may be dynamically allocated. It may
> > need more changes to tell the allocator to cache align/pad that
> > allocation too?
>
> An alternate approach is to split this struct per node/socket and do
> the nohz idle balancing logic at that level. That should be more
> scalable in terms of nohz balancing (it ensures one CPU won't be doing
> nohz balancing for a huge number of idle CPUs). I looked at that
> approach a couple of years ago and couldn't measure much of a gain.
> Maybe it is time to revisit it with increased core counts.

Yeah, that would be best, although I remember there was a problem with
your approach back then; a fully idle node would not balance at all, or
something like that.

^ permalink raw reply	[flat|nested] 15+ messages in thread
[parent not found: <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>]
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  [not found] ` <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>
@ 2011-11-01 23:52   ` Suresh Siddha
  2011-11-02 13:04     ` Srivatsa Vaddagiri
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-11-01 23:52 UTC (permalink / raw)
  To: Venki Pallipadi, Peter Zijlstra
  Cc: Andi Kleen, Tim Chen, Ingo Molnar, linux-kernel@vger.kernel.org

On Thu, 2011-10-20 at 10:31 -0700, Suresh Siddha wrote:
> On Thu, 2011-10-20 at 05:26 -0700, Venki Pallipadi wrote:
> > An alternate approach is to split this struct per node/socket and do
> > the nohz idle balancing logic at that level. That should be more
> > scalable in terms of nohz balancing (it ensures one CPU won't be doing
> > nohz balancing for a huge number of idle CPUs). I looked at that
> > approach a couple of years ago and couldn't measure much of a gain.
> > Maybe it is time to revisit it with increased core counts.

On Thu, 2011-10-20 at 10:38 -0700, Peter Zijlstra wrote:
> Yeah, that would be best, although I remember there was a problem with
> your approach back then; a fully idle node would not balance at all, or
> something like that.

That is correct. So to minimize the global updates I did two things:

1. Track the number of busy cpus in each scheduler group, for each
   scheduler domain.

2. Delay the update of nohz.idle_cpus_mask (and of the above-mentioned
   new sched-group-specific busy stat) when coming out of idle, to the
   first busy tick. And add idle_cpu() checks at a couple of places in
   the idle load balancing code to ensure that the cpu is really idle.

The kicked cpu will still do the idle load balancing on behalf of all
the cpus. We kick only when it is really needed.

If we add more smarts, then in the future we can split this into
multiple idle load balances, or kick the active load balance directly
for the HT/MC cases where we want to spread busy cores/threads. This is
open for the future, though.

> I am actually looking at it. This is a quick fix for the meantime.
>
> Even with this cache alignment, the first_pick_cpu and second_pick_cpu
> cacheline bouncing still causes an ~8% performance impact on this
> synthetic lat_ctx-heavy workload. For this experiment, to measure the
> potential scope we have here, we left the cacheline bouncing associated
> with the nohz.idle_cpus_mask update in place.

The appended patch helped improve that workload by 100%. We haven't
completed the testing of this patch but thought of getting quick
feedback. Also, I may have to split this into multiple patches (the
removal of nohz_stamp etc. is not related). Thanks.

---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: RFC: sched, nohz: sched group, domain aware nohz idle load balancing

Make nohz idle load balancing more scalable by making it aware of sched
groups/sched domains. Introduce nr_busy_cpus in struct
sched_group_power [not in sched_group, because sched groups are
duplicated for the SD_OVERLAP scheduler domain]; for each cpu that
enters and exits tickless mode, this parameter will be updated in each
scheduler group of the scheduler domain that the cpu belongs to.

To avoid frequent updates of this tickless state (the global nohz data
as well as the somewhat more local nr_busy_cpus in the sched_groups)
when the cpu enters and exits tickless idle often, the update of the
exit state is delayed to the first timer tick that happens after the
cpu becomes busy.
Idle load balance is kicked on one of the idle cpu's when there is atleast one idle cpu and - a busy rq having more than one task or - a busy scheduler group having multiple busy cpus that exceed the sched group power or - for the SD_ASYM_PACKING domain, if the lower numbered cpu's in that domain are idle compared to the busy ones. This will help in kicking the idle load balancing request only when there is a real imbalance. And once it is mostly balanced, these kicks will be minimized. These changes helped improve the workload that is context switch intensive between number of task pairs by 2x on a 8 socket NHM-EX based system. Reported-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> --- include/linux/sched.h | 4 + kernel/sched.c | 27 +++-- kernel/sched_fair.c | 264 ++++++++++++++++++------------------------------- 3 files changed, 116 insertions(+), 179 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 9c7a7a4..61e9980 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -901,6 +901,10 @@ struct sched_group_power { * single CPU. */ unsigned int power, power_orig; + /* + * Number of busy cpus in this group. + */ + atomic_t nr_busy_cpus; }; struct sched_group { diff --git a/kernel/sched.c b/kernel/sched.c index cf7f696..a8da213 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -600,8 +600,7 @@ struct rq { unsigned long cpu_load[CPU_LOAD_IDX_MAX]; unsigned long last_load_update_tick; #ifdef CONFIG_NO_HZ - u64 nohz_stamp; - unsigned char nohz_balance_kick; + unsigned int tick_stopped:1; #endif int skip_clock_update; @@ -1337,6 +1336,20 @@ static void resched_cpu(int cpu) } #ifdef CONFIG_NO_HZ +#define NOHZ_BALANCE_KICK 0 +#define NOHZ_NEED_BALANCING 1 +/* + * idle load balancing details + * - When one of the busy CPUs notice that there may be an idle rebalancing + * needed, they will kick the idle load balancer, which then does idle + * load balancing for all the idle CPUs. 
+ */ +static struct { + cpumask_var_t idle_cpus_mask; + unsigned long next_balance; /* in jiffy units */ + unsigned long bits; +} nohz ____cacheline_aligned; + /* * In the semi idle case, use the nearest busy cpu for migrating timers * from an idle cpu. This is good for power-savings. @@ -1406,7 +1419,8 @@ void wake_up_idle_cpu(int cpu) static inline bool got_nohz_idle_kick(void) { - return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick; + return idle_cpu(smp_processor_id()) && + test_bit(NOHZ_BALANCE_KICK, &nohz.bits); } #else /* CONFIG_NO_HZ */ @@ -8297,9 +8311,6 @@ void __init sched_init(void) rq->idle_stamp = 0; rq->avg_idle = 2*sysctl_sched_migration_cost; rq_attach_root(rq, &def_root_domain); -#ifdef CONFIG_NO_HZ - rq->nohz_balance_kick = 0; -#endif #endif init_rq_hrtick(rq); atomic_set(&rq->nr_iowait, 0); @@ -8344,10 +8355,6 @@ void __init sched_init(void) zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT); #ifdef CONFIG_NO_HZ zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); - alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT); - atomic_set(&nohz.load_balancer, nr_cpu_ids); - atomic_set(&nohz.first_pick_cpu, nr_cpu_ids); - atomic_set(&nohz.second_pick_cpu, nr_cpu_ids); #endif /* May be allocated at isolcpus cmdline parse time */ if (cpu_isolated_map == NULL) diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5c9e679..d18a1de 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -3317,6 +3317,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) cpu_rq(cpu)->cpu_power = power; sdg->sgp->power = power; + atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight); } static void update_group_power(struct sched_domain *sd, int cpu) @@ -3339,6 +3340,7 @@ static void update_group_power(struct sched_domain *sd, int cpu) } while (group != child->groups); sdg->sgp->power = power; + atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight); } /* @@ -4269,30 +4271,6 @@ out_unlock: } #ifdef CONFIG_NO_HZ -/* - * idle load 
balancing details - * - One of the idle CPUs nominates itself as idle load_balancer, while - * entering idle. - * - This idle load balancer CPU will also go into tickless mode when - * it is idle, just like all other idle CPUs - * - When one of the busy CPUs notice that there may be an idle rebalancing - * needed, they will kick the idle load balancer, which then does idle - * load balancing for all the idle CPUs. - */ -static struct { - atomic_t load_balancer; - atomic_t first_pick_cpu; - atomic_t second_pick_cpu; - cpumask_var_t idle_cpus_mask; - cpumask_var_t grp_idle_mask; - unsigned long next_balance; /* in jiffy units */ -} nohz ____cacheline_aligned; - -int get_nohz_load_balancer(void) -{ - return atomic_read(&nohz.load_balancer); -} - #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) /** * lowest_flag_domain - Return lowest sched_domain containing flag. @@ -4329,33 +4307,6 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) (sd && (sd->flags & flag)); sd = sd->parent) /** - * is_semi_idle_group - Checks if the given sched_group is semi-idle. - * @ilb_group: group to be checked for semi-idleness - * - * Returns: 1 if the group is semi-idle. 0 otherwise. - * - * We define a sched_group to be semi idle if it has atleast one idle-CPU - * and atleast one non-idle CPU. This helper function checks if the given - * sched_group is semi-idle or not. - */ -static inline int is_semi_idle_group(struct sched_group *ilb_group) -{ - cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask, - sched_group_cpus(ilb_group)); - - /* - * A sched_group is semi-idle when it has atleast one busy cpu - * and atleast one idle cpu. - */ - if (cpumask_empty(nohz.grp_idle_mask)) - return 0; - - if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group))) - return 0; - - return 1; -} -/** * find_new_ilb - Finds the optimum idle load balancer for nomination. * @cpu: The cpu which is nominating a new idle_load_balancer. 
* @@ -4369,9 +4320,9 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group) */ static int find_new_ilb(int cpu) { + int ilb = cpumask_first(nohz.idle_cpus_mask); + struct sched_group *ilbg; struct sched_domain *sd; - struct sched_group *ilb_group; - int ilb = nr_cpu_ids; /* * Have idle load balancer selection from semi-idle packages only @@ -4389,23 +4340,28 @@ static int find_new_ilb(int cpu) rcu_read_lock(); for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) { - ilb_group = sd->groups; + ilbg = sd->groups; do { - if (is_semi_idle_group(ilb_group)) { - ilb = cpumask_first(nohz.grp_idle_mask); + if (ilbg->group_weight != + atomic_read(&ilbg->sgp->nr_busy_cpus)) { + ilb = cpumask_first_and(nohz.idle_cpus_mask, + sched_group_cpus(ilbg)); goto unlock; } - ilb_group = ilb_group->next; + ilbg = ilbg->next; - } while (ilb_group != sd->groups); + } while (ilbg != sd->groups); } unlock: rcu_read_unlock(); out_done: - return ilb; + if (ilb < nr_cpu_ids && idle_cpu(ilb)) + return ilb; + + return nr_cpu_ids; } #else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */ static inline int find_new_ilb(int call_cpu) @@ -4423,102 +4379,52 @@ static void nohz_balancer_kick(int cpu) { int ilb_cpu; - nohz.next_balance++; - - ilb_cpu = get_nohz_load_balancer(); + ilb_cpu = find_new_ilb(cpu); if (ilb_cpu >= nr_cpu_ids) { - ilb_cpu = cpumask_first(nohz.idle_cpus_mask); - if (ilb_cpu >= nr_cpu_ids) - return; + clear_bit(NOHZ_BALANCE_KICK, &nohz.bits); + return; } - if (!cpu_rq(ilb_cpu)->nohz_balance_kick) { - cpu_rq(ilb_cpu)->nohz_balance_kick = 1; - - smp_mb(); - /* - * Use smp_send_reschedule() instead of resched_cpu(). - * This way we generate a sched IPI on the target cpu which - * is idle. And the softirq performing nohz idle load balance - * will be run before returning from the IPI. - */ - smp_send_reschedule(ilb_cpu); - } + /* + * Use smp_send_reschedule() instead of resched_cpu(). + * This way we generate a sched IPI on the target cpu which + * is idle. 
And the softirq performing nohz idle load balance + * will be run before returning from the IPI. + */ + smp_send_reschedule(ilb_cpu); return; } -/* - * This routine will try to nominate the ilb (idle load balancing) - * owner among the cpus whose ticks are stopped. ilb owner will do the idle - * load balancing on behalf of all those cpus. - * - * When the ilb owner becomes busy, we will not have new ilb owner until some - * idle CPU wakes up and goes back to idle or some busy CPU tries to kick - * idle load balancing by kicking one of the idle CPUs. - * - * Ticks are stopped for the ilb owner as well, with busy CPU kicking this - * ilb owner CPU in future (when there is a need for idle load balancing on - * behalf of all idle CPUs). - */ void select_nohz_load_balancer(int stop_tick) { int cpu = smp_processor_id(); + struct sched_domain *sd; if (stop_tick) { - if (!cpu_active(cpu)) { - if (atomic_read(&nohz.load_balancer) != cpu) - return; - - /* - * If we are going offline and still the leader, - * give up! - */ - if (atomic_cmpxchg(&nohz.load_balancer, cpu, - nr_cpu_ids) != cpu) - BUG(); - + if (this_rq()->tick_stopped) return; - } - cpumask_set_cpu(cpu, nohz.idle_cpus_mask); - if (atomic_read(&nohz.first_pick_cpu) == cpu) - atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids); - if (atomic_read(&nohz.second_pick_cpu) == cpu) - atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids); - - if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) { - int new_ilb; + this_rq()->tick_stopped = 1; - /* make me the ilb owner */ - if (atomic_cmpxchg(&nohz.load_balancer, nr_cpu_ids, - cpu) != nr_cpu_ids) - return; + /* + * Indicate the idle state to all the scheduler groups that + * this cpu is part of. + */ + for_each_domain(cpu, sd) + atomic_dec(&sd->groups->sgp->nr_busy_cpus); - /* - * Check to see if there is a more power-efficient - * ilb. 
- */ - new_ilb = find_new_ilb(cpu); - if (new_ilb < nr_cpu_ids && new_ilb != cpu) { - atomic_set(&nohz.load_balancer, nr_cpu_ids); - resched_cpu(new_ilb); - return; - } - return; - } - } else { - if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) + if (test_bit(NOHZ_NEED_BALANCING, &nohz.bits)) return; - cpumask_clear_cpu(cpu, nohz.idle_cpus_mask); - - if (atomic_read(&nohz.load_balancer) == cpu) - if (atomic_cmpxchg(&nohz.load_balancer, cpu, - nr_cpu_ids) != cpu) - BUG(); + /* + * Indicate that there is atleast one cpu in tickless mode + * and hence the possible need for NOHZ idle load balancing. + */ + set_bit(NOHZ_NEED_BALANCING, &nohz.bits); } + return; } #endif @@ -4623,11 +4529,11 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) struct rq *rq; int balance_cpu; - if (idle != CPU_IDLE || !this_rq->nohz_balance_kick) + if (idle != CPU_IDLE || !(test_bit(NOHZ_BALANCE_KICK, &nohz.bits))) return; for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { - if (balance_cpu == this_cpu) + if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) continue; /* @@ -4636,7 +4542,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) * balancing owner will pick it up. */ if (need_resched()) { - this_rq->nohz_balance_kick = 0; + clear_bit(NOHZ_BALANCE_KICK, &nohz.bits); break; } @@ -4652,53 +4558,73 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) this_rq->next_balance = rq->next_balance; } nohz.next_balance = this_rq->next_balance; - this_rq->nohz_balance_kick = 0; + clear_bit(NOHZ_BALANCE_KICK, &nohz.bits); } /* - * Current heuristic for kicking the idle load balancer - * - first_pick_cpu is the one of the busy CPUs. It will kick - * idle load balancer when it has more than one process active. This - * eliminates the need for idle load balancing altogether when we have - * only one running process in the system (common case). 
- * - If there are more than one busy CPU, idle load balancer may have
- *   to run for active_load_balance to happen (i.e., two busy CPUs are
- *   SMT or core siblings and can run better if they move to different
- *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
- *   which will kick idle load balancer as soon as it has any load.
+ * Current heuristic for kicking the idle load balancer in the presence
+ * of an idle cpu is the system.
+ * - This rq has more than one task.
+ * - At any scheduler domain level, this cpu's scheduler group has multiple
+ *   busy cpu's exceeding the group's power.
+ * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
+ *   domain span are idle.
  */
 static inline int nohz_kick_needed(struct rq *rq, int cpu)
 {
 	unsigned long now = jiffies;
-	int ret;
-	int first_pick_cpu, second_pick_cpu;
-
-	if (time_before(now, nohz.next_balance))
-		return 0;
+	struct sched_domain *sd;

 	if (idle_cpu(cpu))
 		return 0;

-	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
-	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
+	/*
+	 * We were recently in tickless idle mode. We will do the delayed
+	 * update of busy mode now (first busy tick after returning from idle).
+	 */
+	if (unlikely(rq->tick_stopped)) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+
+		if (cpumask_bits(nohz.idle_cpus_mask)[BIT_WORD(cpu)] == 0 &&
+		    cpumask_empty(nohz.idle_cpus_mask))
+			clear_bit(NOHZ_NEED_BALANCING, &nohz.bits);
+
+		rq->tick_stopped = 0;
+		for_each_domain(cpu, sd)
+			atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+	}

-	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
-	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
+	if (likely(!test_bit(NOHZ_NEED_BALANCING, &nohz.bits)))
 		return 0;

-	ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
-	if (ret == nr_cpu_ids || ret == cpu) {
-		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
-		if (rq->nr_running > 1)
-			return 1;
-	} else {
-		ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
-		if (ret == nr_cpu_ids || ret == cpu) {
-			if (rq->nr_running)
-				return 1;
-		}
+	if (rq->nr_running >= 2)
+		goto need_kick;
+
+	for_each_domain(cpu, sd) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group_power *sgp = sg->sgp;
+		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+		if (nr_busy > 1 && (nr_busy * SCHED_LOAD_SCALE > sgp->power))
+			goto need_kick;
+
+		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
+		    && (cpumask_first_and(nohz.idle_cpus_mask,
+					  sched_domain_span(sd)) < cpu))
+			goto need_kick;
 	}
+	return 0;
+
+need_kick:
+	if (time_before(now, nohz.next_balance) ||
+	    test_bit(NOHZ_BALANCE_KICK, &nohz.bits))
+		return 0;
+
+	if (test_and_set_bit(NOHZ_BALANCE_KICK, &nohz.bits))
+		return 0;
+
+	return 1;
 }
 #else
 static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }

^ permalink raw reply related	[flat|nested] 15+ messages in thread
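The kick decision in the new nohz_kick_needed() above can be sketched as a small user-space model. Everything here is a stand-in: `struct group_state` abbreviates the kernel's `struct sched_group_power`, `SCHED_LOAD_SCALE` is assumed to be 1024 as in kernels of this era, and the SD_ASYM_PACKING case is omitted for brevity.

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024

/* Stand-in for struct sched_group_power: busy count and group capacity. */
struct group_state {
	int nr_busy_cpus;	/* busy cpus currently in this group */
	int power;		/* capacity; SCHED_LOAD_SCALE per full cpu */
};

/*
 * Should a busy cpu kick the idle load balancer?  Mirrors the heuristic
 * above: never from an idle cpu; yes when this rq has two or more tasks;
 * yes when some group's busy cpus exceed the group's power.
 */
static int kick_needed(int cpu_is_idle, int nr_running,
		       const struct group_state *groups, int ngroups)
{
	int i;

	if (cpu_is_idle)
		return 0;
	if (nr_running >= 2)
		return 1;

	for (i = 0; i < ngroups; i++) {
		int nr_busy = groups[i].nr_busy_cpus;

		if (nr_busy > 1 && nr_busy * SCHED_LOAD_SCALE > groups[i].power)
			return 1;
	}
	return 0;
}
```

With an SMT core whose two siblings share roughly 1178 units of power, two busy siblings trip the check (2048 > 1178) and a kick happens; two busy full cores (2048 units of power) do not, since neither cpu is overloaded.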
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-01 23:52 ` Suresh Siddha
@ 2011-11-02 13:04 ` Srivatsa Vaddagiri
2011-11-02 13:54 ` Srivatsa Vaddagiri
2011-11-14 9:32 ` Peter Zijlstra
2 siblings, 0 replies; 15+ messages in thread
From: Srivatsa Vaddagiri @ 2011-11-02 13:04 UTC (permalink / raw)
To: Suresh Siddha
Cc: Venki Pallipadi, Peter Zijlstra, Andi Kleen, Tim Chen,
	Ingo Molnar, linux-kernel@vger.kernel.org

* Suresh Siddha <suresh.b.siddha@intel.com> [2011-11-01 16:52:38]:

> That is correct. So to minimize the global updates I did two things.

Another issue I am trying to solve with the nohz balancer is the
staleness of nohz.next_balance. As cpus enter/exit tickless state,
nohz.next_balance is not updated to reflect that. I have found that it's
a predominant source of increased idle time with the CFS bandwidth
patches:

https://lkml.org/lkml/2011/9/26/162

While it may be costly to update nohz.next_balance during the exit
event, I think we should update it upon entry.

I am working on a patch to make nohz.next_balance more precise. This
requires tickless idle cpus to be tracked in a global rb-tree (each
tickless idle cpu is inserted in this tree, indexed by its
rq->next_balance). Having this rb-tree makes it easy to precisely
determine min(rq->next_balance) when a kick is deserved, which I am
hoping is good both for performance (on-the-dot wakeup) and power
(avoiding unnecessary/early wakeups).

Draft patch is below. I was intending to post it soon along with some
real-world benchmark numbers (based on matrix multiplication).

Any other suggestions to avoid this global rb-tree, but still keep
nohz.next_balance as precise as possible, would be welcome!
Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 Makefile                |    2
 kernel/sched.c          |    9 ++
 kernel/sched_fair.c     |  185 +++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_idletask.c |    2
 4 files changed, 179 insertions(+), 19 deletions(-)

Index: current/Makefile
===================================================================
--- current.orig/Makefile
+++ current/Makefile
@@ -1,6 +1,6 @@
 VERSION = 3
 PATCHLEVEL = 1
-SUBLEVEL = 0
+SUBLEVEL = 0-myp
 EXTRAVERSION =
 NAME = "Divemaster Edition"

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -602,6 +602,7 @@ struct rq {
 #ifdef CONFIG_NO_HZ
 	u64 nohz_stamp;
 	unsigned char nohz_balance_kick;
+	struct rb_node nohz_node;
 #endif
 	int skip_clock_update;
@@ -1409,6 +1410,8 @@ static inline bool got_nohz_idle_kick(vo
 	return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
 }

+static void reset_first_second_pick_cpu(int cpu);
+
 #else /* CONFIG_NO_HZ */

 static inline bool got_nohz_idle_kick(void)
@@ -1416,6 +1419,8 @@ static inline bool got_nohz_idle_kick(vo
 	return false;
 }

+static inline void reset_first_second_pick_cpu(int cpu) { }
+
 #endif /* CONFIG_NO_HZ */

 static u64 sched_avg_period(void)
@@ -8299,6 +8304,7 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
+		rb_init_node(&rq->nohz_node);
 #endif
 #endif
 		init_rq_hrtick(rq);
@@ -8344,10 +8350,13 @@ void __init sched_init(void)
 	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&nohz.stale_idle_cpus_mask, GFP_NOWAIT);
 	alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
 	atomic_set(&nohz.load_balancer, nr_cpu_ids);
 	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
 	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
+	nohz.rq_next_balance = RB_ROOT;
+	spin_lock_init(&nohz.next_balance_lock);
 #endif
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -4285,7 +4285,11 @@ static struct {
 	atomic_t second_pick_cpu;
 	cpumask_var_t idle_cpus_mask;
 	cpumask_var_t grp_idle_mask;
+	cpumask_var_t stale_idle_cpus_mask;
 	unsigned long next_balance;	/* in jiffy units */
+	struct rb_root rq_next_balance;
+	struct rb_node *rb_leftmost;
+	spinlock_t next_balance_lock;
 } nohz ____cacheline_aligned;

 int get_nohz_load_balancer(void)
@@ -4427,8 +4431,12 @@ static void nohz_balancer_kick(int cpu)

 	ilb_cpu = get_nohz_load_balancer();

-	if (ilb_cpu >= nr_cpu_ids) {
-		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+	/*
+	 * ilb_cpu itself can be attempting to kick another idle cpu. Pick
+	 * another idle cpu in that case.
+	 */
+	if (ilb_cpu >= nr_cpu_ids || ilb_cpu == cpu) {
+		ilb_cpu = cpumask_any_but(nohz.idle_cpus_mask, cpu);

 		if (ilb_cpu >= nr_cpu_ids)
 			return;
 	}
@@ -4449,6 +4457,122 @@ static void nohz_balancer_kick(int cpu)
 }

 /*
+ * Lazy dequeue of no-longer-tickless-idle cpu from rb-tree.
+ *
+ * This is done lazily, as otherwise taking a spinlock (nohz.next_balance_lock)
+ * to do immediate update of rb-tree can slowdown transition out of tickless
+ * state (and thus may hurt scheduling latencies for a task waking on a tickless
+ * idle cpu).
+ */
+static void __dequeue_nohz_idle_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu), *entry;
+
+	WARN_ON(RB_EMPTY_NODE(&rq->nohz_node));
+
+	if (nohz.rb_leftmost == &rq->nohz_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&rq->nohz_node);
+		nohz.rb_leftmost = next_node;
+		if (next_node) {
+			entry = rb_entry(next_node, struct rq, nohz_node);
+
+			nohz.next_balance = entry->next_balance;
+		}
+	}
+
+	rb_erase(&rq->nohz_node, &nohz.rq_next_balance);
+	RB_CLEAR_NODE(&rq->nohz_node);
+}
+
+static void __enqueue_nohz_idle_cpu(int cpu)
+{
+	struct rb_node **link = &nohz.rq_next_balance.rb_node;
+	struct rb_node *parent = NULL;
+	struct rq *entry;
+	struct rq *rq = cpu_rq(cpu);
+	int leftmost = 1;
+
+	/*
+	 * Find the right place in the rbtree:
+	 */
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rq, nohz_node);
+		/*
+		 * We dont care about collisions. Nodes with
+		 * the same key stay together.
+		 */
+		if (time_before(rq->next_balance, entry->next_balance)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used):
+	 */
+	if (leftmost) {
+		nohz.rb_leftmost = &rq->nohz_node;
+		nohz.next_balance = rq->next_balance;
+	}
+
+	rb_link_node(&rq->nohz_node, parent, link);
+	rb_insert_color(&rq->nohz_node, &nohz.rq_next_balance);
+}
+
+static void __clear_stale_nohz_idle_cpus(void)
+{
+	int stale_idle_cpu;
+
+	for_each_cpu(stale_idle_cpu, nohz.stale_idle_cpus_mask) {
+		__dequeue_nohz_idle_cpu(stale_idle_cpu);
+		cpumask_clear_cpu(stale_idle_cpu, nohz.stale_idle_cpus_mask);
+	}
+}
+
+/*
+ * Remove no-longer-tickless-idle cpus from rb-tree.
+ */
+static void clear_stale_nohz_idle_cpus(void)
+{
+	spin_lock(&nohz.next_balance_lock);
+	__clear_stale_nohz_idle_cpus();
+	spin_unlock(&nohz.next_balance_lock);
+}
+
+/*
+ * Enqueue a idle cpu going tickless in a rb-tree, indexed by its
+ * rq->next_balance value.
+ */
+static void enqueue_nohz_idle_cpu(int cpu)
+{
+	spin_lock(&nohz.next_balance_lock);
+	if (cpumask_test_cpu(cpu, nohz.stale_idle_cpus_mask)) {
+		__dequeue_nohz_idle_cpu(cpu);
+		cpumask_clear_cpu(cpu, nohz.stale_idle_cpus_mask);
+	}
+	__enqueue_nohz_idle_cpu(cpu);
+	spin_unlock(&nohz.next_balance_lock);
+}
+
+/*
+ * Dequeue and enqueue a tickless idle cpu indexed at its new rq->next_balance
+ * value.
+ */
+static void re_enqueue_nohz_idle_cpu(int cpu)
+{
+	spin_lock(&nohz.next_balance_lock);
+	__dequeue_nohz_idle_cpu(cpu);
+	__enqueue_nohz_idle_cpu(cpu);
+	spin_unlock(&nohz.next_balance_lock);
+}
+
+/*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
  * load balancing on behalf of all those cpus.
@@ -4481,12 +4605,9 @@ void select_nohz_load_balancer(int stop_
 			return;
 		}

-		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
+		enqueue_nohz_idle_cpu(cpu);

-		if (atomic_read(&nohz.first_pick_cpu) == cpu)
-			atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
-		if (atomic_read(&nohz.second_pick_cpu) == cpu)
-			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
+		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);

 		if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
 			int new_ilb;
@@ -4512,6 +4633,7 @@ void select_nohz_load_balancer(int stop_
 		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 			return;

+		cpumask_set_cpu(cpu, nohz.stale_idle_cpus_mask);
 		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);

 		if (atomic_read(&nohz.load_balancer) == cpu)
@@ -4622,10 +4744,20 @@ static void nohz_idle_balance(int this_c
 	struct rq *this_rq = cpu_rq(this_cpu);
 	struct rq *rq;
 	int balance_cpu;
+	unsigned long old_next_balance;

-	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+	if (!this_rq->nohz_balance_kick)
 		return;

+	/* Wakeup another idle cpu to do idle load balance if we got busy */
+	if (!idle_cpu(this_cpu)) {
+		nohz_balancer_kick(this_cpu);
+		goto out;
+	}
+
+	/* Remove no-longer-tickless-idle cpus from rb-tree */
+	clear_stale_nohz_idle_cpus();
+
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu)
 			continue;
@@ -4636,8 +4768,8 @@ static void nohz_idle_balance(int this_c
 		 * balancing owner will pick it up.
 		 */
 		if (need_resched()) {
-			this_rq->nohz_balance_kick = 0;
-			break;
+			nohz_balancer_kick(this_cpu);
+			goto out;
 		}

 		raw_spin_lock_irq(&this_rq->lock);
@@ -4645,13 +4777,15 @@ static void nohz_idle_balance(int this_c
 		update_cpu_load(this_rq);
 		raw_spin_unlock_irq(&this_rq->lock);

-		rebalance_domains(balance_cpu, CPU_IDLE);
-
 		rq = cpu_rq(balance_cpu);
-		if (time_after(this_rq->next_balance, rq->next_balance))
-			this_rq->next_balance = rq->next_balance;
+		old_next_balance = rq->next_balance;
+		if (time_after_eq(jiffies, old_next_balance)) {
+			rebalance_domains(balance_cpu, CPU_IDLE);
+			if (rq->next_balance != old_next_balance)
+				re_enqueue_nohz_idle_cpu(balance_cpu);
+		}
 	}
-	nohz.next_balance = this_rq->next_balance;
+out:
 	this_rq->nohz_balance_kick = 0;
 }
@@ -4740,7 +4889,7 @@ static inline void trigger_load_balance( likely(!on_null_domain(cpu))) raise_softirq(SCHED_SOFTIRQ); #ifdef CONFIG_NO_HZ - else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu))) + if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu))) nohz_balancer_kick(cpu); #endif } Index: current/kernel/sched_idletask.c =================================================================== --- current.orig/kernel/sched_idletask.c +++ current/kernel/sched_idletask.c @@ -24,6 +24,8 @@ static struct task_struct *pick_next_tas { schedstat_inc(rq, sched_goidle); calc_load_account_idle(rq); + reset_first_second_pick_cpu(cpu_of(rq)); + return rq->idle; } ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-01 23:52 ` Suresh Siddha
2011-11-02 13:04 ` Srivatsa Vaddagiri
@ 2011-11-02 13:54 ` Srivatsa Vaddagiri
2011-11-02 15:13 ` Suresh Siddha
2011-11-14 9:32 ` Peter Zijlstra
2 siblings, 1 reply; 15+ messages in thread
From: Srivatsa Vaddagiri @ 2011-11-02 13:54 UTC (permalink / raw)
To: Suresh Siddha
Cc: Venki Pallipadi, Peter Zijlstra, Andi Kleen, Tim Chen,
	Ingo Molnar, linux-kernel@vger.kernel.org

* Suresh Siddha <suresh.b.siddha@intel.com> [2011-11-01 16:52:38]:

> +	/*
> +	 * We were recently in tickless idle mode. We will do the delayed
> +	 * update of busy mode now (first busy tick after returning from idle).
> +	 */
> +	if (unlikely(rq->tick_stopped)) {
> +		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> +
> +		if (cpumask_bits(nohz.idle_cpus_mask)[BIT_WORD(cpu)] == 0 &&
> +		    cpumask_empty(nohz.idle_cpus_mask))
> +			clear_bit(NOHZ_NEED_BALANCING, &nohz.bits);

Can't this clear_bit race with set_bit() in select_nohz_load_balancer()?

	CPU0				CPU1

	cpumask_clear_cpu()
					if (...)
						cpumask_set_cpu();
					set_bit();

	clear_bit();

?

- vatsa

^ permalink raw reply	[flat|nested] 15+ messages in thread
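The window Vatsa describes can be replayed deterministically by serializing the interleaving into one thread. The globals below are stand-ins for nohz.idle_cpus_mask and the NOHZ_NEED_BALANCING bit; this is a sketch of the lost-update scenario, not kernel code.

```c
#include <assert.h>

static unsigned int idle_cpus_mask;	/* stand-in for nohz.idle_cpus_mask */
static int need_balancing;		/* stand-in for NOHZ_NEED_BALANCING */

/* Returns 1 if the lost-update bug reproduces. */
static int replay_race(void)
{
	idle_cpus_mask = 1u << 0;	/* cpu0 is the only tickless cpu */
	need_balancing = 1;

	/* CPU0, first busy tick: leaves tickless mode */
	idle_cpus_mask &= ~(1u << 0);		/* cpumask_clear_cpu()  */
	int saw_empty = (idle_cpus_mask == 0);	/* emptiness check      */

	/* CPU1 goes tickless idle in between */
	idle_cpus_mask |= 1u << 1;		/* cpumask_set_cpu()    */
	need_balancing = 1;			/* set_bit()            */

	/* CPU0 now acts on its stale emptiness check */
	if (saw_empty)
		need_balancing = 0;		/* clear_bit()          */

	/* cpu1 is tickless idle, yet NEED_BALANCING ended up clear */
	return idle_cpus_mask != 0 && need_balancing == 0;
}
```

The check and the clear are not one atomic step, so any writer that slips between them can have its set_bit() wiped out, leaving a tickless cpu that no busy cpu will ever kick.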
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-02 13:54 ` Srivatsa Vaddagiri
@ 2011-11-02 15:13 ` Suresh Siddha
0 siblings, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-11-02 15:13 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: Venki Pallipadi, Peter Zijlstra, Andi Kleen, Tim Chen,
	Ingo Molnar, linux-kernel@vger.kernel.org

On Wed, 2011-11-02 at 06:54 -0700, Srivatsa Vaddagiri wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> [2011-11-01 16:52:38]:
>
> > +	/*
> > +	 * We were recently in tickless idle mode. We will do the delayed
> > +	 * update of busy mode now (first busy tick after returning from idle).
> > +	 */
> > +	if (unlikely(rq->tick_stopped)) {
> > +		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > +
> > +		if (cpumask_bits(nohz.idle_cpus_mask)[BIT_WORD(cpu)] == 0 &&
> > +		    cpumask_empty(nohz.idle_cpus_mask))
> > +			clear_bit(NOHZ_NEED_BALANCING, &nohz.bits);
>
> Can't this clear_bit race with set_bit() in select_nohz_load_balancer()?
>
> 	CPU0				CPU1
>
> 	cpumask_clear_cpu()
> 					if (...)
> 						cpumask_set_cpu();
> 					set_bit();
>
> 	clear_bit();
>
> ?

Yes. All I want is a quick way of getting
cpumask_weight(nohz.idle_cpus_mask), so that the busy cpu need not spend
much time to see if there is an idle cpu that needs idle load balancing.

Let me see if there is any simple way, or else we need nohz.nr_idle_cpus.

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
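The nohz.nr_idle_cpus idea mentioned here amounts to keeping an atomic counter next to the mask, so the busy-tick fast path does one read instead of a cpumask_weight() scan. A minimal C11 sketch of that bookkeeping (the names are hypothetical, and the mask/counter pair is simplified to a single word):

```c
#include <assert.h>
#include <stdatomic.h>

static unsigned int idle_cpus_mask;	/* stand-in for nohz.idle_cpus_mask */
static atomic_int nr_idle_cpus;		/* the proposed nohz.nr_idle_cpus */

static void cpu_enter_tickless(int cpu)
{
	idle_cpus_mask |= 1u << cpu;
	atomic_fetch_add(&nr_idle_cpus, 1);	/* counter tracks the mask */
}

static void cpu_exit_tickless(int cpu)
{
	idle_cpus_mask &= ~(1u << cpu);
	atomic_fetch_sub(&nr_idle_cpus, 1);
}

/* Busy-tick fast path: one atomic load instead of scanning the mask. */
static int nohz_idle_cpus_present(void)
{
	return atomic_load(&nr_idle_cpus) > 0;
}
```

Because each cpu increments exactly when it sets its own mask bit and decrements exactly when it clears it, the counter cannot suffer the check-then-clear window of the emptiness test: there is no separate "is it empty" read to go stale.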
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-01 23:52 ` Suresh Siddha
2011-11-02 13:04 ` Srivatsa Vaddagiri
2011-11-02 13:54 ` Srivatsa Vaddagiri
@ 2011-11-14 9:32 ` Peter Zijlstra
2011-11-14 19:37 ` Suresh Siddha
2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2011-11-14 9:32 UTC (permalink / raw)
To: Suresh Siddha
Cc: Venki Pallipadi, Andi Kleen, Tim Chen, Ingo Molnar,
	linux-kernel@vger.kernel.org

On Tue, 2011-11-01 at 16:52 -0700, Suresh Siddha wrote:
> @@ -3317,6 +3317,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
>
>  	cpu_rq(cpu)->cpu_power = power;
>  	sdg->sgp->power = power;
> +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
>  }
>
>  static void update_group_power(struct sched_domain *sd, int cpu)
> @@ -3339,6 +3340,7 @@ static void update_group_power(struct sched_domain *sd, int cpu)
>  	} while (group != child->groups);
>
>  	sdg->sgp->power = power;
> +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
>  }

So we run this rather frequently, and it will trample all over:

> +	 */
> +	for_each_domain(cpu, sd)
> +		atomic_dec(&sd->groups->sgp->nr_busy_cpus);

because I cannot see any serialization between those sites.

Also, isn't it rather weird to just assume all cpus are busy in
update_group_power()? If you would actually set the right value in
update_cpu_power(), you could use a straight sum in update_group_power()
and get a more or less accurate number out.

^ permalink raw reply	[flat|nested] 15+ messages in thread
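Peter's "straight sum" suggestion is: have update_cpu_power() record each cpu's own busy state, and let update_group_power() sum those records instead of resetting nr_busy_cpus to group_weight. The summing step can be sketched as follows (illustrative only; the per-cpu flag array stands in for state the kernel would keep per rq):

```c
#include <assert.h>

/*
 * Straight sum of per-cpu busy flags across a scheduler group, rather
 * than assuming every cpu in the group is busy (group_weight).
 */
static int group_nr_busy(const int cpu_busy[], int group_weight)
{
	int i, nr_busy = 0;

	for (i = 0; i < group_weight; i++)
		nr_busy += cpu_busy[i] ? 1 : 0;
	return nr_busy;
}
```

The sum reflects the actual busy count at refresh time, so a periodic group-power update no longer stomps on the increments and decrements done by cpus entering and leaving tickless idle.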
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-14 9:32 ` Peter Zijlstra
@ 2011-11-14 19:37 ` Suresh Siddha
0 siblings, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-11-14 19:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Venki Pallipadi, Andi Kleen, Tim Chen, Ingo Molnar,
	linux-kernel@vger.kernel.org

On Mon, 2011-11-14 at 01:32 -0800, Peter Zijlstra wrote:
> On Tue, 2011-11-01 at 16:52 -0700, Suresh Siddha wrote:
> > @@ -3317,6 +3317,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
> >
> >  	cpu_rq(cpu)->cpu_power = power;
> >  	sdg->sgp->power = power;
> > +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
> >  }
> >
> >  static void update_group_power(struct sched_domain *sd, int cpu)
> > @@ -3339,6 +3340,7 @@ static void update_group_power(struct sched_domain *sd, int cpu)
> >  	} while (group != child->groups);
> >
> >  	sdg->sgp->power = power;
> > +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
> >  }
>
> So we run this rather frequently, and it will trample all over:
>
> > +	 */
> > +	for_each_domain(cpu, sd)
> > +		atomic_dec(&sd->groups->sgp->nr_busy_cpus);
>
> because I cannot see any serialization between those sites.

That was an oversight on my side. I have now moved the initialization of
this to init_sched_groups_power(), so there is no need to redo it every
time we call update_group_power().

> Also, isn't it rather weird to just assume all cpus are busy in
> update_group_power()? If you would actually set the right value in
> update_cpu_power() you could use a straight sum in update_group_power()
> and get a more or less accurate number out.

I will post an updated patch soon (once we complete the performance
analysis of this patch with respect to other workloads), but the hunk
below gives an idea of what I am planning to do now.
@@ -7369,6 +7384,7 @@ static void init_sched_groups_power(int cpu, struct sched_
 		return;

 	update_group_power(sd, cpu);
+	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }

 /*

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2011-11-14 19:34 UTC | newest]
Thread overview: 15+ messages
2011-10-19 21:45 [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability Tim Chen
2011-10-20 4:18 ` Eric Dumazet
2011-10-20 5:57 ` Suresh Siddha
2011-10-20 6:43 ` Eric Dumazet
2011-10-20 17:19 ` Tim Chen
2011-10-20 4:24 ` Andi Kleen
2011-10-20 12:26 ` Venki Pallipadi
2011-10-20 17:31 ` Suresh Siddha
2011-10-20 17:38 ` Peter Zijlstra
[not found] ` <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>
2011-11-01 23:52 ` Suresh Siddha
2011-11-02 13:04 ` Srivatsa Vaddagiri
2011-11-02 13:54 ` Srivatsa Vaddagiri
2011-11-02 15:13 ` Suresh Siddha
2011-11-14 9:32 ` Peter Zijlstra
2011-11-14 19:37 ` Suresh Siddha