* [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
@ 2011-10-19 21:45 Tim Chen
2011-10-20 4:18 ` Eric Dumazet
2011-10-20 4:24 ` Andi Kleen
0 siblings, 2 replies; 15+ messages in thread
From: Tim Chen @ 2011-10-19 21:45 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, Andi Kleen, Suresh Siddha, Venki Pallipadi
Idle load balancing makes use of a global structure, nohz, to keep track
of the cpu doing the idle load balancing, the first and second busy
cpus, and the cpus that are idle. This leads to a scalability issue.
For workloads whose processes wake up and go to sleep often, the
load_balancer, first_pick_cpu, second_pick_cpu and idle_cpus_mask fields
in the nohz structure get updated very frequently. This causes heavy
cache-line bouncing and slows down the idle and wakeup paths on large
systems with many cores/sockets. This is evident from up to 41% of cpu
cycles being spent in select_nohz_load_balancer in a test workload I
ran. Putting these fields on their own cache lines mitigates the
problem.
The test workload has multiple pairs of processes. Within a pair, each
process receives and then sends a message back and forth to the other
process via a pipe connecting them. So at any one time, half of the
processes are active.
For 32 pairs of processes I measured a 37% increase in the rate of
context switching between the processes, and a 24% increase for 64
process pairs. The test was run on an 8-socket, 64-core NHM-EX system
with hyper-threading turned on.
Tim
Workload cpu cycle profile on vanilla kernel:
41.19% swapper [kernel.kallsyms] [k] select_nohz_load_balancer
- select_nohz_load_balancer
+ 54.91% tick_nohz_restart_sched_tick
+ 45.04% tick_nohz_stop_sched_tick
18.96% swapper [kernel.kallsyms] [k] mwait_idle_with_hints
3.50% swapper [kernel.kallsyms] [k] tick_nohz_restart_sched_tick
3.36% swapper [kernel.kallsyms] [k] tick_check_idle
2.96% swapper [kernel.kallsyms] [k] rcu_enter_nohz
2.40% swapper [kernel.kallsyms] [k] _raw_spin_lock
2.11% swapper [kernel.kallsyms] [k] tick_nohz_stop_sched_tick
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..26ea877 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3639,10 +3639,10 @@ static inline void init_sched_softirq_csd(struct call_single_data *csd)
* load balancing for all the idle CPUs.
*/
static struct {
- atomic_t load_balancer;
- atomic_t first_pick_cpu;
- atomic_t second_pick_cpu;
- cpumask_var_t idle_cpus_mask;
+ atomic_t load_balancer ____cacheline_aligned;
+ atomic_t first_pick_cpu ____cacheline_aligned;
+ atomic_t second_pick_cpu ____cacheline_aligned;
+ cpumask_var_t idle_cpus_mask ____cacheline_aligned;
cpumask_var_t grp_idle_mask;
unsigned long next_balance; /* in jiffy units */
} nohz ____cacheline_aligned;
^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-19 21:45 [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability Tim Chen
@ 2011-10-20  4:18 ` Eric Dumazet
  2011-10-20  5:57   ` Suresh Siddha
  2011-10-20 17:19   ` Tim Chen
  2011-10-20  4:24 ` Andi Kleen
  1 sibling, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-10-20  4:18 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Andi Kleen,
	Suresh Siddha, Venki Pallipadi

On Wednesday, 19 October 2011 at 14:45 -0700, Tim Chen wrote:
> Idle load balancing makes use of a global structure, nohz, to keep track
> of the cpu doing the idle load balancing, the first and second busy
> cpus, and the cpus that are idle. This leads to a scalability issue.
>
> For workloads whose processes wake up and go to sleep often, the
> load_balancer, first_pick_cpu, second_pick_cpu and idle_cpus_mask fields
> in the nohz structure get updated very frequently. This causes heavy
> cache-line bouncing and slows down the idle and wakeup paths on large
> systems with many cores/sockets. This is evident from up to 41% of cpu
> cycles being spent in select_nohz_load_balancer in a test workload I
> ran. Putting these fields on their own cache lines mitigates the
> problem.
>
> The test workload has multiple pairs of processes. Within a pair, each
> process receives and then sends a message back and forth to the other
> process via a pipe connecting them. So at any one time, half of the
> processes are active.
>
> For 32 pairs of processes I measured a 37% increase in the rate of
> context switching between the processes, and a 24% increase for 64
> process pairs. The test was run on an 8-socket, 64-core NHM-EX system
> with hyper-threading turned on.
>
> Tim
>
> Workload cpu cycle profile on vanilla kernel:
> 41.19% swapper [kernel.kallsyms] [k] select_nohz_load_balancer
>     - select_nohz_load_balancer
>       + 54.91% tick_nohz_restart_sched_tick
>       + 45.04% tick_nohz_stop_sched_tick
> 18.96% swapper [kernel.kallsyms] [k] mwait_idle_with_hints
>  3.50% swapper [kernel.kallsyms] [k] tick_nohz_restart_sched_tick
>  3.36% swapper [kernel.kallsyms] [k] tick_check_idle
>  2.96% swapper [kernel.kallsyms] [k] rcu_enter_nohz
>  2.40% swapper [kernel.kallsyms] [k] _raw_spin_lock
>  2.11% swapper [kernel.kallsyms] [k] tick_nohz_stop_sched_tick
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index bc8ee99..26ea877 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -3639,10 +3639,10 @@ static inline void init_sched_softirq_csd(struct call_single_data *csd)
>   * load balancing for all the idle CPUs.
>   */
>  static struct {
> -	atomic_t load_balancer;
> -	atomic_t first_pick_cpu;
> -	atomic_t second_pick_cpu;
> -	cpumask_var_t idle_cpus_mask;
> +	atomic_t load_balancer ____cacheline_aligned;
> +	atomic_t first_pick_cpu ____cacheline_aligned;
> +	atomic_t second_pick_cpu ____cacheline_aligned;
> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
>  	cpumask_var_t grp_idle_mask;
>  	unsigned long next_balance;	/* in jiffy units */
>  } nohz ____cacheline_aligned;

Don't you increase the cache footprint, say, on a uniprocessor machine
(CONFIG_SMP=n)?

____cacheline_aligned_in_smp seems more suitable in this case.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  4:18 ` Eric Dumazet
@ 2011-10-20  5:57   ` Suresh Siddha
  2011-10-20  6:43     ` Eric Dumazet
  2011-10-20 17:19   ` Tim Chen
  1 sibling, 1 reply; 15+ messages in thread
From: Suresh Siddha @ 2011-10-20  5:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tim Chen, Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org, Andi Kleen, Venki Pallipadi

On Wed, 2011-10-19 at 21:18 -0700, Eric Dumazet wrote:
> > +	atomic_t load_balancer ____cacheline_aligned;
> > +	atomic_t first_pick_cpu ____cacheline_aligned;
> > +	atomic_t second_pick_cpu ____cacheline_aligned;
> > +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> >  	cpumask_var_t grp_idle_mask;
> >  	unsigned long next_balance;	/* in jiffy units */
> >  } nohz ____cacheline_aligned;
> >
>
> Don't you increase the cache footprint, say, on a uniprocessor machine
> (CONFIG_SMP=n)?
>
> ____cacheline_aligned_in_smp seems more suitable in this case.

I believe this code is already under ifdef CONFIG_SMP.

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  5:57   ` Suresh Siddha
@ 2011-10-20  6:43     ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-10-20  6:43 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Tim Chen, Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org, Andi Kleen, Venki Pallipadi

On Wednesday, 19 October 2011 at 22:57 -0700, Suresh Siddha wrote:
> On Wed, 2011-10-19 at 21:18 -0700, Eric Dumazet wrote:
> > > +	atomic_t load_balancer ____cacheline_aligned;
> > > +	atomic_t first_pick_cpu ____cacheline_aligned;
> > > +	atomic_t second_pick_cpu ____cacheline_aligned;
> > > +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> > >  	cpumask_var_t grp_idle_mask;
> > >  	unsigned long next_balance;	/* in jiffy units */
> > >  } nohz ____cacheline_aligned;
> > >
> >
> > Don't you increase the cache footprint, say, on a uniprocessor machine
> > (CONFIG_SMP=n)?
> >
> > ____cacheline_aligned_in_smp seems more suitable in this case.
>
> I believe this code is already under ifdef CONFIG_SMP.

Right now, yes, but that might change in the future after some code
cleanup / consolidation. This kind of thing gets copy/pasted, and the
right semantics here are ____cacheline_aligned_in_smp. It also matches
the changelog better.

Note: the existing ____cacheline_aligned on the global "nohz" structure
was quite different, and in contradiction with what this patch wants to
do. Previously, we wanted to group all the fields on a single cache
line. We now know that was not a scalable strategy, thanks to Tim.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  4:18 ` Eric Dumazet
  2011-10-20  5:57   ` Suresh Siddha
@ 2011-10-20 17:19   ` Tim Chen
  1 sibling, 0 replies; 15+ messages in thread
From: Tim Chen @ 2011-10-20 17:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Andi Kleen,
	Suresh Siddha, Venki Pallipadi

On Thu, 2011-10-20 at 06:18 +0200, Eric Dumazet wrote:
> Don't you increase the cache footprint, say, on a uniprocessor machine
> (CONFIG_SMP=n)?
>
> ____cacheline_aligned_in_smp seems more suitable in this case.

Okay, using the SMP version of the cache-align attribute in this
updated patch.

Thanks.

------------

Idle load balancing makes use of a global structure, nohz, to keep track
of the cpu doing the idle load balancing, the first and second busy
cpus, and the cpus that are idle. This leads to a scalability issue.

For workloads whose processes wake up and go to sleep often, the
load_balancer, first_pick_cpu, second_pick_cpu and idle_cpus_mask fields
in the nohz structure get updated very frequently. This causes heavy
cache-line bouncing and slows down the idle and wakeup paths on large
systems with many cores/sockets. This is evident from up to 41% of cpu
cycles being spent in select_nohz_load_balancer in a test workload I
ran. Putting these fields on their own cache lines mitigates the
problem.

The test workload has multiple pairs of processes. Within a pair, each
process receives and then sends a message back and forth to the other
process via a pipe connecting them. So at any one time, half of the
processes are active.

For 32 pairs of processes I measured a 37% increase in the rate of
context switching between the processes, and a 24% increase for 64
process pairs. The test was run on an 8-socket, 64-core NHM-EX system
with hyper-threading turned on.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..4ae4b7d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3639,10 +3639,10 @@ static inline void init_sched_softirq_csd(struct call_single_data *csd)
  * load balancing for all the idle CPUs.
  */
 static struct {
-	atomic_t load_balancer;
-	atomic_t first_pick_cpu;
-	atomic_t second_pick_cpu;
-	cpumask_var_t idle_cpus_mask;
+	atomic_t load_balancer ____cacheline_aligned_in_smp;
+	atomic_t first_pick_cpu ____cacheline_aligned_in_smp;
+	atomic_t second_pick_cpu ____cacheline_aligned_in_smp;
+	cpumask_var_t idle_cpus_mask ____cacheline_aligned_in_smp;
 	cpumask_var_t grp_idle_mask;
 	unsigned long next_balance;	/* in jiffy units */
 } nohz ____cacheline_aligned;

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-19 21:45 [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability Tim Chen
  2011-10-20  4:18 ` Eric Dumazet
@ 2011-10-20  4:24 ` Andi Kleen
  2011-10-20 12:26   ` Venki Pallipadi
  1 sibling, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2011-10-20  4:24 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Suresh Siddha,
	Venki Pallipadi

Tim Chen <tim.c.chen@linux.intel.com> writes:
>  */
>  static struct {
> -	atomic_t load_balancer;
> -	atomic_t first_pick_cpu;
> -	atomic_t second_pick_cpu;
> -	cpumask_var_t idle_cpus_mask;
> +	atomic_t load_balancer ____cacheline_aligned;
> +	atomic_t first_pick_cpu ____cacheline_aligned;
> +	atomic_t second_pick_cpu ____cacheline_aligned;
> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;

On large configs idle_cpus_mask may be dynamically allocated. It may
need more changes to tell the allocator to cache align/pad that
allocation too?

-Andi

--
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20  4:24 ` Andi Kleen
@ 2011-10-20 12:26   ` Venki Pallipadi
  2011-10-20 17:31     ` Suresh Siddha
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Venki Pallipadi @ 2011-10-20 12:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Tim Chen, Ingo Molnar, Peter Zijlstra, linux-kernel, Suresh Siddha

On Wed, Oct 19, 2011 at 9:24 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Tim Chen <tim.c.chen@linux.intel.com> writes:
>>  */
>>  static struct {
>> -	atomic_t load_balancer;
>> -	atomic_t first_pick_cpu;
>> -	atomic_t second_pick_cpu;
>> -	cpumask_var_t idle_cpus_mask;
>> +	atomic_t load_balancer ____cacheline_aligned;
>> +	atomic_t first_pick_cpu ____cacheline_aligned;
>> +	atomic_t second_pick_cpu ____cacheline_aligned;
>> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
>
> On large configs idle_cpus_mask may be dynamically allocated. It may
> need more changes to tell the allocator to cache align/pad that
> allocation too?

An alternate approach is to split this struct per node/socket and do
the nohz idle balancing logic at that level. That should be more
scalable in terms of nohz balancing (it ensures one CPU won't be doing
nohz balancing for a huge number of idle CPUs). I looked at that
approach a couple of years ago and couldn't measure much of a gain.
Maybe it is time to revisit it with increased core counts.

Thanks,
Venki

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20 12:26   ` Venki Pallipadi
@ 2011-10-20 17:31     ` Suresh Siddha
  0 siblings, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-10-20 17:31 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Andi Kleen, Tim Chen, Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org

On Thu, 2011-10-20 at 05:26 -0700, Venki Pallipadi wrote:
> On Wed, Oct 19, 2011 at 9:24 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Tim Chen <tim.c.chen@linux.intel.com> writes:
> >>  */
> >>  static struct {
> >> -	atomic_t load_balancer;
> >> -	atomic_t first_pick_cpu;
> >> -	atomic_t second_pick_cpu;
> >> -	cpumask_var_t idle_cpus_mask;
> >> +	atomic_t load_balancer ____cacheline_aligned;
> >> +	atomic_t first_pick_cpu ____cacheline_aligned;
> >> +	atomic_t second_pick_cpu ____cacheline_aligned;
> >> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> >
> > On large configs idle_cpus_mask may be dynamically allocated. It may
> > need more changes to tell the allocator to cache align/pad that
> > allocation too?
>
> An alternate approach is to split this struct per node/socket and do
> the nohz idle balancing logic at that level. That should be more
> scalable in terms of nohz balancing (it ensures one CPU won't be doing
> nohz balancing for a huge number of idle CPUs). I looked at that
> approach a couple of years ago and couldn't measure much of a gain.
> Maybe it is time to revisit it with increased core counts.

I am actually looking at it. This is a quick fix for the meantime.

Even with this cache alignment, the first_pick_cpu and second_pick_cpu
cacheline bouncing still causes an ~8% performance impact on this
synthetic lat_ctx-heavy workload. For this experiment, to measure the
potential scope we have here, we left the cacheline bouncing associated
with the nohz.idle_cpus_mask update in place.

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  2011-10-20 12:26   ` Venki Pallipadi
  2011-10-20 17:31     ` Suresh Siddha
@ 2011-10-20 17:38     ` Peter Zijlstra
  [not found]           ` <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2011-10-20 17:38 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Andi Kleen, Tim Chen, Ingo Molnar, linux-kernel, Suresh Siddha

On Thu, 2011-10-20 at 05:26 -0700, Venki Pallipadi wrote:
> On Wed, Oct 19, 2011 at 9:24 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Tim Chen <tim.c.chen@linux.intel.com> writes:
> >>  */
> >>  static struct {
> >> -	atomic_t load_balancer;
> >> -	atomic_t first_pick_cpu;
> >> -	atomic_t second_pick_cpu;
> >> -	cpumask_var_t idle_cpus_mask;
> >> +	atomic_t load_balancer ____cacheline_aligned;
> >> +	atomic_t first_pick_cpu ____cacheline_aligned;
> >> +	atomic_t second_pick_cpu ____cacheline_aligned;
> >> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> >
> > On large configs idle_cpus_mask may be dynamically allocated. It may
> > need more changes to tell the allocator to cache align/pad that
> > allocation too?
>
> An alternate approach is to split this struct per node/socket and do
> the nohz idle balancing logic at that level. That should be more
> scalable in terms of nohz balancing (it ensures one CPU won't be doing
> nohz balancing for a huge number of idle CPUs). I looked at that
> approach a couple of years ago and couldn't measure much of a gain.
> Maybe it is time to revisit it with increased core counts.

Yeah, that would be best, although I remember there was a problem with
your approach back then; a fully idle node would not balance at all, or
something like that.

^ permalink raw reply	[flat|nested] 15+ messages in thread
[parent not found: <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>]
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
  [not found] ` <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>
@ 2011-11-01 23:52   ` Suresh Siddha
  2011-11-02 13:04     ` Srivatsa Vaddagiri
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-11-01 23:52 UTC (permalink / raw)
  To: Venki Pallipadi, Peter Zijlstra
  Cc: Andi Kleen, Tim Chen, Ingo Molnar, linux-kernel@vger.kernel.org

On Thu, 2011-10-20 at 10:31 -0700, Suresh Siddha wrote:
> On Thu, 2011-10-20 at 05:26 -0700, Venki Pallipadi wrote:
> > An alternate approach is to split this struct per node/socket and do
> > the nohz idle balancing logic at that level. That should be more
> > scalable in terms of nohz balancing (it ensures one CPU won't be doing
> > nohz balancing for a huge number of idle CPUs). I looked at that
> > approach a couple of years ago and couldn't measure much of a gain.
> > Maybe it is time to revisit it with increased core counts.

On Thu, 2011-10-20 at 10:38 -0700, Peter Zijlstra wrote:
> Yeah, that would be best, although I remember there was a problem with
> your approach back then; a fully idle node would not balance at all, or
> something like that.

That is correct. So to minimize the global updates I did two things:

1. Track the number of busy cpus in each scheduler group, for each
   scheduler domain.

2. Delay the update of nohz.idle_cpus_mask (and of the above-mentioned
   new sched-group-specific busy stat) when coming out of idle, to the
   first busy tick. And add idle_cpu() checks at a couple of places in
   the idle load balancing code to ensure that the cpu is really idle.

The kicked cpu will still do the idle load balancing on behalf of all
the cpus. We kick only when it is really needed.

If we add more smarts, then in the future we can split this into
multiple idle load balances, or kick the active load balance directly
for the HT/MC cases where we want to spread busy cores/threads. This is
open for the future, though.

> I am actually looking at it. This is a quick fix for the meantime.
>
> Even with this cache alignment, the first_pick_cpu and second_pick_cpu
> cacheline bouncing still causes an ~8% performance impact on this
> synthetic lat_ctx-heavy workload. For this experiment, to measure the
> potential scope we have here, we left the cacheline bouncing associated
> with the nohz.idle_cpus_mask update in place.

The appended patch helped improve that workload by 100%. We haven't
completed the testing of this patch but thought of getting quick
feedback. Also, I may have to split this into multiple patches (the
removal of nohz_stamp etc. is not related). Thanks.

---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: RFC: sched, nohz: sched group, domain aware nohz idle load balancing

Make nohz idle load balancing more scalable by making it aware of sched
groups/sched domains. Introduce nr_busy_cpus in struct
sched_group_power [not in sched_group, because sched groups are
duplicated for the SD_OVERLAP scheduler domain]; for each cpu that
enters and exits tickless mode, this parameter will be updated in each
scheduler group of the scheduler domain that the cpu belongs to.

To avoid frequent updates of this tickless state (the global nohz data
as well as the somewhat more local nr_busy_cpus in the sched_groups)
when the cpu enters and exits tickless idle often, the update of the
exit state is delayed to the first timer tick that happens after the
cpu becomes busy.
Idle load balance is kicked on one of the idle cpu's when there is atleast one idle cpu and - a busy rq having more than one task or - a busy scheduler group having multiple busy cpus that exceed the sched group power or - for the SD_ASYM_PACKING domain, if the lower numbered cpu's in that domain are idle compared to the busy ones. This will help in kicking the idle load balancing request only when there is a real imbalance. And once it is mostly balanced, these kicks will be minimized. These changes helped improve the workload that is context switch intensive between number of task pairs by 2x on a 8 socket NHM-EX based system. Reported-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> --- include/linux/sched.h | 4 + kernel/sched.c | 27 +++-- kernel/sched_fair.c | 264 ++++++++++++++++++------------------------------- 3 files changed, 116 insertions(+), 179 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 9c7a7a4..61e9980 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -901,6 +901,10 @@ struct sched_group_power { * single CPU. */ unsigned int power, power_orig; + /* + * Number of busy cpus in this group. + */ + atomic_t nr_busy_cpus; }; struct sched_group { diff --git a/kernel/sched.c b/kernel/sched.c index cf7f696..a8da213 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -600,8 +600,7 @@ struct rq { unsigned long cpu_load[CPU_LOAD_IDX_MAX]; unsigned long last_load_update_tick; #ifdef CONFIG_NO_HZ - u64 nohz_stamp; - unsigned char nohz_balance_kick; + unsigned int tick_stopped:1; #endif int skip_clock_update; @@ -1337,6 +1336,20 @@ static void resched_cpu(int cpu) } #ifdef CONFIG_NO_HZ +#define NOHZ_BALANCE_KICK 0 +#define NOHZ_NEED_BALANCING 1 +/* + * idle load balancing details + * - When one of the busy CPUs notice that there may be an idle rebalancing + * needed, they will kick the idle load balancer, which then does idle + * load balancing for all the idle CPUs. 
+ */ +static struct { + cpumask_var_t idle_cpus_mask; + unsigned long next_balance; /* in jiffy units */ + unsigned long bits; +} nohz ____cacheline_aligned; + /* * In the semi idle case, use the nearest busy cpu for migrating timers * from an idle cpu. This is good for power-savings. @@ -1406,7 +1419,8 @@ void wake_up_idle_cpu(int cpu) static inline bool got_nohz_idle_kick(void) { - return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick; + return idle_cpu(smp_processor_id()) && + test_bit(NOHZ_BALANCE_KICK, &nohz.bits); } #else /* CONFIG_NO_HZ */ @@ -8297,9 +8311,6 @@ void __init sched_init(void) rq->idle_stamp = 0; rq->avg_idle = 2*sysctl_sched_migration_cost; rq_attach_root(rq, &def_root_domain); -#ifdef CONFIG_NO_HZ - rq->nohz_balance_kick = 0; -#endif #endif init_rq_hrtick(rq); atomic_set(&rq->nr_iowait, 0); @@ -8344,10 +8355,6 @@ void __init sched_init(void) zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT); #ifdef CONFIG_NO_HZ zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); - alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT); - atomic_set(&nohz.load_balancer, nr_cpu_ids); - atomic_set(&nohz.first_pick_cpu, nr_cpu_ids); - atomic_set(&nohz.second_pick_cpu, nr_cpu_ids); #endif /* May be allocated at isolcpus cmdline parse time */ if (cpu_isolated_map == NULL) diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5c9e679..d18a1de 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -3317,6 +3317,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) cpu_rq(cpu)->cpu_power = power; sdg->sgp->power = power; + atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight); } static void update_group_power(struct sched_domain *sd, int cpu) @@ -3339,6 +3340,7 @@ static void update_group_power(struct sched_domain *sd, int cpu) } while (group != child->groups); sdg->sgp->power = power; + atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight); } /* @@ -4269,30 +4271,6 @@ out_unlock: } #ifdef CONFIG_NO_HZ -/* - * idle load 
balancing details - * - One of the idle CPUs nominates itself as idle load_balancer, while - * entering idle. - * - This idle load balancer CPU will also go into tickless mode when - * it is idle, just like all other idle CPUs - * - When one of the busy CPUs notice that there may be an idle rebalancing - * needed, they will kick the idle load balancer, which then does idle - * load balancing for all the idle CPUs. - */ -static struct { - atomic_t load_balancer; - atomic_t first_pick_cpu; - atomic_t second_pick_cpu; - cpumask_var_t idle_cpus_mask; - cpumask_var_t grp_idle_mask; - unsigned long next_balance; /* in jiffy units */ -} nohz ____cacheline_aligned; - -int get_nohz_load_balancer(void) -{ - return atomic_read(&nohz.load_balancer); -} - #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) /** * lowest_flag_domain - Return lowest sched_domain containing flag. @@ -4329,33 +4307,6 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) (sd && (sd->flags & flag)); sd = sd->parent) /** - * is_semi_idle_group - Checks if the given sched_group is semi-idle. - * @ilb_group: group to be checked for semi-idleness - * - * Returns: 1 if the group is semi-idle. 0 otherwise. - * - * We define a sched_group to be semi idle if it has atleast one idle-CPU - * and atleast one non-idle CPU. This helper function checks if the given - * sched_group is semi-idle or not. - */ -static inline int is_semi_idle_group(struct sched_group *ilb_group) -{ - cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask, - sched_group_cpus(ilb_group)); - - /* - * A sched_group is semi-idle when it has atleast one busy cpu - * and atleast one idle cpu. - */ - if (cpumask_empty(nohz.grp_idle_mask)) - return 0; - - if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group))) - return 0; - - return 1; -} -/** * find_new_ilb - Finds the optimum idle load balancer for nomination. * @cpu: The cpu which is nominating a new idle_load_balancer. 
* @@ -4369,9 +4320,9 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group) */ static int find_new_ilb(int cpu) { + int ilb = cpumask_first(nohz.idle_cpus_mask); + struct sched_group *ilbg; struct sched_domain *sd; - struct sched_group *ilb_group; - int ilb = nr_cpu_ids; /* * Have idle load balancer selection from semi-idle packages only @@ -4389,23 +4340,28 @@ static int find_new_ilb(int cpu) rcu_read_lock(); for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) { - ilb_group = sd->groups; + ilbg = sd->groups; do { - if (is_semi_idle_group(ilb_group)) { - ilb = cpumask_first(nohz.grp_idle_mask); + if (ilbg->group_weight != + atomic_read(&ilbg->sgp->nr_busy_cpus)) { + ilb = cpumask_first_and(nohz.idle_cpus_mask, + sched_group_cpus(ilbg)); goto unlock; } - ilb_group = ilb_group->next; + ilbg = ilbg->next; - } while (ilb_group != sd->groups); + } while (ilbg != sd->groups); } unlock: rcu_read_unlock(); out_done: - return ilb; + if (ilb < nr_cpu_ids && idle_cpu(ilb)) + return ilb; + + return nr_cpu_ids; } #else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */ static inline int find_new_ilb(int call_cpu) @@ -4423,102 +4379,52 @@ static void nohz_balancer_kick(int cpu) { int ilb_cpu; - nohz.next_balance++; - - ilb_cpu = get_nohz_load_balancer(); + ilb_cpu = find_new_ilb(cpu); if (ilb_cpu >= nr_cpu_ids) { - ilb_cpu = cpumask_first(nohz.idle_cpus_mask); - if (ilb_cpu >= nr_cpu_ids) - return; + clear_bit(NOHZ_BALANCE_KICK, &nohz.bits); + return; } - if (!cpu_rq(ilb_cpu)->nohz_balance_kick) { - cpu_rq(ilb_cpu)->nohz_balance_kick = 1; - - smp_mb(); - /* - * Use smp_send_reschedule() instead of resched_cpu(). - * This way we generate a sched IPI on the target cpu which - * is idle. And the softirq performing nohz idle load balance - * will be run before returning from the IPI. - */ - smp_send_reschedule(ilb_cpu); - } + /* + * Use smp_send_reschedule() instead of resched_cpu(). + * This way we generate a sched IPI on the target cpu which + * is idle. 
And the softirq performing nohz idle load balance + * will be run before returning from the IPI. + */ + smp_send_reschedule(ilb_cpu); return; } -/* - * This routine will try to nominate the ilb (idle load balancing) - * owner among the cpus whose ticks are stopped. ilb owner will do the idle - * load balancing on behalf of all those cpus. - * - * When the ilb owner becomes busy, we will not have new ilb owner until some - * idle CPU wakes up and goes back to idle or some busy CPU tries to kick - * idle load balancing by kicking one of the idle CPUs. - * - * Ticks are stopped for the ilb owner as well, with busy CPU kicking this - * ilb owner CPU in future (when there is a need for idle load balancing on - * behalf of all idle CPUs). - */ void select_nohz_load_balancer(int stop_tick) { int cpu = smp_processor_id(); + struct sched_domain *sd; if (stop_tick) { - if (!cpu_active(cpu)) { - if (atomic_read(&nohz.load_balancer) != cpu) - return; - - /* - * If we are going offline and still the leader, - * give up! - */ - if (atomic_cmpxchg(&nohz.load_balancer, cpu, - nr_cpu_ids) != cpu) - BUG(); - + if (this_rq()->tick_stopped) return; - } - cpumask_set_cpu(cpu, nohz.idle_cpus_mask); - if (atomic_read(&nohz.first_pick_cpu) == cpu) - atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids); - if (atomic_read(&nohz.second_pick_cpu) == cpu) - atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids); - - if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) { - int new_ilb; + this_rq()->tick_stopped = 1; - /* make me the ilb owner */ - if (atomic_cmpxchg(&nohz.load_balancer, nr_cpu_ids, - cpu) != nr_cpu_ids) - return; + /* + * Indicate the idle state to all the scheduler groups that + * this cpu is part of. + */ + for_each_domain(cpu, sd) + atomic_dec(&sd->groups->sgp->nr_busy_cpus); - /* - * Check to see if there is a more power-efficient - * ilb. 
- */ - new_ilb = find_new_ilb(cpu); - if (new_ilb < nr_cpu_ids && new_ilb != cpu) { - atomic_set(&nohz.load_balancer, nr_cpu_ids); - resched_cpu(new_ilb); - return; - } - return; - } - } else { - if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) + if (test_bit(NOHZ_NEED_BALANCING, &nohz.bits)) return; - cpumask_clear_cpu(cpu, nohz.idle_cpus_mask); - - if (atomic_read(&nohz.load_balancer) == cpu) - if (atomic_cmpxchg(&nohz.load_balancer, cpu, - nr_cpu_ids) != cpu) - BUG(); + /* + * Indicate that there is atleast one cpu in tickless mode + * and hence the possible need for NOHZ idle load balancing. + */ + set_bit(NOHZ_NEED_BALANCING, &nohz.bits); } + return; } #endif @@ -4623,11 +4529,11 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) struct rq *rq; int balance_cpu; - if (idle != CPU_IDLE || !this_rq->nohz_balance_kick) + if (idle != CPU_IDLE || !(test_bit(NOHZ_BALANCE_KICK, &nohz.bits))) return; for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { - if (balance_cpu == this_cpu) + if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) continue; /* @@ -4636,7 +4542,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) * balancing owner will pick it up. */ if (need_resched()) { - this_rq->nohz_balance_kick = 0; + clear_bit(NOHZ_BALANCE_KICK, &nohz.bits); break; } @@ -4652,53 +4558,73 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) this_rq->next_balance = rq->next_balance; } nohz.next_balance = this_rq->next_balance; - this_rq->nohz_balance_kick = 0; + clear_bit(NOHZ_BALANCE_KICK, &nohz.bits); } /* - * Current heuristic for kicking the idle load balancer - * - first_pick_cpu is the one of the busy CPUs. It will kick - * idle load balancer when it has more than one process active. This - * eliminates the need for idle load balancing altogether when we have - * only one running process in the system (common case). 
- * - If there are more than one busy CPU, idle load balancer may have
- *   to run for active_load_balance to happen (i.e., two busy CPUs are
- *   SMT or core siblings and can run better if they move to different
- *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
- *   which will kick idle load balancer as soon as it has any load.
+ * Current heuristic for kicking the idle load balancer in the presence
+ * of an idle cpu is the system.
+ * - This rq has more than one task.
+ * - At any scheduler domain level, this cpu's scheduler group has multiple
+ *   busy cpu's exceeding the group's power.
+ * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
+ *   domain span are idle.
  */
 static inline int nohz_kick_needed(struct rq *rq, int cpu)
 {
 	unsigned long now = jiffies;
-	int ret;
-	int first_pick_cpu, second_pick_cpu;
-
-	if (time_before(now, nohz.next_balance))
-		return 0;
+	struct sched_domain *sd;

 	if (idle_cpu(cpu))
 		return 0;

-	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
-	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
+	/*
+	 * We were recently in tickless idle mode. We will do the delayed
+	 * update of busy mode now (first busy tick after returning from idle).
+	 */
+	if (unlikely(rq->tick_stopped)) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+
+		if (cpumask_bits(nohz.idle_cpus_mask)[BIT_WORD(cpu)] == 0 &&
+		    cpumask_empty(nohz.idle_cpus_mask))
+			clear_bit(NOHZ_NEED_BALANCING, &nohz.bits);
+
+		rq->tick_stopped = 0;
+		for_each_domain(cpu, sd)
+			atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+	}

-	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
-	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
+	if (likely(!test_bit(NOHZ_NEED_BALANCING, &nohz.bits)))
 		return 0;

-	ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
-	if (ret == nr_cpu_ids || ret == cpu) {
-		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
-		if (rq->nr_running > 1)
-			return 1;
-	} else {
-		ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
-		if (ret == nr_cpu_ids || ret == cpu) {
-			if (rq->nr_running)
-				return 1;
-		}
+	if (rq->nr_running >= 2)
+		goto need_kick;
+
+	for_each_domain(cpu, sd) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group_power *sgp = sg->sgp;
+		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+		if (nr_busy > 1 && (nr_busy * SCHED_LOAD_SCALE > sgp->power))
+			goto need_kick;
+
+		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
+		    && (cpumask_first_and(nohz.idle_cpus_mask,
+					  sched_domain_span(sd)) < cpu))
+			goto need_kick;
 	}
+	return 0;
+
+need_kick:
+	if (time_before(now, nohz.next_balance) ||
+	    test_bit(NOHZ_BALANCE_KICK, &nohz.bits))
+		return 0;
+
+	if (test_and_set_bit(NOHZ_BALANCE_KICK, &nohz.bits))
+		return 0;
+
+	return 1;
 }
 #else
 static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }

^ permalink raw reply related	[flat|nested] 15+ messages in thread
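The kick decision in the new nohz_kick_needed() above can be sketched as a small user-space model. Everything here is a stand-in: `struct group_state` abbreviates the kernel's `struct sched_group_power`, `SCHED_LOAD_SCALE` is assumed to be 1024 as in kernels of this era, and the SD_ASYM_PACKING case is omitted for brevity.

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024

/* Stand-in for struct sched_group_power: busy count and group capacity. */
struct group_state {
	int nr_busy_cpus;	/* busy cpus currently in this group */
	int power;		/* capacity; SCHED_LOAD_SCALE per full cpu */
};

/*
 * Should a busy cpu kick the idle load balancer?  Mirrors the heuristic
 * above: never from an idle cpu; yes when this rq has two or more tasks;
 * yes when some group's busy cpus exceed the group's power.
 */
static int kick_needed(int cpu_is_idle, int nr_running,
		       const struct group_state *groups, int ngroups)
{
	int i;

	if (cpu_is_idle)
		return 0;
	if (nr_running >= 2)
		return 1;

	for (i = 0; i < ngroups; i++) {
		int nr_busy = groups[i].nr_busy_cpus;

		if (nr_busy > 1 && nr_busy * SCHED_LOAD_SCALE > groups[i].power)
			return 1;
	}
	return 0;
}
```

With an SMT core whose two siblings share roughly 1178 units of power, two busy siblings trip the check (2048 > 1178) and a kick happens; two busy full cores (2048 units of power) do not, since neither cpu is overloaded.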
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-01 23:52 ` Suresh Siddha
@ 2011-11-02 13:04 ` Srivatsa Vaddagiri
2011-11-02 13:54 ` Srivatsa Vaddagiri
2011-11-14 9:32 ` Peter Zijlstra
2 siblings, 0 replies; 15+ messages in thread
From: Srivatsa Vaddagiri @ 2011-11-02 13:04 UTC (permalink / raw)
To: Suresh Siddha
Cc: Venki Pallipadi, Peter Zijlstra, Andi Kleen, Tim Chen,
	Ingo Molnar, linux-kernel@vger.kernel.org

* Suresh Siddha <suresh.b.siddha@intel.com> [2011-11-01 16:52:38]:

> That is correct. So to minimize the global updates I did two things.

Another issue I am trying to solve with the nohz balancer is the
staleness of nohz.next_balance. As cpus enter/exit tickless state,
nohz.next_balance is not updated to reflect that. I have found that it's
a predominant source of increased idle time with the CFS bandwidth
patches:

https://lkml.org/lkml/2011/9/26/162

While it may be costly to update nohz.next_balance during the exit
event, I think we should update it upon entry.

I am working on a patch to make nohz.next_balance more precise. This
requires tickless idle cpus to be tracked in a global rb-tree (each
tickless idle cpu is inserted in this tree, indexed by its
rq->next_balance). Having this rb-tree makes it easy to precisely
determine min(rq->next_balance) when a kick is deserved, which I am
hoping is good both for performance (on-the-dot wakeup) and power
(avoiding unnecessary/early wakeups).

Draft patch is below. I was intending to post it soon along with some
real-world benchmark numbers (based on matrix multiplication).

Any other suggestions to avoid this global rb-tree, but still keep
nohz.next_balance as precise as possible, would be welcome!
Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 Makefile                |    2
 kernel/sched.c          |    9 ++
 kernel/sched_fair.c     |  185 +++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_idletask.c |    2
 4 files changed, 179 insertions(+), 19 deletions(-)

Index: current/Makefile
===================================================================
--- current.orig/Makefile
+++ current/Makefile
@@ -1,6 +1,6 @@
 VERSION = 3
 PATCHLEVEL = 1
-SUBLEVEL = 0
+SUBLEVEL = 0-myp
 EXTRAVERSION =
 NAME = "Divemaster Edition"

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -602,6 +602,7 @@ struct rq {
 #ifdef CONFIG_NO_HZ
 	u64 nohz_stamp;
 	unsigned char nohz_balance_kick;
+	struct rb_node nohz_node;
 #endif
 	int skip_clock_update;
@@ -1409,6 +1410,8 @@ static inline bool got_nohz_idle_kick(vo
 	return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
 }

+static void reset_first_second_pick_cpu(int cpu);
+
 #else /* CONFIG_NO_HZ */

 static inline bool got_nohz_idle_kick(void)
@@ -1416,6 +1419,8 @@ static inline bool got_nohz_idle_kick(vo
 	return false;
 }

+static inline void reset_first_second_pick_cpu(int cpu) { }
+
 #endif /* CONFIG_NO_HZ */

 static u64 sched_avg_period(void)
@@ -8299,6 +8304,7 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
+		rb_init_node(&rq->nohz_node);
 #endif
 #endif
 		init_rq_hrtick(rq);
@@ -8344,10 +8350,13 @@ void __init sched_init(void)
 	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&nohz.stale_idle_cpus_mask, GFP_NOWAIT);
 	alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
 	atomic_set(&nohz.load_balancer, nr_cpu_ids);
 	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
 	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
+	nohz.rq_next_balance = RB_ROOT;
+	spin_lock_init(&nohz.next_balance_lock);
 #endif
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -4285,7 +4285,11 @@ static struct {
 	atomic_t second_pick_cpu;
 	cpumask_var_t idle_cpus_mask;
 	cpumask_var_t grp_idle_mask;
+	cpumask_var_t stale_idle_cpus_mask;
 	unsigned long next_balance;	/* in jiffy units */
+	struct rb_root rq_next_balance;
+	struct rb_node *rb_leftmost;
+	spinlock_t next_balance_lock;
 } nohz ____cacheline_aligned;

 int get_nohz_load_balancer(void)
@@ -4427,8 +4431,12 @@ static void nohz_balancer_kick(int cpu)

 	ilb_cpu = get_nohz_load_balancer();

-	if (ilb_cpu >= nr_cpu_ids) {
-		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+	/*
+	 * ilb_cpu itself can be attempting to kick another idle cpu. Pick
+	 * another idle cpu in that case.
+	 */
+	if (ilb_cpu >= nr_cpu_ids || ilb_cpu == cpu) {
+		ilb_cpu = cpumask_any_but(nohz.idle_cpus_mask, cpu);

 		if (ilb_cpu >= nr_cpu_ids)
 			return;
 	}
@@ -4449,6 +4457,122 @@ static void nohz_balancer_kick(int cpu)
 }

 /*
+ * Lazy dequeue of no-longer-tickless-idle cpu from rb-tree.
+ *
+ * This is done lazily, as otherwise taking a spinlock (nohz.next_balance_lock)
+ * to do immediate update of rb-tree can slowdown transition out of tickless
+ * state (and thus may hurt scheduling latencies for a task waking on a tickless
+ * idle cpu).
+ */
+static void __dequeue_nohz_idle_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu), *entry;
+
+	WARN_ON(RB_EMPTY_NODE(&rq->nohz_node));
+
+	if (nohz.rb_leftmost == &rq->nohz_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&rq->nohz_node);
+		nohz.rb_leftmost = next_node;
+		if (next_node) {
+			entry = rb_entry(next_node, struct rq, nohz_node);
+
+			nohz.next_balance = entry->next_balance;
+		}
+	}
+
+	rb_erase(&rq->nohz_node, &nohz.rq_next_balance);
+	RB_CLEAR_NODE(&rq->nohz_node);
+}
+
+static void __enqueue_nohz_idle_cpu(int cpu)
+{
+	struct rb_node **link = &nohz.rq_next_balance.rb_node;
+	struct rb_node *parent = NULL;
+	struct rq *entry;
+	struct rq *rq = cpu_rq(cpu);
+	int leftmost = 1;
+
+	/*
+	 * Find the right place in the rbtree:
+	 */
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rq, nohz_node);
+		/*
+		 * We dont care about collisions. Nodes with
+		 * the same key stay together.
+		 */
+		if (time_before(rq->next_balance, entry->next_balance)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used):
+	 */
+	if (leftmost) {
+		nohz.rb_leftmost = &rq->nohz_node;
+		nohz.next_balance = rq->next_balance;
+	}
+
+	rb_link_node(&rq->nohz_node, parent, link);
+	rb_insert_color(&rq->nohz_node, &nohz.rq_next_balance);
+}
+
+static void __clear_stale_nohz_idle_cpus(void)
+{
+	int stale_idle_cpu;
+
+	for_each_cpu(stale_idle_cpu, nohz.stale_idle_cpus_mask) {
+		__dequeue_nohz_idle_cpu(stale_idle_cpu);
+		cpumask_clear_cpu(stale_idle_cpu, nohz.stale_idle_cpus_mask);
+	}
+}
+
+/*
+ * Remove no-longer-tickless-idle cpus from rb-tree.
+ */
+static void clear_stale_nohz_idle_cpus(void)
+{
+	spin_lock(&nohz.next_balance_lock);
+	__clear_stale_nohz_idle_cpus();
+	spin_unlock(&nohz.next_balance_lock);
+}
+
+/*
+ * Enqueue a idle cpu going tickless in a rb-tree, indexed by its
+ * rq->next_balance value.
+ */
+static void enqueue_nohz_idle_cpu(int cpu)
+{
+	spin_lock(&nohz.next_balance_lock);
+	if (cpumask_test_cpu(cpu, nohz.stale_idle_cpus_mask)) {
+		__dequeue_nohz_idle_cpu(cpu);
+		cpumask_clear_cpu(cpu, nohz.stale_idle_cpus_mask);
+	}
+	__enqueue_nohz_idle_cpu(cpu);
+	spin_unlock(&nohz.next_balance_lock);
+}
+
+/*
+ * Dequeue and enqueue a tickless idle cpu indexed at its new rq->next_balance
+ * value.
+ */
+static void re_enqueue_nohz_idle_cpu(int cpu)
+{
+	spin_lock(&nohz.next_balance_lock);
+	__dequeue_nohz_idle_cpu(cpu);
+	__enqueue_nohz_idle_cpu(cpu);
+	spin_unlock(&nohz.next_balance_lock);
+}
+
+/*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
  * load balancing on behalf of all those cpus.
@@ -4481,12 +4605,9 @@ void select_nohz_load_balancer(int stop_
 			return;
 		}

-		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
+		enqueue_nohz_idle_cpu(cpu);

-		if (atomic_read(&nohz.first_pick_cpu) == cpu)
-			atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
-		if (atomic_read(&nohz.second_pick_cpu) == cpu)
-			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
+		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);

 		if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
 			int new_ilb;
@@ -4512,6 +4633,7 @@ void select_nohz_load_balancer(int stop_
 		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 			return;

+		cpumask_set_cpu(cpu, nohz.stale_idle_cpus_mask);
 		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);

 		if (atomic_read(&nohz.load_balancer) == cpu)
@@ -4622,10 +4744,20 @@ static void nohz_idle_balance(int this_c
 	struct rq *this_rq = cpu_rq(this_cpu);
 	struct rq *rq;
 	int balance_cpu;
+	unsigned long old_next_balance;

-	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+	if (!this_rq->nohz_balance_kick)
 		return;

+	/* Wakeup another idle cpu to do idle load balance if we got busy */
+	if (!idle_cpu(this_cpu)) {
+		nohz_balancer_kick(this_cpu);
+		goto out;
+	}
+
+	/* Remove no-longer-tickless-idle cpus from rb-tree */
+	clear_stale_nohz_idle_cpus();
+
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu)
 			continue;
@@ -4636,8 +4768,8 @@ static void nohz_idle_balance(int this_c
 		 * balancing owner will pick it up.
 		 */
 		if (need_resched()) {
-			this_rq->nohz_balance_kick = 0;
-			break;
+			nohz_balancer_kick(this_cpu);
+			goto out;
 		}

 		raw_spin_lock_irq(&this_rq->lock);
@@ -4645,13 +4777,15 @@ static void nohz_idle_balance(int this_c
 		update_cpu_load(this_rq);
 		raw_spin_unlock_irq(&this_rq->lock);

-		rebalance_domains(balance_cpu, CPU_IDLE);
-
 		rq = cpu_rq(balance_cpu);
-		if (time_after(this_rq->next_balance, rq->next_balance))
-			this_rq->next_balance = rq->next_balance;
+		old_next_balance = rq->next_balance;
+		if (time_after_eq(jiffies, old_next_balance)) {
+			rebalance_domains(balance_cpu, CPU_IDLE);
+			if (rq->next_balance != old_next_balance)
+				re_enqueue_nohz_idle_cpu(balance_cpu);
+		}
 	}
-	nohz.next_balance = this_rq->next_balance;
+out:
 	this_rq->nohz_balance_kick = 0;
 }
@@ -4740,7 +4889,7 @@ static inline void trigger_load_balance( likely(!on_null_domain(cpu))) raise_softirq(SCHED_SOFTIRQ); #ifdef CONFIG_NO_HZ - else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu))) + if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu))) nohz_balancer_kick(cpu); #endif } Index: current/kernel/sched_idletask.c =================================================================== --- current.orig/kernel/sched_idletask.c +++ current/kernel/sched_idletask.c @@ -24,6 +24,8 @@ static struct task_struct *pick_next_tas { schedstat_inc(rq, sched_goidle); calc_load_account_idle(rq); + reset_first_second_pick_cpu(cpu_of(rq)); + return rq->idle; } ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-01 23:52 ` Suresh Siddha
2011-11-02 13:04 ` Srivatsa Vaddagiri
@ 2011-11-02 13:54 ` Srivatsa Vaddagiri
2011-11-02 15:13 ` Suresh Siddha
2011-11-14 9:32 ` Peter Zijlstra
2 siblings, 1 reply; 15+ messages in thread
From: Srivatsa Vaddagiri @ 2011-11-02 13:54 UTC (permalink / raw)
To: Suresh Siddha
Cc: Venki Pallipadi, Peter Zijlstra, Andi Kleen, Tim Chen,
	Ingo Molnar, linux-kernel@vger.kernel.org

* Suresh Siddha <suresh.b.siddha@intel.com> [2011-11-01 16:52:38]:

> +	/*
> +	 * We were recently in tickless idle mode. We will do the delayed
> +	 * update of busy mode now (first busy tick after returning from idle).
> +	 */
> +	if (unlikely(rq->tick_stopped)) {
> +		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> +
> +		if (cpumask_bits(nohz.idle_cpus_mask)[BIT_WORD(cpu)] == 0 &&
> +		    cpumask_empty(nohz.idle_cpus_mask))
> +			clear_bit(NOHZ_NEED_BALANCING, &nohz.bits);

Can't this clear_bit race with set_bit() in select_nohz_load_balancer()?

	CPU0				CPU1

	cpumask_clear_cpu()
					if (...)
						cpumask_set_cpu();
					set_bit();

	clear_bit();

?

- vatsa

^ permalink raw reply	[flat|nested] 15+ messages in thread
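The window Vatsa describes can be replayed deterministically by serializing the interleaving into one thread. The globals below are stand-ins for nohz.idle_cpus_mask and the NOHZ_NEED_BALANCING bit; this is a sketch of the lost-update scenario, not kernel code.

```c
#include <assert.h>

static unsigned int idle_cpus_mask;	/* stand-in for nohz.idle_cpus_mask */
static int need_balancing;		/* stand-in for NOHZ_NEED_BALANCING */

/* Returns 1 if the lost-update bug reproduces. */
static int replay_race(void)
{
	idle_cpus_mask = 1u << 0;	/* cpu0 is the only tickless cpu */
	need_balancing = 1;

	/* CPU0, first busy tick: leaves tickless mode */
	idle_cpus_mask &= ~(1u << 0);		/* cpumask_clear_cpu()  */
	int saw_empty = (idle_cpus_mask == 0);	/* emptiness check      */

	/* CPU1 goes tickless idle in between */
	idle_cpus_mask |= 1u << 1;		/* cpumask_set_cpu()    */
	need_balancing = 1;			/* set_bit()            */

	/* CPU0 now acts on its stale emptiness check */
	if (saw_empty)
		need_balancing = 0;		/* clear_bit()          */

	/* cpu1 is tickless idle, yet NEED_BALANCING ended up clear */
	return idle_cpus_mask != 0 && need_balancing == 0;
}
```

The check and the clear are not one atomic step, so any writer that slips between them can have its set_bit() wiped out, leaving a tickless cpu that no busy cpu will ever kick.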
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-02 13:54 ` Srivatsa Vaddagiri
@ 2011-11-02 15:13 ` Suresh Siddha
0 siblings, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-11-02 15:13 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: Venki Pallipadi, Peter Zijlstra, Andi Kleen, Tim Chen,
	Ingo Molnar, linux-kernel@vger.kernel.org

On Wed, 2011-11-02 at 06:54 -0700, Srivatsa Vaddagiri wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> [2011-11-01 16:52:38]:
>
> > +	/*
> > +	 * We were recently in tickless idle mode. We will do the delayed
> > +	 * update of busy mode now (first busy tick after returning from idle).
> > +	 */
> > +	if (unlikely(rq->tick_stopped)) {
> > +		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > +
> > +		if (cpumask_bits(nohz.idle_cpus_mask)[BIT_WORD(cpu)] == 0 &&
> > +		    cpumask_empty(nohz.idle_cpus_mask))
> > +			clear_bit(NOHZ_NEED_BALANCING, &nohz.bits);
>
> Can't this clear_bit race with set_bit() in select_nohz_load_balancer()?
>
> 	CPU0				CPU1
>
> 	cpumask_clear_cpu()
> 					if (...)
> 						cpumask_set_cpu();
> 					set_bit();
>
> 	clear_bit();
>
> ?

Yes. All I want is a quick way of getting
cpumask_weight(nohz.idle_cpus_mask), so that the busy cpu need not spend
much time to see if there is an idle cpu that needs idle load balancing.

Let me see if there is any simple way, or else we need nohz.nr_idle_cpus.

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
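The nohz.nr_idle_cpus idea mentioned here amounts to keeping an atomic counter next to the mask, so the busy-tick fast path does one read instead of a cpumask_weight() scan. A minimal C11 sketch of that bookkeeping (the names are hypothetical, and the mask/counter pair is simplified to a single word):

```c
#include <assert.h>
#include <stdatomic.h>

static unsigned int idle_cpus_mask;	/* stand-in for nohz.idle_cpus_mask */
static atomic_int nr_idle_cpus;		/* the proposed nohz.nr_idle_cpus */

static void cpu_enter_tickless(int cpu)
{
	idle_cpus_mask |= 1u << cpu;
	atomic_fetch_add(&nr_idle_cpus, 1);	/* counter tracks the mask */
}

static void cpu_exit_tickless(int cpu)
{
	idle_cpus_mask &= ~(1u << cpu);
	atomic_fetch_sub(&nr_idle_cpus, 1);
}

/* Busy-tick fast path: one atomic load instead of scanning the mask. */
static int nohz_idle_cpus_present(void)
{
	return atomic_load(&nr_idle_cpus) > 0;
}
```

Because each cpu increments exactly when it sets its own mask bit and decrements exactly when it clears it, the counter cannot suffer the check-then-clear window of the emptiness test: there is no separate "is it empty" read to go stale.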
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-01 23:52 ` Suresh Siddha
2011-11-02 13:04 ` Srivatsa Vaddagiri
2011-11-02 13:54 ` Srivatsa Vaddagiri
@ 2011-11-14 9:32 ` Peter Zijlstra
2011-11-14 19:37 ` Suresh Siddha
2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2011-11-14 9:32 UTC (permalink / raw)
To: Suresh Siddha
Cc: Venki Pallipadi, Andi Kleen, Tim Chen, Ingo Molnar,
	linux-kernel@vger.kernel.org

On Tue, 2011-11-01 at 16:52 -0700, Suresh Siddha wrote:
> @@ -3317,6 +3317,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
>
>  	cpu_rq(cpu)->cpu_power = power;
>  	sdg->sgp->power = power;
> +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
>  }
>
>  static void update_group_power(struct sched_domain *sd, int cpu)
> @@ -3339,6 +3340,7 @@ static void update_group_power(struct sched_domain *sd, int cpu)
>  	} while (group != child->groups);
>
>  	sdg->sgp->power = power;
> +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
>  }

So we run this rather frequently, and it will trample all over:

> +	 */
> +	for_each_domain(cpu, sd)
> +		atomic_dec(&sd->groups->sgp->nr_busy_cpus);

because I cannot see any serialization between those sites.

Also, isn't it rather weird to just assume all cpus are busy in
update_group_power()? If you would actually set the right value in
update_cpu_power(), you could use a straight sum in update_group_power()
and get a more or less accurate number out.

^ permalink raw reply	[flat|nested] 15+ messages in thread
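Peter's "straight sum" suggestion is: have update_cpu_power() record each cpu's own busy state, and let update_group_power() sum those records instead of resetting nr_busy_cpus to group_weight. The summing step can be sketched as follows (illustrative only; the per-cpu flag array stands in for state the kernel would keep per rq):

```c
#include <assert.h>

/*
 * Straight sum of per-cpu busy flags across a scheduler group, rather
 * than assuming every cpu in the group is busy (group_weight).
 */
static int group_nr_busy(const int cpu_busy[], int group_weight)
{
	int i, nr_busy = 0;

	for (i = 0; i < group_weight; i++)
		nr_busy += cpu_busy[i] ? 1 : 0;
	return nr_busy;
}
```

The sum reflects the actual busy count at refresh time, so a periodic group-power update no longer stomps on the increments and decrements done by cpus entering and leaving tickless idle.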
* Re: [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability
2011-11-14 9:32 ` Peter Zijlstra
@ 2011-11-14 19:37 ` Suresh Siddha
0 siblings, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2011-11-14 19:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Venki Pallipadi, Andi Kleen, Tim Chen, Ingo Molnar,
	linux-kernel@vger.kernel.org

On Mon, 2011-11-14 at 01:32 -0800, Peter Zijlstra wrote:
> On Tue, 2011-11-01 at 16:52 -0700, Suresh Siddha wrote:
> > @@ -3317,6 +3317,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
> >
> >  	cpu_rq(cpu)->cpu_power = power;
> >  	sdg->sgp->power = power;
> > +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
> >  }
> >
> >  static void update_group_power(struct sched_domain *sd, int cpu)
> > @@ -3339,6 +3340,7 @@ static void update_group_power(struct sched_domain *sd, int cpu)
> >  	} while (group != child->groups);
> >
> >  	sdg->sgp->power = power;
> > +	atomic_set(&sdg->sgp->nr_busy_cpus, sdg->group_weight);
> >  }
>
> So we run this rather frequently, and it will trample all over:
>
> > +	 */
> > +	for_each_domain(cpu, sd)
> > +		atomic_dec(&sd->groups->sgp->nr_busy_cpus);
>
> because I cannot see any serialization between those sites.

That was an oversight on my side. I have now moved the initialization of
this to init_sched_groups_power(), so there is no need to redo it every
time we call update_group_power().

> Also, isn't it rather weird to just assume all cpus are busy in
> update_group_power()? If you would actually set the right value in
> update_cpu_power() you could use a straight sum in update_group_power()
> and get a more or less accurate number out.

I will post an updated patch soon (once we complete the performance
analysis of this patch with respect to other workloads), but the hunk
below gives an idea of what I am planning to do now.
@@ -7369,6 +7384,7 @@ static void init_sched_groups_power(int cpu, struct sched_
 		return;

 	update_group_power(sd, cpu);
+	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }

 /*

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2011-11-14 19:34 UTC | newest]
Thread overview: 15+ messages
2011-10-19 21:45 [Patch] Idle balancer: cache align nohz structure to improve idle load balancing scalability Tim Chen
2011-10-20 4:18 ` Eric Dumazet
2011-10-20 5:57 ` Suresh Siddha
2011-10-20 6:43 ` Eric Dumazet
2011-10-20 17:19 ` Tim Chen
2011-10-20 4:24 ` Andi Kleen
2011-10-20 12:26 ` Venki Pallipadi
2011-10-20 17:31 ` Suresh Siddha
2011-10-20 17:38 ` Peter Zijlstra
[not found] ` <4FF5AC937153B0459463C1A88EB478F20135D6ECB5@orsmsx505.amr.corp.intel.com>
2011-11-01 23:52 ` Suresh Siddha
2011-11-02 13:04 ` Srivatsa Vaddagiri
2011-11-02 13:54 ` Srivatsa Vaddagiri
2011-11-02 15:13 ` Suresh Siddha
2011-11-14 9:32 ` Peter Zijlstra
2011-11-14 19:37 ` Suresh Siddha