* [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq
2011-12-02 1:07 [patch v3 0/6] nohz idle load balancing patches Suresh Siddha
@ 2011-12-02 1:07 ` Suresh Siddha
2011-12-06 9:53 ` [tip:sched/core] sched, nohz: Introduce nohz_flags in 'struct rq' tip-bot for Suresh Siddha
2011-12-06 12:14 ` [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq Srivatsa Vaddagiri
2011-12-02 1:07 ` [patch v3 2/6] sched, nohz: track nr_busy_cpus in the sched_group_power Suresh Siddha
` (4 subsequent siblings)
5 siblings, 2 replies; 30+ messages in thread
From: Suresh Siddha @ 2011-12-02 1:07 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
Mike Galbraith
Cc: linux-kernel, Tim Chen, alex.shi, Suresh Siddha
[-- Attachment #1: introduce_rq_nohz_flags.patch --]
[-- Type: text/plain, Size: 5545 bytes --]
Introduce nohz_flags in the struct rq, which will track these two flags
for now.
NOHZ_TICK_STOPPED keeps track of the tick stopped status that gets set when
the tick is stopped. It will be used to update the nohz idle load balancer data
structures during the first busy tick after the tick is restarted. At this
first busy tick after tickless idle, NOHZ_TICK_STOPPED flag will be reset.
This will minimize the nohz idle load balancer status updates that currently
happen for every tickless exit, making it more scalable when there
are many logical cpu's that enter and exit idle often.
NOHZ_BALANCE_KICK will track the need for nohz idle load balance
on this rq. This will replace the nohz_balance_kick in the rq, which was
not being updated atomically.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
kernel/sched/core.c | 5 +++--
kernel/sched/fair.c | 48 +++++++++++++++++++++++++++---------------------
kernel/sched/sched.h | 11 ++++++++++-
3 files changed, 40 insertions(+), 24 deletions(-)
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -575,7 +575,8 @@ void wake_up_idle_cpu(int cpu)
static inline bool got_nohz_idle_kick(void)
{
- return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
+ int cpu = smp_processor_id();
+ return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
}
#else /* CONFIG_NO_HZ */
@@ -6833,7 +6834,7 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ
- rq->nohz_balance_kick = 0;
+ rq->nohz_flags = 0;
#endif
#endif
init_rq_hrtick(rq);
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -4866,18 +4866,15 @@ static void nohz_balancer_kick(int cpu)
return;
}
- if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
- cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
-
- smp_mb();
- /*
- * Use smp_send_reschedule() instead of resched_cpu().
- * This way we generate a sched IPI on the target cpu which
- * is idle. And the softirq performing nohz idle load balance
- * will be run before returning from the IPI.
- */
- smp_send_reschedule(ilb_cpu);
- }
+ if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
+ return;
+ /*
+ * Use smp_send_reschedule() instead of resched_cpu().
+ * This way we generate a sched IPI on the target cpu which
+ * is idle. And the softirq performing nohz idle load balance
+ * will be run before returning from the IPI.
+ */
+ smp_send_reschedule(ilb_cpu);
return;
}
@@ -4941,6 +4938,8 @@ void select_nohz_load_balancer(int stop_
}
return;
}
+
+ set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
} else {
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return;
@@ -5056,8 +5055,9 @@ static void nohz_idle_balance(int this_c
struct rq *rq;
int balance_cpu;
- if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
- return;
+ if (idle != CPU_IDLE ||
+ !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
+ goto end;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu)
@@ -5068,10 +5068,8 @@ static void nohz_idle_balance(int this_c
* work being done for other cpus. Next load
* balancing owner will pick it up.
*/
- if (need_resched()) {
- this_rq->nohz_balance_kick = 0;
+ if (need_resched())
break;
- }
raw_spin_lock_irq(&this_rq->lock);
update_rq_clock(this_rq);
@@ -5085,7 +5083,8 @@ static void nohz_idle_balance(int this_c
this_rq->next_balance = rq->next_balance;
}
nohz.next_balance = this_rq->next_balance;
- this_rq->nohz_balance_kick = 0;
+end:
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
}
/*
@@ -5106,10 +5105,17 @@ static inline int nohz_kick_needed(struc
int ret;
int first_pick_cpu, second_pick_cpu;
- if (time_before(now, nohz.next_balance))
+ if (unlikely(idle_cpu(cpu)))
return 0;
- if (idle_cpu(cpu))
+ /*
+ * We may be recently in ticked or tickless idle mode. At the first
+ * busy tick after returning from idle, we will update the busy stats.
+ */
+ if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))))
+ clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+
+ if (time_before(now, nohz.next_balance))
return 0;
first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
@@ -5173,7 +5179,7 @@ void trigger_load_balance(struct rq *rq,
likely(!on_null_domain(cpu)))
raise_softirq(SCHED_SOFTIRQ);
#ifdef CONFIG_NO_HZ
- else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+ if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
nohz_balancer_kick(cpu);
#endif
}
Index: tip/kernel/sched/sched.h
===================================================================
--- tip.orig/kernel/sched/sched.h
+++ tip/kernel/sched/sched.h
@@ -371,7 +371,7 @@ struct rq {
unsigned long last_load_update_tick;
#ifdef CONFIG_NO_HZ
u64 nohz_stamp;
- unsigned char nohz_balance_kick;
+ unsigned long nohz_flags;
#endif
int skip_clock_update;
@@ -1062,3 +1062,12 @@ extern void init_rt_rq(struct rt_rq *rt_
extern void unthrottle_offline_cfs_rqs(struct rq *rq);
extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
+
+#ifdef CONFIG_NO_HZ
+enum rq_nohz_flag_bits {
+ NOHZ_TICK_STOPPED,
+ NOHZ_BALANCE_KICK,
+};
+
+#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
+#endif
^ permalink raw reply [flat|nested] 30+ messages in thread* [tip:sched/core] sched, nohz: Introduce nohz_flags in 'struct rq'
2011-12-02 1:07 ` [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq Suresh Siddha
@ 2011-12-06 9:53 ` tip-bot for Suresh Siddha
2011-12-06 12:14 ` [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq Srivatsa Vaddagiri
1 sibling, 0 replies; 30+ messages in thread
From: tip-bot for Suresh Siddha @ 2011-12-06 9:53 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, suresh.b.siddha, tglx,
mingo
Commit-ID: 1c792db7f7957e2e34b9a164f08200e36a25dfd0
Gitweb: http://git.kernel.org/tip/1c792db7f7957e2e34b9a164f08200e36a25dfd0
Author: Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Thu, 1 Dec 2011 17:07:32 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 09:06:30 +0100
sched, nohz: Introduce nohz_flags in 'struct rq'
Introduce nohz_flags in the struct rq, which will track these two flags
for now.
NOHZ_TICK_STOPPED keeps track of the tick stopped status that gets set when
the tick is stopped. It will be used to update the nohz idle load balancer data
structures during the first busy tick after the tick is restarted. At this
first busy tick after tickless idle, NOHZ_TICK_STOPPED flag will be reset.
This will minimize the nohz idle load balancer status updates that currently
happen for every tickless exit, making it more scalable when there
are many logical cpu's that enter and exit idle often.
NOHZ_BALANCE_KICK will track the need for nohz idle load balance
on this rq. This will replace the nohz_balance_kick in the rq, which was
not being updated atomically.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20111202010832.499438999@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched/core.c | 5 +++--
kernel/sched/fair.c | 48 +++++++++++++++++++++++++++---------------------
kernel/sched/sched.h | 11 ++++++++++-
3 files changed, 40 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07f1e99..7f1da77 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -575,7 +575,8 @@ void wake_up_idle_cpu(int cpu)
static inline bool got_nohz_idle_kick(void)
{
- return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
+ int cpu = smp_processor_id();
+ return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
}
#else /* CONFIG_NO_HZ */
@@ -6840,7 +6841,7 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ
- rq->nohz_balance_kick = 0;
+ rq->nohz_flags = 0;
#endif
#endif
init_rq_hrtick(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81ccb81..50c06b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4889,18 +4889,15 @@ static void nohz_balancer_kick(int cpu)
return;
}
- if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
- cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
-
- smp_mb();
- /*
- * Use smp_send_reschedule() instead of resched_cpu().
- * This way we generate a sched IPI on the target cpu which
- * is idle. And the softirq performing nohz idle load balance
- * will be run before returning from the IPI.
- */
- smp_send_reschedule(ilb_cpu);
- }
+ if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
+ return;
+ /*
+ * Use smp_send_reschedule() instead of resched_cpu().
+ * This way we generate a sched IPI on the target cpu which
+ * is idle. And the softirq performing nohz idle load balance
+ * will be run before returning from the IPI.
+ */
+ smp_send_reschedule(ilb_cpu);
return;
}
@@ -4964,6 +4961,8 @@ void select_nohz_load_balancer(int stop_tick)
}
return;
}
+
+ set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
} else {
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return;
@@ -5079,8 +5078,9 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
struct rq *rq;
int balance_cpu;
- if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
- return;
+ if (idle != CPU_IDLE ||
+ !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
+ goto end;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu)
@@ -5091,10 +5091,8 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
* work being done for other cpus. Next load
* balancing owner will pick it up.
*/
- if (need_resched()) {
- this_rq->nohz_balance_kick = 0;
+ if (need_resched())
break;
- }
raw_spin_lock_irq(&this_rq->lock);
update_rq_clock(this_rq);
@@ -5108,7 +5106,8 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
this_rq->next_balance = rq->next_balance;
}
nohz.next_balance = this_rq->next_balance;
- this_rq->nohz_balance_kick = 0;
+end:
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
}
/*
@@ -5129,10 +5128,17 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
int ret;
int first_pick_cpu, second_pick_cpu;
- if (time_before(now, nohz.next_balance))
+ if (unlikely(idle_cpu(cpu)))
return 0;
- if (idle_cpu(cpu))
+ /*
+ * We may be recently in ticked or tickless idle mode. At the first
+ * busy tick after returning from idle, we will update the busy stats.
+ */
+ if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))))
+ clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+
+ if (time_before(now, nohz.next_balance))
return 0;
first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
@@ -5196,7 +5202,7 @@ void trigger_load_balance(struct rq *rq, int cpu)
likely(!on_null_domain(cpu)))
raise_softirq(SCHED_SOFTIRQ);
#ifdef CONFIG_NO_HZ
- else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+ if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
nohz_balancer_kick(cpu);
#endif
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8715055..cf7d026 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -371,7 +371,7 @@ struct rq {
unsigned long last_load_update_tick;
#ifdef CONFIG_NO_HZ
u64 nohz_stamp;
- unsigned char nohz_balance_kick;
+ unsigned long nohz_flags;
#endif
int skip_clock_update;
@@ -1064,3 +1064,12 @@ extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
extern void unthrottle_offline_cfs_rqs(struct rq *rq);
extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
+
+#ifdef CONFIG_NO_HZ
+enum rq_nohz_flag_bits {
+ NOHZ_TICK_STOPPED,
+ NOHZ_BALANCE_KICK,
+};
+
+#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
+#endif
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq
2011-12-02 1:07 ` [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq Suresh Siddha
2011-12-06 9:53 ` [tip:sched/core] sched, nohz: Introduce nohz_flags in 'struct rq' tip-bot for Suresh Siddha
@ 2011-12-06 12:14 ` Srivatsa Vaddagiri
2011-12-06 19:26 ` Suresh Siddha
1 sibling, 1 reply; 30+ messages in thread
From: Srivatsa Vaddagiri @ 2011-12-06 12:14 UTC (permalink / raw)
To: Suresh Siddha
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Mike Galbraith,
linux-kernel, Tim Chen, alex.shi
* Suresh Siddha <suresh.b.siddha@intel.com> [2011-12-01 17:07:32]:
> @@ -4866,18 +4866,15 @@ static void nohz_balancer_kick(int cpu)
> return;
> }
>
> - if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
> - cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
> -
> - smp_mb();
> - /*
> - * Use smp_send_reschedule() instead of resched_cpu().
> - * This way we generate a sched IPI on the target cpu which
> - * is idle. And the softirq performing nohz idle load balance
> - * will be run before returning from the IPI.
> - */
> - smp_send_reschedule(ilb_cpu);
> - }
> + if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
s/cpu/ilb_cpu?
Also given that 'cpu' argument to nohz_balancer_kick() is no longer used, we can
avoid passing any argument to it as well ..
> + return;
> + /*
> + * Use smp_send_reschedule() instead of resched_cpu().
> + * This way we generate a sched IPI on the target cpu which
> + * is idle. And the softirq performing nohz idle load balance
> + * will be run before returning from the IPI.
> + */
> + smp_send_reschedule(ilb_cpu);
> return;
> }
- vatsa
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq
2011-12-06 12:14 ` [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq Srivatsa Vaddagiri
@ 2011-12-06 19:26 ` Suresh Siddha
2011-12-06 19:39 ` Peter Zijlstra
2011-12-06 20:24 ` [tip:sched/core] sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer tip-bot for Suresh Siddha
0 siblings, 2 replies; 30+ messages in thread
From: Suresh Siddha @ 2011-12-06 19:26 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Mike Galbraith,
linux-kernel, Tim Chen, Shi, Alex
On Tue, 2011-12-06 at 04:14 -0800, Srivatsa Vaddagiri wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> [2011-12-01 17:07:32]:
>
> > @@ -4866,18 +4866,15 @@ static void nohz_balancer_kick(int cpu)
> > return;
> > }
> >
> > - if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
> > - cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
> > -
> > - smp_mb();
> > - /*
> > - * Use smp_send_reschedule() instead of resched_cpu().
> > - * This way we generate a sched IPI on the target cpu which
> > - * is idle. And the softirq performing nohz idle load balance
> > - * will be run before returning from the IPI.
> > - */
> > - smp_send_reschedule(ilb_cpu);
> > - }
> > + if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
>
> s/cpu/ilb_cpu?
correct. Thanks again. Peter, can you queue the appended fix too?
> Also given that 'cpu' argument to nohz_balancer_kick() is no longer used, we can
> avoid passing any argument to it as well ..
we do use it currently in the find_new_ilb().
thanks,
suresh
---
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched, nohz: set the NOHZ_BALANCE_KICK flag for idle load balancer
Intention is to set the NOHZ_BALANCE_KICK flag for the 'ilb_cpu'. Not
for the 'cpu' which is the local cpu. Fix the typo.
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
kernel/sched/fair.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a6f8b..9e34688 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4852,7 +4852,7 @@ static void nohz_balancer_kick(int cpu)
if (ilb_cpu >= nr_cpu_ids)
return;
- if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
+ if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
return;
/*
* Use smp_send_reschedule() instead of resched_cpu().
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq
2011-12-06 19:26 ` Suresh Siddha
@ 2011-12-06 19:39 ` Peter Zijlstra
2011-12-06 20:24 ` [tip:sched/core] sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer tip-bot for Suresh Siddha
1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2011-12-06 19:39 UTC (permalink / raw)
To: Suresh Siddha
Cc: Srivatsa Vaddagiri, Ingo Molnar, Venki Pallipadi, Mike Galbraith,
linux-kernel, Tim Chen, Shi, Alex
On Tue, 2011-12-06 at 11:26 -0800, Suresh Siddha wrote:
> Peter, can you queue the appended fix too?
Queued, both, thanks guys!
^ permalink raw reply [flat|nested] 30+ messages in thread
* [tip:sched/core] sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
2011-12-06 19:26 ` Suresh Siddha
2011-12-06 19:39 ` Peter Zijlstra
@ 2011-12-06 20:24 ` tip-bot for Suresh Siddha
1 sibling, 0 replies; 30+ messages in thread
From: tip-bot for Suresh Siddha @ 2011-12-06 20:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, vatsa, hpa, mingo, a.p.zijlstra, suresh.b.siddha,
tglx, mingo
Commit-ID: cd490c5b285544dc1319cf79c2ca0528a6447f61
Gitweb: http://git.kernel.org/tip/cd490c5b285544dc1319cf79c2ca0528a6447f61
Author: Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Tue, 6 Dec 2011 11:26:34 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 20:51:29 +0100
sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
Intention is to set the NOHZ_BALANCE_KICK flag for the 'ilb_cpu'. Not
for the 'cpu' which is the local cpu. Fix the typo.
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323199594.1984.18.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched/fair.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8be45ed..6482136 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4853,7 +4853,7 @@ static void nohz_balancer_kick(int cpu)
if (ilb_cpu >= nr_cpu_ids)
return;
- if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
+ if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
return;
/*
* Use smp_send_reschedule() instead of resched_cpu().
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [patch v3 2/6] sched, nohz: track nr_busy_cpus in the sched_group_power
2011-12-02 1:07 [patch v3 0/6] nohz idle load balancing patches Suresh Siddha
2011-12-02 1:07 ` [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq Suresh Siddha
@ 2011-12-02 1:07 ` Suresh Siddha
2011-12-06 9:54 ` [tip:sched/core] sched, nohz: Track " tip-bot for Suresh Siddha
2011-12-02 1:07 ` [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing Suresh Siddha
` (3 subsequent siblings)
5 siblings, 1 reply; 30+ messages in thread
From: Suresh Siddha @ 2011-12-02 1:07 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
Mike Galbraith
Cc: linux-kernel, Tim Chen, alex.shi, Suresh Siddha
[-- Attachment #1: track_nr_busy_cpus_in_sched_group.patch --]
[-- Type: text/plain, Size: 4518 bytes --]
Introduce nr_busy_cpus in the struct sched_group_power [Not in sched_group
because sched groups are duplicated for the SD_OVERLAP scheduler domain]
and for each cpu that enters and exits idle, this parameter will
be updated in each scheduler group of the scheduler domain that this cpu
belongs to.
To avoid the frequent update of this state as the cpu enters
and exits idle, the update of the stat during idle exit is
delayed to the first timer tick that happens after the cpu becomes busy.
This is done using NOHZ_IDLE flag in the struct rq's nohz_flags.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
include/linux/sched.h | 6 ++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
kernel/time/tick-sched.c | 9 +++++++++
5 files changed, 48 insertions(+)
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -273,9 +273,11 @@ extern int runqueue_is_locked(int cpu);
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
extern void select_nohz_load_balancer(int stop_tick);
+extern void set_cpu_sd_state_idle(void);
extern int get_nohz_timer_target(void);
#else
static inline void select_nohz_load_balancer(int stop_tick) { }
+static inline void set_cpu_sd_state_idle(void);
#endif
/*
@@ -901,6 +903,10 @@ struct sched_group_power {
* single CPU.
*/
unsigned int power, power_orig;
+ /*
+ * Number of busy cpus in this group.
+ */
+ atomic_t nr_busy_cpus;
};
struct sched_group {
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -6017,6 +6017,7 @@ static void init_sched_groups_power(int
return;
update_group_power(sd, cpu);
+ atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
}
int __weak arch_sd_sibling_asym_packing(void)
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -4878,6 +4878,36 @@ static void nohz_balancer_kick(int cpu)
return;
}
+static inline void set_cpu_sd_state_busy(void)
+{
+ struct sched_domain *sd;
+ int cpu = smp_processor_id();
+
+ if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
+ return;
+ clear_bit(NOHZ_IDLE, nohz_flags(cpu));
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd)
+ atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+ rcu_read_unlock();
+}
+
+void set_cpu_sd_state_idle(void)
+{
+ struct sched_domain *sd;
+ int cpu = smp_processor_id();
+
+ if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
+ return;
+ set_bit(NOHZ_IDLE, nohz_flags(cpu));
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd)
+ atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+ rcu_read_unlock();
+}
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -5112,6 +5142,7 @@ static inline int nohz_kick_needed(struc
* We may be recently in ticked or tickless idle mode. At the first
* busy tick after returning from idle, we will update the busy stats.
*/
+ set_cpu_sd_state_busy();
if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))))
clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
Index: tip/kernel/time/tick-sched.c
===================================================================
--- tip.orig/kernel/time/tick-sched.c
+++ tip/kernel/time/tick-sched.c
@@ -297,6 +297,15 @@ void tick_nohz_stop_sched_tick(int inidl
ts = &per_cpu(tick_cpu_sched, cpu);
/*
+ * Update the idle state in the scheduler domain hierarchy
+ * when tick_nohz_stop_sched_tick() is called from the idle loop.
+ * State will be updated to busy during the first busy tick after
+ * exiting idle.
+ */
+ if (inidle)
+ set_cpu_sd_state_idle();
+
+ /*
* Call to tick_nohz_start_idle stops the last_update_time from being
* updated. Thus, it must not be called in the event we are called from
* irq_exit() with the prior state different than idle.
Index: tip/kernel/sched/sched.h
===================================================================
--- tip.orig/kernel/sched/sched.h
+++ tip/kernel/sched/sched.h
@@ -1067,6 +1067,7 @@ extern void account_cfs_bandwidth_used(i
enum rq_nohz_flag_bits {
NOHZ_TICK_STOPPED,
NOHZ_BALANCE_KICK,
+ NOHZ_IDLE,
};
#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
^ permalink raw reply [flat|nested] 30+ messages in thread* [tip:sched/core] sched, nohz: Track nr_busy_cpus in the sched_group_power
2011-12-02 1:07 ` [patch v3 2/6] sched, nohz: track nr_busy_cpus in the sched_group_power Suresh Siddha
@ 2011-12-06 9:54 ` tip-bot for Suresh Siddha
0 siblings, 0 replies; 30+ messages in thread
From: tip-bot for Suresh Siddha @ 2011-12-06 9:54 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, suresh.b.siddha, tglx,
mingo
Commit-ID: 69e1e811dcc436a6b129dbef273ad9ec22d095ce
Gitweb: http://git.kernel.org/tip/69e1e811dcc436a6b129dbef273ad9ec22d095ce
Author: Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Thu, 1 Dec 2011 17:07:33 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 09:06:32 +0100
sched, nohz: Track nr_busy_cpus in the sched_group_power
Introduce nr_busy_cpus in the struct sched_group_power [Not in sched_group
because sched groups are duplicated for the SD_OVERLAP scheduler domain]
and for each cpu that enters and exits idle, this parameter will
be updated in each scheduler group of the scheduler domain that this cpu
belongs to.
To avoid the frequent update of this state as the cpu enters
and exits idle, the update of the stat during idle exit is
delayed to the first timer tick that happens after the cpu becomes busy.
This is done using NOHZ_IDLE flag in the struct rq's nohz_flags.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20111202010832.555984323@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 6 ++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
kernel/time/tick-sched.c | 9 +++++++++
5 files changed, 48 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db17b7..295666c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -273,9 +273,11 @@ extern int runqueue_is_locked(int cpu);
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
extern void select_nohz_load_balancer(int stop_tick);
+extern void set_cpu_sd_state_idle(void);
extern int get_nohz_timer_target(void);
#else
static inline void select_nohz_load_balancer(int stop_tick) { }
+static inline void set_cpu_sd_state_idle(void);
#endif
/*
@@ -901,6 +903,10 @@ struct sched_group_power {
* single CPU.
*/
unsigned int power, power_orig;
+ /*
+ * Number of busy cpus in this group.
+ */
+ atomic_t nr_busy_cpus;
};
struct sched_group {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f1da77..699ff14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6024,6 +6024,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
return;
update_group_power(sd, cpu);
+ atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
}
int __weak arch_sd_sibling_asym_packing(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50c06b0..e050563 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4901,6 +4901,36 @@ static void nohz_balancer_kick(int cpu)
return;
}
+static inline void set_cpu_sd_state_busy(void)
+{
+ struct sched_domain *sd;
+ int cpu = smp_processor_id();
+
+ if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
+ return;
+ clear_bit(NOHZ_IDLE, nohz_flags(cpu));
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd)
+ atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+ rcu_read_unlock();
+}
+
+void set_cpu_sd_state_idle(void)
+{
+ struct sched_domain *sd;
+ int cpu = smp_processor_id();
+
+ if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
+ return;
+ set_bit(NOHZ_IDLE, nohz_flags(cpu));
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd)
+ atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+ rcu_read_unlock();
+}
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -5135,6 +5165,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
* We may be recently in ticked or tickless idle mode. At the first
* busy tick after returning from idle, we will update the busy stats.
*/
+ set_cpu_sd_state_busy();
if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))))
clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cf7d026..91810f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,6 +1069,7 @@ extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
enum rq_nohz_flag_bits {
NOHZ_TICK_STOPPED,
NOHZ_BALANCE_KICK,
+ NOHZ_IDLE,
};
#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4042064..31cc061 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -297,6 +297,15 @@ void tick_nohz_stop_sched_tick(int inidle)
ts = &per_cpu(tick_cpu_sched, cpu);
/*
+ * Update the idle state in the scheduler domain hierarchy
+ * when tick_nohz_stop_sched_tick() is called from the idle loop.
+ * State will be updated to busy during the first busy tick after
+ * exiting idle.
+ */
+ if (inidle)
+ set_cpu_sd_state_idle();
+
+ /*
* Call to tick_nohz_start_idle stops the last_update_time from being
* updated. Thus, it must not be called in the event we are called from
* irq_exit() with the prior state different than idle.
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing
2011-12-02 1:07 [patch v3 0/6] nohz idle load balancing patches Suresh Siddha
2011-12-02 1:07 ` [patch v3 1/6] sched, nohz: introduce nohz_flags in the struct rq Suresh Siddha
2011-12-02 1:07 ` [patch v3 2/6] sched, nohz: track nr_busy_cpus in the sched_group_power Suresh Siddha
@ 2011-12-02 1:07 ` Suresh Siddha
2011-12-06 6:37 ` Srivatsa Vaddagiri
2011-12-06 9:54 ` [tip:sched/core] sched, nohz: Implement " tip-bot for Suresh Siddha
2011-12-02 1:07 ` [patch v3 4/6] sched, nohz: cleanup the find_new_ilb() using sched groups nr_busy_cpus Suresh Siddha
` (2 subsequent siblings)
5 siblings, 2 replies; 30+ messages in thread
From: Suresh Siddha @ 2011-12-02 1:07 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
Mike Galbraith
Cc: linux-kernel, Tim Chen, alex.shi, Suresh Siddha
[-- Attachment #1: simplify_ilb_using_grp_nr_busy_cpus.patch --]
[-- Type: text/plain, Size: 9354 bytes --]
When there are many logical cpu's that enter and exit idle often, members of
the global nohz data structure are getting modified very frequently causing
lot of cache-line contention.
Make the nohz idle load balancing more scalable by using the sched domain
topology and 'nr_busy_cpu's in the struct sched_group_power.
Idle load balance is kicked on one of the idle cpu's when there is at least
one idle cpu and
- a busy rq having more than one task or
- a busy rq's scheduler group that share package resources (like HT/MC
siblings) and has more than one member in that group busy or
- for the SD_ASYM_PACKING domain, if the lower numbered cpu's in that
domain are idle compared to the busy ones.
This will help in kicking the idle load balancing request only when
there is a potential imbalance. And once it is mostly balanced, these kicks will
be minimized.
These changes helped improve the workload that is context switch intensive
between number of task pairs by 2x on a 8 socket NHM-EX based system.
Reported-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
kernel/sched/fair.c | 160 +++++++++++++++-------------------------------------
1 file changed, 47 insertions(+), 113 deletions(-)
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -4704,28 +4704,17 @@ out_unlock:
#ifdef CONFIG_NO_HZ
/*
* idle load balancing details
- * - One of the idle CPUs nominates itself as idle load_balancer, while
- * entering idle.
- * - This idle load balancer CPU will also go into tickless mode when
- * it is idle, just like all other idle CPUs
* - When one of the busy CPUs notice that there may be an idle rebalancing
* needed, they will kick the idle load balancer, which then does idle
* load balancing for all the idle CPUs.
*/
static struct {
- atomic_t load_balancer;
- atomic_t first_pick_cpu;
- atomic_t second_pick_cpu;
cpumask_var_t idle_cpus_mask;
cpumask_var_t grp_idle_mask;
+ atomic_t nr_cpus;
unsigned long next_balance; /* in jiffy units */
} nohz ____cacheline_aligned;
-int get_nohz_load_balancer(void)
-{
- return atomic_read(&nohz.load_balancer);
-}
-
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
/**
* lowest_flag_domain - Return lowest sched_domain containing flag.
@@ -4802,9 +4791,9 @@ static inline int is_semi_idle_group(str
*/
static int find_new_ilb(int cpu)
{
+ int ilb = cpumask_first(nohz.idle_cpus_mask);
struct sched_domain *sd;
struct sched_group *ilb_group;
- int ilb = nr_cpu_ids;
/*
* Have idle load balancer selection from semi-idle packages only
@@ -4858,13 +4847,10 @@ static void nohz_balancer_kick(int cpu)
nohz.next_balance++;
- ilb_cpu = get_nohz_load_balancer();
+ ilb_cpu = find_new_ilb(cpu);
- if (ilb_cpu >= nr_cpu_ids) {
- ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
- if (ilb_cpu >= nr_cpu_ids)
- return;
- }
+ if (ilb_cpu >= nr_cpu_ids)
+ return;
if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
return;
@@ -4909,77 +4895,20 @@ void set_cpu_sd_state_idle(void)
}
/*
- * This routine will try to nominate the ilb (idle load balancing)
- * owner among the cpus whose ticks are stopped. ilb owner will do the idle
- * load balancing on behalf of all those cpus.
- *
- * When the ilb owner becomes busy, we will not have new ilb owner until some
- * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
- * idle load balancing by kicking one of the idle CPUs.
- *
- * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
- * ilb owner CPU in future (when there is a need for idle load balancing on
- * behalf of all idle CPUs).
+ * This routine will record that this cpu is going idle with tick stopped.
+ * This info will be used in performing idle load balancing in the future.
*/
void select_nohz_load_balancer(int stop_tick)
{
int cpu = smp_processor_id();
if (stop_tick) {
- if (!cpu_active(cpu)) {
- if (atomic_read(&nohz.load_balancer) != cpu)
- return;
-
- /*
- * If we are going offline and still the leader,
- * give up!
- */
- if (atomic_cmpxchg(&nohz.load_balancer, cpu,
- nr_cpu_ids) != cpu)
- BUG();
-
+ if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
return;
- }
cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-
- if (atomic_read(&nohz.first_pick_cpu) == cpu)
- atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
- if (atomic_read(&nohz.second_pick_cpu) == cpu)
- atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
-
- if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
- int new_ilb;
-
- /* make me the ilb owner */
- if (atomic_cmpxchg(&nohz.load_balancer, nr_cpu_ids,
- cpu) != nr_cpu_ids)
- return;
-
- /*
- * Check to see if there is a more power-efficient
- * ilb.
- */
- new_ilb = find_new_ilb(cpu);
- if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
- atomic_set(&nohz.load_balancer, nr_cpu_ids);
- resched_cpu(new_ilb);
- return;
- }
- return;
- }
-
+ atomic_inc(&nohz.nr_cpus);
set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
- } else {
- if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
- return;
-
- cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
-
- if (atomic_read(&nohz.load_balancer) == cpu)
- if (atomic_cmpxchg(&nohz.load_balancer, cpu,
- nr_cpu_ids) != cpu)
- BUG();
}
return;
}
@@ -5090,7 +5019,7 @@ static void nohz_idle_balance(int this_c
goto end;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
- if (balance_cpu == this_cpu)
+ if (balance_cpu == this_cpu || !idle_cpu(this_cpu))
continue;
/*
@@ -5118,22 +5047,18 @@ end:
}
/*
- * Current heuristic for kicking the idle load balancer
- * - first_pick_cpu is the one of the busy CPUs. It will kick
- * idle load balancer when it has more than one process active. This
- * eliminates the need for idle load balancing altogether when we have
- * only one running process in the system (common case).
- * - If there are more than one busy CPU, idle load balancer may have
- * to run for active_load_balance to happen (i.e., two busy CPUs are
- * SMT or core siblings and can run better if they move to different
- * physical CPUs). So, second_pick_cpu is the second of the busy CPUs
- * which will kick idle load balancer as soon as it has any load.
+ * Current heuristic for kicking the idle load balancer in the presence
+ * of an idle cpu is the system.
+ * - This rq has more than one task.
+ * - At any scheduler domain level, this cpu's scheduler group has multiple
+ * busy cpu's exceeding the group's power.
+ * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
+ * domain span are idle.
*/
static inline int nohz_kick_needed(struct rq *rq, int cpu)
{
unsigned long now = jiffies;
- int ret;
- int first_pick_cpu, second_pick_cpu;
+ struct sched_domain *sd;
if (unlikely(idle_cpu(cpu)))
return 0;
@@ -5143,32 +5068,44 @@ static inline int nohz_kick_needed(struc
* busy tick after returning from idle, we will update the busy stats.
*/
set_cpu_sd_state_busy();
- if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))))
+ if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+ cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+ atomic_dec(&nohz.nr_cpus);
+ }
+
+ /*
+ * None are in tickless mode and hence no need for NOHZ idle load
+ * balancing.
+ */
+ if (likely(!atomic_read(&nohz.nr_cpus)))
+ return 0;
if (time_before(now, nohz.next_balance))
return 0;
- first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
- second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
+ if (rq->nr_running >= 2)
+ goto need_kick;
- if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
- second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
- return 0;
+ for_each_domain(cpu, sd) {
+ struct sched_group *sg = sd->groups;
+ struct sched_group_power *sgp = sg->sgp;
+ int nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+ if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
+ goto need_kick;
+
+ if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
+ && (cpumask_first_and(nohz.idle_cpus_mask,
+ sched_domain_span(sd)) < cpu))
+ goto need_kick;
- ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
- if (ret == nr_cpu_ids || ret == cpu) {
- atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
- if (rq->nr_running > 1)
- return 1;
- } else {
- ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
- if (ret == nr_cpu_ids || ret == cpu) {
- if (rq->nr_running)
- return 1;
- }
+ if (!(sd->flags & (SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING)))
+ break;
}
return 0;
+need_kick:
+ return 1;
}
#else
static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
@@ -5629,9 +5566,6 @@ __init void init_sched_fair_class(void)
#ifdef CONFIG_NO_HZ
zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
- atomic_set(&nohz.load_balancer, nr_cpu_ids);
- atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
- atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
#endif
#endif /* SMP */
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing
2011-12-02 1:07 ` [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing Suresh Siddha
@ 2011-12-06 6:37 ` Srivatsa Vaddagiri
2011-12-06 19:19 ` Suresh Siddha
[not found] ` <A75BCAD09CE00A4280CDD4429D85F1F9261B42A1F9@orsmsx501.amr.corp.intel.com>
2011-12-06 9:54 ` [tip:sched/core] sched, nohz: Implement " tip-bot for Suresh Siddha
1 sibling, 2 replies; 30+ messages in thread
From: Srivatsa Vaddagiri @ 2011-12-06 6:37 UTC (permalink / raw)
To: Suresh Siddha
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Mike Galbraith,
linux-kernel, Tim Chen, alex.shi
* Suresh Siddha <suresh.b.siddha@intel.com> [2011-12-01 17:07:34]:
> @@ -5090,7 +5019,7 @@ static void nohz_idle_balance(int this_c
> goto end;
>
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> - if (balance_cpu == this_cpu)
> + if (balance_cpu == this_cpu || !idle_cpu(this_cpu))
> continue;
Hmm ..did you mean to use '!idle_cpu(balance_cpu)' there? If the intent
was on checking this_cpu becoming busy, then we'd rather do a break on
that condition rather than continuing with the loop?
- vatsa
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing
2011-12-06 6:37 ` Srivatsa Vaddagiri
@ 2011-12-06 19:19 ` Suresh Siddha
2011-12-06 20:24 ` [tip:sched/core] sched, nohz: Fix the idle cpu check in nohz_idle_balance tip-bot for Suresh Siddha
[not found] ` <A75BCAD09CE00A4280CDD4429D85F1F9261B42A1F9@orsmsx501.amr.corp.intel.com>
1 sibling, 1 reply; 30+ messages in thread
From: Suresh Siddha @ 2011-12-06 19:19 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Mike Galbraith,
linux-kernel, Tim Chen, Shi, Alex
On Mon, 2011-12-05 at 22:37 -0800, Srivatsa Vaddagiri wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> [2011-12-01 17:07:34]:
>
> > @@ -5090,7 +5019,7 @@ static void nohz_idle_balance(int this_c
> > goto end;
> >
> > for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> > - if (balance_cpu == this_cpu)
> > + if (balance_cpu == this_cpu || !idle_cpu(this_cpu))
> > continue;
>
> Hmm ..did you mean to use '!idle_cpu(balance_cpu)' there?
Thanks for reviewing closely. yes, it was a typo. Peter, please queue up
this fix.
---
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched, nohz: fix the idle cpu check in nohz_idle_balance
The cpu bit in the nohz.idle_cpus_mask is reset in the first busy tick after
exiting idle. So during nohz_idle_balance(), intention is to double
check if the cpu that is part of the idle_cpu_mask is indeed idle before
going ahead in performing idle balance for that cpu.
Fix the cpu typo in the idle_cpu() check during nohz_idle_balance().
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
kernel/sched/fair.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a6f8b..0bcd144 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5019,7 +5019,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
goto end;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
- if (balance_cpu == this_cpu || !idle_cpu(this_cpu))
+ if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;
/*
^ permalink raw reply related [flat|nested] 30+ messages in thread* [tip:sched/core] sched, nohz: Fix the idle cpu check in nohz_idle_balance
2011-12-06 19:19 ` Suresh Siddha
@ 2011-12-06 20:24 ` tip-bot for Suresh Siddha
0 siblings, 0 replies; 30+ messages in thread
From: tip-bot for Suresh Siddha @ 2011-12-06 20:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, vatsa, hpa, mingo, a.p.zijlstra, suresh.b.siddha,
tglx, mingo
Commit-ID: 8a6d42d1b32ad239c28f445138ea9c19aa52dd20
Gitweb: http://git.kernel.org/tip/8a6d42d1b32ad239c28f445138ea9c19aa52dd20
Author: Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Tue, 6 Dec 2011 11:19:37 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 20:51:27 +0100
sched, nohz: Fix the idle cpu check in nohz_idle_balance
The cpu bit in the nohz.idle_cpus_mask is reset in the first busy tick after
exiting idle. So during nohz_idle_balance(), intention is to double
check if the cpu that is part of the idle_cpu_mask is indeed idle before
going ahead in performing idle balance for that cpu.
Fix the cpu typo in the idle_cpu() check during nohz_idle_balance().
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323199177.1984.12.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched/fair.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4174338..8be45ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5020,7 +5020,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
goto end;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
- if (balance_cpu == this_cpu || !idle_cpu(this_cpu))
+ if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;
/*
^ permalink raw reply related [flat|nested] 30+ messages in thread
[parent not found: <A75BCAD09CE00A4280CDD4429D85F1F9261B42A1F9@orsmsx501.amr.corp.intel.com>]
* Re: [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing
[not found] ` <A75BCAD09CE00A4280CDD4429D85F1F9261B42A1F9@orsmsx501.amr.corp.intel.com>
@ 2011-12-06 19:27 ` Suresh Siddha
0 siblings, 0 replies; 30+ messages in thread
From: Suresh Siddha @ 2011-12-06 19:27 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Mike Galbraith,
linux-kernel, Tim Chen, Shi, Alex
On Tue, 2011-12-06 at 11:19 -0800, Suresh Siddha wrote:
> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Subject: sched, nohz: fix the idle cpu check in nohz_idle_balance
>
> cpu bit in the nohz.idle_cpu_mask are reset in the first busy tick after
> exiting idle. So during nohz_idle_balance(), intention is to double
> check if the cpu that is part of the idle_cpu_mask is indeed idle before
> going ahead in performing idle balance for that cpu.
>
> Fix the cpu typo in the idle_cpu() check during nohz_idle_balance().
>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Should have added:
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [tip:sched/core] sched, nohz: Implement sched group, domain aware nohz idle load balancing
2011-12-02 1:07 ` [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing Suresh Siddha
2011-12-06 6:37 ` Srivatsa Vaddagiri
@ 2011-12-06 9:54 ` tip-bot for Suresh Siddha
1 sibling, 0 replies; 30+ messages in thread
From: tip-bot for Suresh Siddha @ 2011-12-06 9:54 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, tim.c.chen, hpa, mingo, a.p.zijlstra,
suresh.b.siddha, tglx, mingo
Commit-ID: 0b005cf54eac170a8f22540ab096a6e07bf49e7c
Gitweb: http://git.kernel.org/tip/0b005cf54eac170a8f22540ab096a6e07bf49e7c
Author: Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Thu, 1 Dec 2011 17:07:34 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 09:06:34 +0100
sched, nohz: Implement sched group, domain aware nohz idle load balancing
When there are many logical cpu's that enter and exit idle often, members of
the global nohz data structure are getting modified very frequently causing
a lot of cache-line contention.
Make the nohz idle load balancing more scalable by using the sched domain
topology and 'nr_busy_cpu's in the struct sched_group_power.
Idle load balance is kicked on one of the idle cpu's when there is at least
one idle cpu and:
- a busy rq having more than one task or
- a busy rq's scheduler group that share package resources (like HT/MC
siblings) and has more than one member in that group busy or
- for the SD_ASYM_PACKING domain, if the lower numbered cpu's in that
domain are idle compared to the busy ones.
This will help in kicking the idle load balancing request only when
there is a potential imbalance. And once it is mostly balanced, these kicks will
be minimized.
These changes helped improve the workload that is context switch intensive
between number of task pairs by 2x on a 8 socket NHM-EX based system.
Reported-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched/fair.c | 160 +++++++++++++++------------------------------------
1 files changed, 47 insertions(+), 113 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e050563..821af14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4727,28 +4727,17 @@ out_unlock:
#ifdef CONFIG_NO_HZ
/*
* idle load balancing details
- * - One of the idle CPUs nominates itself as idle load_balancer, while
- * entering idle.
- * - This idle load balancer CPU will also go into tickless mode when
- * it is idle, just like all other idle CPUs
* - When one of the busy CPUs notice that there may be an idle rebalancing
* needed, they will kick the idle load balancer, which then does idle
* load balancing for all the idle CPUs.
*/
static struct {
- atomic_t load_balancer;
- atomic_t first_pick_cpu;
- atomic_t second_pick_cpu;
cpumask_var_t idle_cpus_mask;
cpumask_var_t grp_idle_mask;
+ atomic_t nr_cpus;
unsigned long next_balance; /* in jiffy units */
} nohz ____cacheline_aligned;
-int get_nohz_load_balancer(void)
-{
- return atomic_read(&nohz.load_balancer);
-}
-
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
/**
* lowest_flag_domain - Return lowest sched_domain containing flag.
@@ -4825,9 +4814,9 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group)
*/
static int find_new_ilb(int cpu)
{
+ int ilb = cpumask_first(nohz.idle_cpus_mask);
struct sched_domain *sd;
struct sched_group *ilb_group;
- int ilb = nr_cpu_ids;
/*
* Have idle load balancer selection from semi-idle packages only
@@ -4881,13 +4870,10 @@ static void nohz_balancer_kick(int cpu)
nohz.next_balance++;
- ilb_cpu = get_nohz_load_balancer();
+ ilb_cpu = find_new_ilb(cpu);
- if (ilb_cpu >= nr_cpu_ids) {
- ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
- if (ilb_cpu >= nr_cpu_ids)
- return;
- }
+ if (ilb_cpu >= nr_cpu_ids)
+ return;
if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
return;
@@ -4932,77 +4918,20 @@ void set_cpu_sd_state_idle(void)
}
/*
- * This routine will try to nominate the ilb (idle load balancing)
- * owner among the cpus whose ticks are stopped. ilb owner will do the idle
- * load balancing on behalf of all those cpus.
- *
- * When the ilb owner becomes busy, we will not have new ilb owner until some
- * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
- * idle load balancing by kicking one of the idle CPUs.
- *
- * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
- * ilb owner CPU in future (when there is a need for idle load balancing on
- * behalf of all idle CPUs).
+ * This routine will record that this cpu is going idle with tick stopped.
+ * This info will be used in performing idle load balancing in the future.
*/
void select_nohz_load_balancer(int stop_tick)
{
int cpu = smp_processor_id();
if (stop_tick) {
- if (!cpu_active(cpu)) {
- if (atomic_read(&nohz.load_balancer) != cpu)
- return;
-
- /*
- * If we are going offline and still the leader,
- * give up!
- */
- if (atomic_cmpxchg(&nohz.load_balancer, cpu,
- nr_cpu_ids) != cpu)
- BUG();
-
+ if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
return;
- }
cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-
- if (atomic_read(&nohz.first_pick_cpu) == cpu)
- atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
- if (atomic_read(&nohz.second_pick_cpu) == cpu)
- atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
-
- if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
- int new_ilb;
-
- /* make me the ilb owner */
- if (atomic_cmpxchg(&nohz.load_balancer, nr_cpu_ids,
- cpu) != nr_cpu_ids)
- return;
-
- /*
- * Check to see if there is a more power-efficient
- * ilb.
- */
- new_ilb = find_new_ilb(cpu);
- if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
- atomic_set(&nohz.load_balancer, nr_cpu_ids);
- resched_cpu(new_ilb);
- return;
- }
- return;
- }
-
+ atomic_inc(&nohz.nr_cpus);
set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
- } else {
- if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
- return;
-
- cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
-
- if (atomic_read(&nohz.load_balancer) == cpu)
- if (atomic_cmpxchg(&nohz.load_balancer, cpu,
- nr_cpu_ids) != cpu)
- BUG();
}
return;
}
@@ -5113,7 +5042,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
goto end;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
- if (balance_cpu == this_cpu)
+ if (balance_cpu == this_cpu || !idle_cpu(this_cpu))
continue;
/*
@@ -5141,22 +5070,18 @@ end:
}
/*
- * Current heuristic for kicking the idle load balancer
- * - first_pick_cpu is the one of the busy CPUs. It will kick
- * idle load balancer when it has more than one process active. This
- * eliminates the need for idle load balancing altogether when we have
- * only one running process in the system (common case).
- * - If there are more than one busy CPU, idle load balancer may have
- * to run for active_load_balance to happen (i.e., two busy CPUs are
- * SMT or core siblings and can run better if they move to different
- * physical CPUs). So, second_pick_cpu is the second of the busy CPUs
- * which will kick idle load balancer as soon as it has any load.
+ * Current heuristic for kicking the idle load balancer in the presence
+ * of an idle cpu is the system.
+ * - This rq has more than one task.
+ * - At any scheduler domain level, this cpu's scheduler group has multiple
+ * busy cpu's exceeding the group's power.
+ * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
+ * domain span are idle.
*/
static inline int nohz_kick_needed(struct rq *rq, int cpu)
{
unsigned long now = jiffies;
- int ret;
- int first_pick_cpu, second_pick_cpu;
+ struct sched_domain *sd;
if (unlikely(idle_cpu(cpu)))
return 0;
@@ -5166,32 +5091,44 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
* busy tick after returning from idle, we will update the busy stats.
*/
set_cpu_sd_state_busy();
- if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))))
+ if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+ cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+ atomic_dec(&nohz.nr_cpus);
+ }
+
+ /*
+ * None are in tickless mode and hence no need for NOHZ idle load
+ * balancing.
+ */
+ if (likely(!atomic_read(&nohz.nr_cpus)))
+ return 0;
if (time_before(now, nohz.next_balance))
return 0;
- first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
- second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
+ if (rq->nr_running >= 2)
+ goto need_kick;
- if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
- second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
- return 0;
+ for_each_domain(cpu, sd) {
+ struct sched_group *sg = sd->groups;
+ struct sched_group_power *sgp = sg->sgp;
+ int nr_busy = atomic_read(&sgp->nr_busy_cpus);
- ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
- if (ret == nr_cpu_ids || ret == cpu) {
- atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
- if (rq->nr_running > 1)
- return 1;
- } else {
- ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
- if (ret == nr_cpu_ids || ret == cpu) {
- if (rq->nr_running)
- return 1;
- }
+ if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
+ goto need_kick;
+
+ if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
+ && (cpumask_first_and(nohz.idle_cpus_mask,
+ sched_domain_span(sd)) < cpu))
+ goto need_kick;
+
+ if (!(sd->flags & (SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING)))
+ break;
}
return 0;
+need_kick:
+ return 1;
}
#else
static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
@@ -5652,9 +5589,6 @@ __init void init_sched_fair_class(void)
#ifdef CONFIG_NO_HZ
zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
- atomic_set(&nohz.load_balancer, nr_cpu_ids);
- atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
- atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
#endif
#endif /* SMP */
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [patch v3 4/6] sched, nohz: cleanup the find_new_ilb() using sched groups nr_busy_cpus
2011-12-02 1:07 [patch v3 0/6] nohz idle load balancing patches Suresh Siddha
` (2 preceding siblings ...)
2011-12-02 1:07 ` [patch v3 3/6] sched, nohz: sched group, domain aware nohz idle load balancing Suresh Siddha
@ 2011-12-02 1:07 ` Suresh Siddha
2011-12-06 9:55 ` [tip:sched/core] sched, nohz: Clean up " tip-bot for Suresh Siddha
2011-12-02 1:07 ` [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains Suresh Siddha
2011-12-02 1:07 ` [patch v3 6/6] sched: fix the sched group node allocation for SD_OVERLAP domain Suresh Siddha
5 siblings, 1 reply; 30+ messages in thread
From: Suresh Siddha @ 2011-12-02 1:07 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
Mike Galbraith
Cc: linux-kernel, Tim Chen, alex.shi, Suresh Siddha
[-- Attachment #1: cleanup_find_ilb.patch --]
[-- Type: text/plain, Size: 3258 bytes --]
nr_busy_cpus in the sched_group_power indicates whether the group
is semi idle or not. This helps remove the is_semi_idle_group() and simplify
the find_new_ilb() in the context of finding an optimal cpu that can do
idle load balancing.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
kernel/sched/fair.c | 48 ++++++++++++------------------------------------
1 file changed, 12 insertions(+), 36 deletions(-)
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -4710,7 +4710,6 @@ out_unlock:
*/
static struct {
cpumask_var_t idle_cpus_mask;
- cpumask_var_t grp_idle_mask;
atomic_t nr_cpus;
unsigned long next_balance; /* in jiffy units */
} nohz ____cacheline_aligned;
@@ -4751,33 +4750,6 @@ static inline struct sched_domain *lowes
(sd && (sd->flags & flag)); sd = sd->parent)
/**
- * is_semi_idle_group - Checks if the given sched_group is semi-idle.
- * @ilb_group: group to be checked for semi-idleness
- *
- * Returns: 1 if the group is semi-idle. 0 otherwise.
- *
- * We define a sched_group to be semi idle if it has atleast one idle-CPU
- * and atleast one non-idle CPU. This helper function checks if the given
- * sched_group is semi-idle or not.
- */
-static inline int is_semi_idle_group(struct sched_group *ilb_group)
-{
- cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask,
- sched_group_cpus(ilb_group));
-
- /*
- * A sched_group is semi-idle when it has atleast one busy cpu
- * and atleast one idle cpu.
- */
- if (cpumask_empty(nohz.grp_idle_mask))
- return 0;
-
- if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group)))
- return 0;
-
- return 1;
-}
-/**
* find_new_ilb - Finds the optimum idle load balancer for nomination.
* @cpu: The cpu which is nominating a new idle_load_balancer.
*
@@ -4792,8 +4764,8 @@ static inline int is_semi_idle_group(str
static int find_new_ilb(int cpu)
{
int ilb = cpumask_first(nohz.idle_cpus_mask);
+ struct sched_group *ilbg;
struct sched_domain *sd;
- struct sched_group *ilb_group;
/*
* Have idle load balancer selection from semi-idle packages only
@@ -4811,23 +4783,28 @@ static int find_new_ilb(int cpu)
rcu_read_lock();
for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
- ilb_group = sd->groups;
+ ilbg = sd->groups;
do {
- if (is_semi_idle_group(ilb_group)) {
- ilb = cpumask_first(nohz.grp_idle_mask);
+ if (ilbg->group_weight !=
+ atomic_read(&ilbg->sgp->nr_busy_cpus)) {
+ ilb = cpumask_first_and(nohz.idle_cpus_mask,
+ sched_group_cpus(ilbg));
goto unlock;
}
- ilb_group = ilb_group->next;
+ ilbg = ilbg->next;
- } while (ilb_group != sd->groups);
+ } while (ilbg != sd->groups);
}
unlock:
rcu_read_unlock();
out_done:
- return ilb;
+ if (ilb < nr_cpu_ids && idle_cpu(ilb))
+ return ilb;
+
+ return nr_cpu_ids;
}
#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
static inline int find_new_ilb(int call_cpu)
@@ -5565,7 +5542,6 @@ __init void init_sched_fair_class(void)
#ifdef CONFIG_NO_HZ
zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
- alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
#endif
#endif /* SMP */
^ permalink raw reply [flat|nested] 30+ messages in thread* [tip:sched/core] sched, nohz: Clean up the find_new_ilb() using sched groups nr_busy_cpus
2011-12-02 1:07 ` [patch v3 4/6] sched, nohz: cleanup the find_new_ilb() using sched groups nr_busy_cpus Suresh Siddha
@ 2011-12-06 9:55 ` tip-bot for Suresh Siddha
0 siblings, 0 replies; 30+ messages in thread
From: tip-bot for Suresh Siddha @ 2011-12-06 9:55 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, suresh.b.siddha, tglx,
mingo
Commit-ID: 786d6dc7aeb2bfbfe417507b7beb83919f319db3
Gitweb: http://git.kernel.org/tip/786d6dc7aeb2bfbfe417507b7beb83919f319db3
Author: Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Thu, 1 Dec 2011 17:07:35 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 09:06:36 +0100
sched, nohz: Clean up the find_new_ilb() using sched groups nr_busy_cpus
nr_busy_cpus in the sched_group_power indicates whether the group
is semi idle or not. This helps remove the is_semi_idle_group() and simplify
the find_new_ilb() in the context of finding an optimal cpu that can do
idle load balancing.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20111202010832.656983582@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched/fair.c | 48 ++++++++++++------------------------------------
1 files changed, 12 insertions(+), 36 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 821af14..65a6f8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4733,7 +4733,6 @@ out_unlock:
*/
static struct {
cpumask_var_t idle_cpus_mask;
- cpumask_var_t grp_idle_mask;
atomic_t nr_cpus;
unsigned long next_balance; /* in jiffy units */
} nohz ____cacheline_aligned;
@@ -4774,33 +4773,6 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
(sd && (sd->flags & flag)); sd = sd->parent)
/**
- * is_semi_idle_group - Checks if the given sched_group is semi-idle.
- * @ilb_group: group to be checked for semi-idleness
- *
- * Returns: 1 if the group is semi-idle. 0 otherwise.
- *
- * We define a sched_group to be semi idle if it has atleast one idle-CPU
- * and atleast one non-idle CPU. This helper function checks if the given
- * sched_group is semi-idle or not.
- */
-static inline int is_semi_idle_group(struct sched_group *ilb_group)
-{
- cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask,
- sched_group_cpus(ilb_group));
-
- /*
- * A sched_group is semi-idle when it has atleast one busy cpu
- * and atleast one idle cpu.
- */
- if (cpumask_empty(nohz.grp_idle_mask))
- return 0;
-
- if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group)))
- return 0;
-
- return 1;
-}
-/**
* find_new_ilb - Finds the optimum idle load balancer for nomination.
* @cpu: The cpu which is nominating a new idle_load_balancer.
*
@@ -4815,8 +4787,8 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group)
static int find_new_ilb(int cpu)
{
int ilb = cpumask_first(nohz.idle_cpus_mask);
+ struct sched_group *ilbg;
struct sched_domain *sd;
- struct sched_group *ilb_group;
/*
* Have idle load balancer selection from semi-idle packages only
@@ -4834,23 +4806,28 @@ static int find_new_ilb(int cpu)
rcu_read_lock();
for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
- ilb_group = sd->groups;
+ ilbg = sd->groups;
do {
- if (is_semi_idle_group(ilb_group)) {
- ilb = cpumask_first(nohz.grp_idle_mask);
+ if (ilbg->group_weight !=
+ atomic_read(&ilbg->sgp->nr_busy_cpus)) {
+ ilb = cpumask_first_and(nohz.idle_cpus_mask,
+ sched_group_cpus(ilbg));
goto unlock;
}
- ilb_group = ilb_group->next;
+ ilbg = ilbg->next;
- } while (ilb_group != sd->groups);
+ } while (ilbg != sd->groups);
}
unlock:
rcu_read_unlock();
out_done:
- return ilb;
+ if (ilb < nr_cpu_ids && idle_cpu(ilb))
+ return ilb;
+
+ return nr_cpu_ids;
}
#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
static inline int find_new_ilb(int call_cpu)
@@ -5588,7 +5565,6 @@ __init void init_sched_fair_class(void)
#ifdef CONFIG_NO_HZ
zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
- alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
#endif
#endif /* SMP */
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-02 1:07 [patch v3 0/6] nohz idle load balancing patches Suresh Siddha
` (3 preceding siblings ...)
2011-12-02 1:07 ` [patch v3 4/6] sched, nohz: cleanup the find_new_ilb() using sched groups nr_busy_cpus Suresh Siddha
@ 2011-12-02 1:07 ` Suresh Siddha
2011-12-02 3:34 ` Mike Galbraith
2011-12-02 1:07 ` [patch v3 6/6] sched: fix the sched group node allocation for SD_OVERLAP domain Suresh Siddha
5 siblings, 1 reply; 30+ messages in thread
From: Suresh Siddha @ 2011-12-02 1:07 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
Mike Galbraith
Cc: linux-kernel, Tim Chen, alex.shi, Suresh Siddha
[-- Attachment #1: use_ttwu_queue_when_crossing_cache_domains.patch --]
[-- Type: text/plain, Size: 1916 bytes --]
From: Mike Galbraith <efault@gmx.de>
Context-switch intensive microbenchmark on a 8-socket system had
~600K times more resched IPI's on each logical CPU because of the
TTWU_QUEUE sched feature, which queues the task on the remote cpu's
queue and completes the wakeup locally using an IPI.
As the TTWU_QUEUE sched feature is for minimizing the cache-misses
associated with the remote wakeups, use the IPI only when the local and
the remote cpu's are from different cache domains. Otherwise use the
traditional remote wakeup.
With this, context-switch microbenchmark performed 5 times better on the
8-socket NHM-EX system.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
kernel/sched/core.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -1481,12 +1481,35 @@ static int ttwu_activate_remote(struct t
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
#endif /* CONFIG_SMP */
+static int ttwu_share_cache(int this_cpu, int cpu)
+{
+#ifndef CONFIG_X86
+ struct sched_domain *sd;
+ int ret = 0;
+
+ rcu_read_lock();
+ for_each_domain(this_cpu, sd) {
+ if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+
+ ret = (sd->flags & SD_SHARE_PKG_RESOURCES);
+ break;
+ }
+ rcu_read_unlock();
+
+ return ret;
+#else
+ return per_cpu(cpu_llc_id, this_cpu) == per_cpu(cpu_llc_id, cpu);
+#endif
+}
+
static void ttwu_queue(struct task_struct *p, int cpu)
{
struct rq *rq = cpu_rq(cpu);
#if defined(CONFIG_SMP)
- if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+ if (sched_feat(TTWU_QUEUE) &&
+ !ttwu_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* sync clocks x-cpu */
ttwu_queue_remote(p, cpu);
return;
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-02 1:07 ` [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains Suresh Siddha
@ 2011-12-02 3:34 ` Mike Galbraith
2011-12-07 16:23 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2011-12-02 3:34 UTC (permalink / raw)
To: Suresh Siddha
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, alex.shi
On Thu, 2011-12-01 at 17:07 -0800, Suresh Siddha wrote:
> plain text document attachment
> (use_ttwu_queue_when_crossing_cache_domains.patch)
> From: Mike Galbraith <efault@gmx.de>
>
> Context-switch intensive microbenchmark on a 8-socket system had
> ~600K times more resched IPI's on each logical CPU because of the
> TTWU_QUEUE sched feature, which queues the task on the remote cpu's
> queue and completes the wakeup locally using an IPI.
>
> As the TTWU_QUEUE sched feature is for minimizing the cache-misses
> associated with the remote wakeups, use the IPI only when the local and
> the remote cpu's are from different cache domains. Otherwise use the
> traditional remote wakeup.
FYI, Peter has already (improved and) queued this patch.
> With this, context-switch microbenchmark performed 5 times better on the
> 8-socket NHM-EX system.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
> kernel/sched/core.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> Index: tip/kernel/sched/core.c
> ===================================================================
> --- tip.orig/kernel/sched/core.c
> +++ tip/kernel/sched/core.c
> @@ -1481,12 +1481,35 @@ static int ttwu_activate_remote(struct t
> #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> #endif /* CONFIG_SMP */
>
> +static int ttwu_share_cache(int this_cpu, int cpu)
> +{
> +#ifndef CONFIG_X86
> + struct sched_domain *sd;
> + int ret = 0;
> +
> + rcu_read_lock();
> + for_each_domain(this_cpu, sd) {
> + if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
> + continue;
> +
> + ret = (sd->flags & SD_SHARE_PKG_RESOURCES);
> + break;
> + }
> + rcu_read_unlock();
> +
> + return ret;
> +#else
> + return per_cpu(cpu_llc_id, this_cpu) == per_cpu(cpu_llc_id, cpu);
> +#endif
> +}
> +
> static void ttwu_queue(struct task_struct *p, int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
>
> #if defined(CONFIG_SMP)
> - if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
> + if (sched_feat(TTWU_QUEUE) &&
> + !ttwu_share_cache(smp_processor_id(), cpu)) {
> sched_clock_cpu(cpu); /* sync clocks x-cpu */
> ttwu_queue_remote(p, cpu);
> return;
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-02 3:34 ` Mike Galbraith
@ 2011-12-07 16:23 ` Peter Zijlstra
2011-12-07 19:20 ` Suresh Siddha
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-12-07 16:23 UTC (permalink / raw)
To: Mike Galbraith
Cc: Suresh Siddha, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, alex.shi
On Fri, 2011-12-02 at 04:34 +0100, Mike Galbraith wrote:
> On Thu, 2011-12-01 at 17:07 -0800, Suresh Siddha wrote:
> > plain text document attachment
> > (use_ttwu_queue_when_crossing_cache_domains.patch)
> > From: Mike Galbraith <efault@gmx.de>
> >
> > Context-switch intensive microbenchmark on a 8-socket system had
> > ~600K times more resched IPI's on each logical CPU because of the
> > TTWU_QUEUE sched feature, which queues the task on the remote cpu's
> > queue and completes the wakeup locally using an IPI.
> >
> > As the TTWU_QUEUE sched feature is for minimizing the cache-misses
> > associated with the remote wakeups, use the IPI only when the local and
> > the remote cpu's are from different cache domains. Otherwise use the
> > traditional remote wakeup.
>
> FYI, Peter has already (improved and) queued this patch.
In fact, Ingo (rightfully) refused to take this due to the x86 specific
code in scheduler guts..
Initially the idea was to provide a new arch interface and a fallback
and do the Kconfig thing etc. After a bit of thought I decided against
that for we already have that information in the sched_domain tree
anyway and it should be a simple matter of representing things
differently.
This led to the below patch, which seems to boot on my box. I still hate
the sd_top_spr* names but whatever.. ;-)
---
kernel/sched/core.c | 36 +++++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 24 +-----------------------
kernel/sched/sched.h | 42 ++++++++++++++++++++++++++++++++++++------
3 files changed, 72 insertions(+), 30 deletions(-)
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -1511,6 +1511,12 @@ static int ttwu_activate_remote(struct t
}
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+
+static inline int ttwu_share_cache(int this_cpu, int that_cpu)
+{
+ return per_cpu(sd_top_spr_id, this_cpu) ==
+ per_cpu(sd_top_spr_id, that_cpu);
+}
#endif /* CONFIG_SMP */
static void ttwu_queue(struct task_struct *p, int cpu)
@@ -1518,7 +1524,7 @@ static void ttwu_queue(struct task_struc
struct rq *rq = cpu_rq(cpu);
#if defined(CONFIG_SMP)
- if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+ if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* sync clocks x-cpu */
ttwu_queue_remote(p, cpu);
return;
@@ -5751,6 +5757,32 @@ static void destroy_sched_domains(struct
}
/*
+ * Keep a special pointer to the highest sched_domain that has
+ * SD_SHARE_PKG_RESOURCE set (Last Level Cache Domain) for this
+ * allows us to avoid some pointer chasing select_idle_sibling().
+ *
+ * Also keep a unique ID per domain (we use the first cpu number in
+ * the cpumask of the domain), this allows us to quickly tell if
+ * two cpus are in the same cache domain, see ttwu_share_cache().
+ */
+DEFINE_PER_CPU(struct sched_domain *, sd_top_spr);
+DEFINE_PER_CPU(int, sd_top_spr_id);
+
+static void update_top_cache_domain(int cpu)
+{
+ struct sched_domain *sd;
+ int id = -1;
+
+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+ if (sd)
+ id = cpumask_first(sched_domain_span(sd));
+
+
+ rcu_assign_pointer(per_cpu(sd_top_spr, cpu), sd);
+ per_cpu(sd_top_spr_id, cpu) = id;
+}
+
+/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
*/
@@ -5789,6 +5821,8 @@ cpu_attach_domain(struct sched_domain *s
tmp = rq->sd;
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
+
+ update_top_cache_domain(cpu);
}
/* cpus with isolated domains */
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2644,28 +2644,6 @@ find_idlest_cpu(struct sched_group *grou
return idlest;
}
-/**
- * highest_flag_domain - Return highest sched_domain containing flag.
- * @cpu: The cpu whose highest level of sched domain is to
- * be returned.
- * @flag: The flag to check for the highest sched_domain
- * for the given cpu.
- *
- * Returns the highest sched_domain of a cpu which contains the given flag.
- */
-static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
-{
- struct sched_domain *sd, *hsd = NULL;
-
- for_each_domain(cpu, sd) {
- if (!(sd->flags & flag))
- break;
- hsd = sd;
- }
-
- return hsd;
-}
-
/*
* Try and locate an idle CPU in the sched_domain.
*/
@@ -2696,7 +2674,7 @@ static int select_idle_sibling(struct ta
*/
rcu_read_lock();
- sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+ sd = rcu_dereference(per_cpu(sd_top_spr, target));
for_each_lower_domain(sd) {
sg = sd->groups;
do {
Index: linux-2.6/kernel/sched/sched.h
===================================================================
--- linux-2.6.orig/kernel/sched/sched.h
+++ linux-2.6/kernel/sched/sched.h
@@ -487,6 +487,14 @@ static inline int cpu_of(struct rq *rq)
DECLARE_PER_CPU(struct rq, runqueues);
+#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
+#define this_rq() (&__get_cpu_var(runqueues))
+#define task_rq(p) cpu_rq(task_cpu(p))
+#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
+#define raw_rq() (&__raw_get_cpu_var(runqueues))
+
+#ifdef CONFIG_SMP
+
#define rcu_dereference_check_sched_domain(p) \
rcu_dereference_check((p), \
lockdep_is_held(&sched_domains_mutex))
@@ -499,15 +507,37 @@ DECLARE_PER_CPU(struct rq, runqueues);
* preempt-disabled sections.
*/
#define for_each_domain(cpu, __sd) \
- for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
+ for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
+ __sd; __sd = __sd->parent)
#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
-#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
-#define this_rq() (&__get_cpu_var(runqueues))
-#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
-#define raw_rq() (&__raw_get_cpu_var(runqueues))
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu: The cpu whose highest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the highest sched_domain
+ * for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd, *hsd = NULL;
+
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & flag))
+ break;
+ hsd = sd;
+ }
+
+ return hsd;
+}
+
+DECLARE_PER_CPU(struct sched_domain *, sd_top_spr);
+DECLARE_PER_CPU(int, sd_top_spr_id);
+
+#endif /* CONFIG_SMP */
#include "stats.h"
#include "auto_group.h"
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-07 16:23 ` Peter Zijlstra
@ 2011-12-07 19:20 ` Suresh Siddha
2011-12-08 6:06 ` Mike Galbraith
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Suresh Siddha @ 2011-12-07 19:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mike Galbraith, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex
On Wed, 2011-12-07 at 08:23 -0800, Peter Zijlstra wrote:
> In fact, Ingo (rightfully) refused to take this due to the x86 specific
> code in scheduler guts..
I noticed this while reviewing/testing the patch and left it because on
the microbenchmark, it had measurable impact of using the direct check
as compared to going through the scheduler domains every time to figure
out if we shared the LLC or not.
> Initially the idea was to provide a new arch interface and a fallback
> and do the Kconfig thing etc. After a bit of thought I decided against
> that for we already have that information in the sched_domain tree
> anyway and it should be a simple matter of representing things
> differently.
This def looks better.
> +DEFINE_PER_CPU(int, sd_top_spr_id);
pkg_srid (shared resource id)?
> +
> +static void update_top_cache_domain(int cpu)
> +{
> + struct sched_domain *sd;
> + int id = -1;
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> + if (sd)
> + id = cpumask_first(sched_domain_span(sd));
if there is no sd with shared pkg resources, then id has to be set to
'cpu'.
> + rcu_assign_pointer(per_cpu(sd_top_spr, cpu), sd);
> + per_cpu(sd_top_spr_id, cpu) = id;
Otherwise it looks good.
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-07 19:20 ` Suresh Siddha
@ 2011-12-08 6:06 ` Mike Galbraith
2011-12-08 9:41 ` Peter Zijlstra
2011-12-08 9:29 ` Peter Zijlstra
2011-12-08 10:02 ` Peter Zijlstra
2 siblings, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2011-12-08 6:06 UTC (permalink / raw)
To: Suresh Siddha
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex
On Wed, 2011-12-07 at 11:20 -0800, Suresh Siddha wrote:
> On Wed, 2011-12-07 at 08:23 -0800, Peter Zijlstra wrote:
> > In fact, Ingo (rightfully) refused to take this due to the x86 specific
> > code in scheduler guts..
>
> I noticed this while reviewing/testing the patch and left it because on
> the microbenchmark, it had measurable impact of using the direct check
> as compared to going through the scheduler domains every time to figure
> out if we shared the LLC or not.
>
> > Initially the idea was to provide a new arch interface and a fallback
> > and do the Kconfig thing etc. After a bit of thought I decided against
> > that for we already have that information in the sched_domain tree
> > anyway and it should be a simple matter of representing things
> > differently.
>
> This def looks better.
Yup.
> > +DEFINE_PER_CPU(int, sd_top_spr_id);
>
> pkg_srid (shared resource id)?
>
> > +
> > +static void update_top_cache_domain(int cpu)
> > +{
> > + struct sched_domain *sd;
> > + int id = -1;
> > +
> > + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> > + if (sd)
> > + id = cpumask_first(sched_domain_span(sd));
>
> if there is no sd with shared pkg resources, then id has to be set to
> 'cpu'.
>
> > + rcu_assign_pointer(per_cpu(sd_top_spr, cpu), sd);
> > + per_cpu(sd_top_spr_id, cpu) = id;
>
> Otherwise it looks good.
>
> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Mike Galbraith <efault@gmx.de>
The stable regression fix should look about like so then, yes?
From: Peter Zijlstra <peterz@infradead.org>
sched, ttwu_queue: queue remote wakeups only when crossing cache domains
<Insert Peter's final changelog>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: stable@kernel.org # v3.0+
---
kernel/sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 50 insertions(+), 1 deletion(-)
Index: linux-3.0/kernel/sched.c
===================================================================
--- linux-3.0.orig/kernel/sched.c
+++ linux-3.0/kernel/sched.c
@@ -2636,6 +2636,19 @@ static int ttwu_activate_remote(struct t
}
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+
+/*
+ * Keep a unique ID per domain (we use the first cpu number in
+ * the cpumask of the domain), this allows us to quickly tell if
+ * two cpus are in the same cache domain, see ttwu_share_cache().
+ */
+DEFINE_PER_CPU(int, sd_top_spr_id);
+
+static inline int ttwu_share_cache(int this_cpu, int that_cpu)
+{
+ return per_cpu(sd_top_spr_id, this_cpu) ==
+ per_cpu(sd_top_spr_id, that_cpu);
+}
#endif /* CONFIG_SMP */
static void ttwu_queue(struct task_struct *p, int cpu)
@@ -2643,7 +2656,7 @@ static void ttwu_queue(struct task_struc
struct rq *rq = cpu_rq(cpu);
#if defined(CONFIG_SMP)
- if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+ if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* sync clocks x-cpu */
ttwu_queue_remote(p, cpu);
return;
@@ -6858,6 +6871,40 @@ static void destroy_sched_domains(struct
destroy_sched_domain(sd, cpu);
}
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu: The cpu whose highest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the highest sched_domain
+ * for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd, *hsd = NULL;
+
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & flag))
+ break;
+ hsd = sd;
+ }
+
+ return hsd;
+}
+
+static void update_top_cache_domain(int cpu)
+{
+ struct sched_domain *sd;
+ int id = cpu;
+
+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+ if (sd)
+ id = cpumask_first(sched_domain_span(sd));
+
+ per_cpu(sd_top_spr_id, cpu) = id;
+}
+
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
@@ -6897,6 +6944,8 @@ cpu_attach_domain(struct sched_domain *s
tmp = rq->sd;
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
+
+ update_top_cache_domain(cpu);
}
/* cpus with isolated domains */
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-08 6:06 ` Mike Galbraith
@ 2011-12-08 9:41 ` Peter Zijlstra
0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2011-12-08 9:41 UTC (permalink / raw)
To: Mike Galbraith
Cc: Suresh Siddha, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex
On Thu, 2011-12-08 at 07:06 +0100, Mike Galbraith wrote:
> The stable regression fix should look about like so then, yes?
Something like that, yes.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-07 19:20 ` Suresh Siddha
2011-12-08 6:06 ` Mike Galbraith
@ 2011-12-08 9:29 ` Peter Zijlstra
2011-12-08 19:34 ` Suresh Siddha
2011-12-08 10:02 ` Peter Zijlstra
2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-12-08 9:29 UTC (permalink / raw)
To: Suresh Siddha
Cc: Mike Galbraith, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex
On Wed, 2011-12-07 at 11:20 -0800, Suresh Siddha wrote:
> > +DEFINE_PER_CPU(int, sd_top_spr_id);
>
> pkg_srid (shared resource id)?
How about sd_llc{,_id} ?
> > +
> > +static void update_top_cache_domain(int cpu)
> > +{
> > + struct sched_domain *sd;
> > + int id = -1;
> > +
> > + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> > + if (sd)
> > + id = cpumask_first(sched_domain_span(sd));
>
> if there is no sd with shared pkg resources, then id has to be set to
> 'cpu'.
Ah, right. I hadn't considered the case where the LLC isn't shared at
all.
> > + rcu_assign_pointer(per_cpu(sd_top_spr, cpu), sd);
> > + per_cpu(sd_top_spr_id, cpu) = id;
>
> Otherwise it looks good.
Thanks!
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-08 9:29 ` Peter Zijlstra
@ 2011-12-08 19:34 ` Suresh Siddha
2011-12-08 21:50 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Suresh Siddha @ 2011-12-08 19:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mike Galbraith, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex
On Thu, 2011-12-08 at 01:29 -0800, Peter Zijlstra wrote:
> On Wed, 2011-12-07 at 11:20 -0800, Suresh Siddha wrote:
>
> > > +DEFINE_PER_CPU(int, sd_top_spr_id);
> >
> > pkg_srid (shared resource id)?
>
> How about sd_llc{,_id} ?
There is a reason why I didn't mention it ;)
It is not always LLC. In power-savings mode, we group the cores sharing
the power-domain (typically the whole package) irrespective of one more
multiple LLC's etc.
Anyways, it is just a name.
thanks,
suresh
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-08 19:34 ` Suresh Siddha
@ 2011-12-08 21:50 ` Peter Zijlstra
2011-12-08 21:51 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-12-08 21:50 UTC (permalink / raw)
To: Suresh Siddha
Cc: Mike Galbraith, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex
On Thu, 2011-12-08 at 11:34 -0800, Suresh Siddha wrote:
> On Thu, 2011-12-08 at 01:29 -0800, Peter Zijlstra wrote:
> > On Wed, 2011-12-07 at 11:20 -0800, Suresh Siddha wrote:
> >
> > > > +DEFINE_PER_CPU(int, sd_top_spr_id);
> > >
> > > pkg_srid (shared resource id)?
> >
> > How about sd_llc{,_id} ?
>
> There is a reason why I didn't mention it ;)
>
> It is not always LLC. In power-savings mode, we group the cores sharing
> the power-domain (typically the whole package) irrespective of one more
> multiple LLC's etc.
>
> Anyways, it is just a
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-08 21:50 ` Peter Zijlstra
@ 2011-12-08 21:51 ` Peter Zijlstra
0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2011-12-08 21:51 UTC (permalink / raw)
To: Suresh Siddha
Cc: Mike Galbraith, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex
On Thu, 2011-12-08 at 22:50 +0100, Peter Zijlstra wrote:
> On Thu, 2011-12-08 at 11:34 -0800, Suresh Siddha wrote:
> > On Thu, 2011-12-08 at 01:29 -0800, Peter Zijlstra wrote:
> > > On Wed, 2011-12-07 at 11:20 -0800, Suresh Siddha wrote:
> > >
> > > > > +DEFINE_PER_CPU(int, sd_top_spr_id);
> > > >
> > > > pkg_srid (shared resource id)?
> > >
> > > How about sd_llc{,_id} ?
> >
> > There is a reason why I didn't mention it ;)
> >
> > It is not always LLC. In power-savings mode, we group the cores sharing
> > the power-domain (typically the whole package) irrespective of one more
> > multiple LLC's etc.
> >
> > Anyways, it is just a
This sending of email stuff is hard..
What I wanted to say is that I consider that an x86 arch bug that needs
fixing.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains
2011-12-07 19:20 ` Suresh Siddha
2011-12-08 6:06 ` Mike Galbraith
2011-12-08 9:29 ` Peter Zijlstra
@ 2011-12-08 10:02 ` Peter Zijlstra
2011-12-21 11:41 ` [tip:sched/core] sched: Only queue remote wakeups when crossing cache boundaries tip-bot for Peter Zijlstra
2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-12-08 10:02 UTC (permalink / raw)
To: Suresh Siddha
Cc: Mike Galbraith, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
linux-kernel, Tim Chen, Shi, Alex, Chris Mason, Dave Kleikamp
Subject: sched: Only queue remote wakeups when crossing cache boundaries
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 07 15:07:31 CET 2011
Mike reported a 13% drop in netperf TCP_RR performance due to the new
remote wakeup code. Suresh too noticed some performance issues with
it.
Reducing the IPIs to only cross cache domains solves the observed
performance issues.
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: Suresh Siddha <suresh.b.siddha@intel.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched/core.c | 34 +++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 24 +-----------------------
kernel/sched/sched.h | 42 ++++++++++++++++++++++++++++++++++++------
3 files changed, 70 insertions(+), 30 deletions(-)
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -1511,6 +1511,11 @@ static int ttwu_activate_remote(struct t
}
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+
+static inline int ttwu_share_cache(int this_cpu, int that_cpu)
+{
+ return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
+}
#endif /* CONFIG_SMP */
static void ttwu_queue(struct task_struct *p, int cpu)
@@ -1518,7 +1523,7 @@ static void ttwu_queue(struct task_struc
struct rq *rq = cpu_rq(cpu);
#if defined(CONFIG_SMP)
- if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+ if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* sync clocks x-cpu */
ttwu_queue_remote(p, cpu);
return;
@@ -5751,6 +5756,31 @@ static void destroy_sched_domains(struct
}
/*
+ * Keep a special pointer to the highest sched_domain that has
+ * SD_SHARE_PKG_RESOURCE set (Last Level Cache Domain) for this
+ * allows us to avoid some pointer chasing select_idle_sibling().
+ *
+ * Also keep a unique ID per domain (we use the first cpu number in
+ * the cpumask of the domain), this allows us to quickly tell if
+ * two cpus are in the same cache domain, see ttwu_share_cache().
+ */
+DEFINE_PER_CPU(struct sched_domain *, sd_llc);
+DEFINE_PER_CPU(int, sd_llc_id);
+
+static void update_top_cache_domain(int cpu)
+{
+ struct sched_domain *sd;
+ int id = cpu;
+
+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+ if (sd)
+ id = cpumask_first(sched_domain_span(sd));
+
+ rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
+ per_cpu(sd_llc_id, cpu) = id;
+}
+
+/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
*/
@@ -5789,6 +5819,8 @@ cpu_attach_domain(struct sched_domain *s
tmp = rq->sd;
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
+
+ update_top_cache_domain(cpu);
}
/* cpus with isolated domains */
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2644,28 +2644,6 @@ find_idlest_cpu(struct sched_group *grou
return idlest;
}
-/**
- * highest_flag_domain - Return highest sched_domain containing flag.
- * @cpu: The cpu whose highest level of sched domain is to
- * be returned.
- * @flag: The flag to check for the highest sched_domain
- * for the given cpu.
- *
- * Returns the highest sched_domain of a cpu which contains the given flag.
- */
-static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
-{
- struct sched_domain *sd, *hsd = NULL;
-
- for_each_domain(cpu, sd) {
- if (!(sd->flags & flag))
- break;
- hsd = sd;
- }
-
- return hsd;
-}
-
/*
* Try and locate an idle CPU in the sched_domain.
*/
@@ -2696,7 +2674,7 @@ static int select_idle_sibling(struct ta
*/
rcu_read_lock();
- sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+ sd = rcu_dereference(per_cpu(sd_llc, target));
for_each_lower_domain(sd) {
sg = sd->groups;
do {
Index: linux-2.6/kernel/sched/sched.h
===================================================================
--- linux-2.6.orig/kernel/sched/sched.h
+++ linux-2.6/kernel/sched/sched.h
@@ -487,6 +487,14 @@ static inline int cpu_of(struct rq *rq)
DECLARE_PER_CPU(struct rq, runqueues);
+#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
+#define this_rq() (&__get_cpu_var(runqueues))
+#define task_rq(p) cpu_rq(task_cpu(p))
+#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
+#define raw_rq() (&__raw_get_cpu_var(runqueues))
+
+#ifdef CONFIG_SMP
+
#define rcu_dereference_check_sched_domain(p) \
rcu_dereference_check((p), \
lockdep_is_held(&sched_domains_mutex))
@@ -499,15 +507,37 @@ DECLARE_PER_CPU(struct rq, runqueues);
* preempt-disabled sections.
*/
#define for_each_domain(cpu, __sd) \
- for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
+ for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
+ __sd; __sd = __sd->parent)
#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
-#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
-#define this_rq() (&__get_cpu_var(runqueues))
-#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
-#define raw_rq() (&__raw_get_cpu_var(runqueues))
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu: The cpu whose highest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the highest sched_domain
+ * for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd, *hsd = NULL;
+
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & flag))
+ break;
+ hsd = sd;
+ }
+
+ return hsd;
+}
+
+DECLARE_PER_CPU(struct sched_domain *, sd_llc);
+DECLARE_PER_CPU(int, sd_llc_id);
+
+#endif /* CONFIG_SMP */
#include "stats.h"
#include "auto_group.h"
^ permalink raw reply [flat|nested] 30+ messages in thread* [tip:sched/core] sched: Only queue remote wakeups when crossing cache boundaries
2011-12-08 10:02 ` Peter Zijlstra
@ 2011-12-21 11:41 ` tip-bot for Peter Zijlstra
0 siblings, 0 replies; 30+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-12-21 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, dave.kleikamp, efault,
chris.mason, suresh.b.siddha, tglx, mingo
Commit-ID: 518cd62341786aa4e3839810832af2fbc0de1ea4
Gitweb: http://git.kernel.org/tip/518cd62341786aa4e3839810832af2fbc0de1ea4
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Wed, 7 Dec 2011 15:07:31 +0100
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 21 Dec 2011 10:34:44 +0100
sched: Only queue remote wakeups when crossing cache boundaries
Mike reported a 13% drop in netperf TCP_RR performance due to the
new remote wakeup code. Suresh too noticed some performance issues
with it.
Reducing the IPIs to only cross cache domains solves the observed
performance issues.
Reported-by: Suresh Siddha <suresh.b.siddha@intel.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched/core.c | 34 +++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 24 +-----------------------
kernel/sched/sched.h | 42 ++++++++++++++++++++++++++++++++++++------
3 files changed, 70 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cdf51a2..dba878c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1511,6 +1511,11 @@ static int ttwu_activate_remote(struct task_struct *p, int wake_flags)
}
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+
+static inline int ttwu_share_cache(int this_cpu, int that_cpu)
+{
+ return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
+}
#endif /* CONFIG_SMP */
static void ttwu_queue(struct task_struct *p, int cpu)
@@ -1518,7 +1523,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
struct rq *rq = cpu_rq(cpu);
#if defined(CONFIG_SMP)
- if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+ if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* sync clocks x-cpu */
ttwu_queue_remote(p, cpu);
return;
@@ -5744,6 +5749,31 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
}
/*
+ * Keep a special pointer to the highest sched_domain that has
+ * SD_SHARE_PKG_RESOURCE set (Last Level Cache Domain) for this
+ * allows us to avoid some pointer chasing select_idle_sibling().
+ *
+ * Also keep a unique ID per domain (we use the first cpu number in
+ * the cpumask of the domain), this allows us to quickly tell if
+ * two cpus are in the same cache domain, see ttwu_share_cache().
+ */
+DEFINE_PER_CPU(struct sched_domain *, sd_llc);
+DEFINE_PER_CPU(int, sd_llc_id);
+
+static void update_top_cache_domain(int cpu)
+{
+ struct sched_domain *sd;
+ int id = cpu;
+
+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+ if (sd)
+ id = cpumask_first(sched_domain_span(sd));
+
+ rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
+ per_cpu(sd_llc_id, cpu) = id;
+}
+
+/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
*/
@@ -5782,6 +5812,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
tmp = rq->sd;
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
+
+ update_top_cache_domain(cpu);
}
/* cpus with isolated domains */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4d2b7a..2237ffe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2644,28 +2644,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
return idlest;
}
-/**
- * highest_flag_domain - Return highest sched_domain containing flag.
- * @cpu: The cpu whose highest level of sched domain is to
- * be returned.
- * @flag: The flag to check for the highest sched_domain
- * for the given cpu.
- *
- * Returns the highest sched_domain of a cpu which contains the given flag.
- */
-static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
-{
- struct sched_domain *sd, *hsd = NULL;
-
- for_each_domain(cpu, sd) {
- if (!(sd->flags & flag))
- break;
- hsd = sd;
- }
-
- return hsd;
-}
-
/*
* Try and locate an idle CPU in the sched_domain.
*/
@@ -2696,7 +2674,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
*/
rcu_read_lock();
- sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+ sd = rcu_dereference(per_cpu(sd_llc, target));
for_each_lower_domain(sd) {
sg = sd->groups;
do {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d8d3613..98c0c26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -487,6 +487,14 @@ static inline int cpu_of(struct rq *rq)
DECLARE_PER_CPU(struct rq, runqueues);
+#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
+#define this_rq() (&__get_cpu_var(runqueues))
+#define task_rq(p) cpu_rq(task_cpu(p))
+#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
+#define raw_rq() (&__raw_get_cpu_var(runqueues))
+
+#ifdef CONFIG_SMP
+
#define rcu_dereference_check_sched_domain(p) \
rcu_dereference_check((p), \
lockdep_is_held(&sched_domains_mutex))
@@ -499,15 +507,37 @@ DECLARE_PER_CPU(struct rq, runqueues);
* preempt-disabled sections.
*/
#define for_each_domain(cpu, __sd) \
- for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
+ for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
+ __sd; __sd = __sd->parent)
#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
-#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
-#define this_rq() (&__get_cpu_var(runqueues))
-#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
-#define raw_rq() (&__raw_get_cpu_var(runqueues))
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu: The cpu whose highest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the highest sched_domain
+ * for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd, *hsd = NULL;
+
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & flag))
+ break;
+ hsd = sd;
+ }
+
+ return hsd;
+}
+
+DECLARE_PER_CPU(struct sched_domain *, sd_llc);
+DECLARE_PER_CPU(int, sd_llc_id);
+
+#endif /* CONFIG_SMP */
#include "stats.h"
#include "auto_group.h"
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [patch v3 6/6] sched: fix the sched group node allocation for SD_OVERLAP domain
2011-12-02 1:07 [patch v3 0/6] nohz idle load balancing patches Suresh Siddha
` (4 preceding siblings ...)
2011-12-02 1:07 ` [patch v3 5/6] sched, ttwu_queue: queue remote wakeups only when crossing cache domains Suresh Siddha
@ 2011-12-02 1:07 ` Suresh Siddha
5 siblings, 0 replies; 30+ messages in thread
From: Suresh Siddha @ 2011-12-02 1:07 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
Mike Galbraith
Cc: linux-kernel, Tim Chen, alex.shi, Suresh Siddha
[-- Attachment #1: fix_sched_group_node_allocation.patch --]
[-- Type: text/plain, Size: 786 bytes --]
For the SD_OVERLAP domain, sched_groups for each CPU's sched_domain are
privately allocated and not shared with any other CPU. So the
sched_group allocation should come from the node of the CPU for which
the SD_OVERLAP sched domain is being set up.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -5902,7 +5902,7 @@ build_overlap_sched_groups(struct sched_
continue;
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_node(cpu));
if (!sg)
goto fail;
^ permalink raw reply [flat|nested] 30+ messages in thread