public inbox for linux-kernel@vger.kernel.org
* [patch 0/2] sched: Change nohz ilb logic from pull to push model
@ 2009-12-11  1:27 venkatesh.pallipadi
  2009-12-11  1:27 ` [patch 1/2] sched: Change the " venkatesh.pallipadi
  2009-12-11  1:27 ` [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
  0 siblings, 2 replies; 13+ messages in thread
From: venkatesh.pallipadi @ 2009-12-11  1:27 UTC (permalink / raw)
  To: Peter Zijlstra, Gautham R Shenoy, Vaidyanathan Srinivasan
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

This is a followup to the RFC here:
http://lkml.indiana.edu/hypermail/linux/kernel/0906.2/01163.html

We have made a few cleanups since that RFC and also have some data
showing the impact of this change.

Description:
The existing nohz idle load balance logic uses a pull model: one idle
load balancer CPU is nominated on any partially idle system, and that
balancer CPU does not go into nohz mode. Using its periodic tick, the
balancer does idle balancing on behalf of all the CPUs in nohz mode.

This is not optimal and has a few issues:
* The balancer will continue to take periodic ticks and wake up
  frequently (at HZ rate), even though it may not have any rebalancing
  to do on behalf of any of the idle CPUs.
* On x86, with CPUs whose APIC timer stops in idle, this periodic
  wakeup can result in an additional interrupt on the CPU doing the
  timer broadcast.
* The balancer may end up spending a lot of time doing the balancing on
  behalf of nohz CPUs, especially with increasing number of sockets and
  cores in the platform.

The alternative is to have a push model, where all idle CPUs can enter nohz
mode and any busy CPU kicks one of the idle CPUs to take care of idle
balancing on behalf of a group of idle CPUs.

The following patches switch the idle load balancer to this push approach.

Data:
1) Running bunzip2 on a big file (which happened to be a kernel tarball),
on a netbook with HZ=1000.

Before the change
57.44user 12.36system 1:12.17elapsed

After the change
47.89user 10.31system 0:59.99elapsed

That is a ~10 second (17%) saving in user time for this task. The gain
comes from the idle SMT sibling thread no longer being woken up 1000
times a second to do unnecessary idle load balancing, which was slowing
down the thread running the load.

2) Running bzip2 on a big file (which happened to be a kernel tarball),
on a dual socket server with HZ=1000.

No change in performance, but there is a noticeable (1% - 1.5% range)
reduction in energy consumption. This is due to the idle load balancer
that was unnecessarily woken up on the second socket with the earlier
pull model. With the new push model, the second socket is not woken up
as often and can stay in a low power idle state.

3) We also measured the SPECjbb workload with a varying number of
warehouses and did not see any noticeable change in performance with
this patch.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

-- 



* [patch 1/2] sched: Change the nohz ilb logic from pull to push model
  2009-12-11  1:27 [patch 0/2] sched: Change nohz ilb logic from pull to push model venkatesh.pallipadi
@ 2009-12-11  1:27 ` venkatesh.pallipadi
  2009-12-14 22:18   ` Peter Zijlstra
  2009-12-21 12:13   ` Peter Zijlstra
  2009-12-11  1:27 ` [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
  1 sibling, 2 replies; 13+ messages in thread
From: venkatesh.pallipadi @ 2009-12-11  1:27 UTC (permalink / raw)
  To: Peter Zijlstra, Gautham R Shenoy, Vaidyanathan Srinivasan
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: 0001-sched-Change-the-nohz-ilb-logic-from-pull-to-push-m.patch --]
[-- Type: text/plain, Size: 16334 bytes --]

In the new push model, all idle CPUs do go into nohz mode. There is
still the concept of an idle load balancer: a busy CPU kicks the nohz
balancer when any of the nohz CPUs needs idle load balancing.
The kicked CPU then does idle load balancing on behalf of all the idle
CPUs, instead of the normal idle balance.

This addresses two problems with the current nohz ilb logic:
* The balancer will continue to take periodic ticks and wake up
  frequently, even though it may not have any rebalancing to do on
  behalf of any of the idle CPUs.
* On x86, with CPUs whose APIC timer stops in idle, this periodic
  wakeup can result in an additional interrupt on the CPU doing
  the timer broadcast.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 include/linux/sched.h    |    7 +-
 kernel/sched.c           |  256 ++++++++++++++++++++++++++++------------------
 kernel/time/tick-sched.c |    8 +--
 3 files changed, 160 insertions(+), 111 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..15c05c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -266,13 +266,10 @@ extern void task_rq_unlock_wait(struct task_struct *p);
 
 extern cpumask_var_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
-extern int select_nohz_load_balancer(int cpu);
+extern void select_nohz_load_balancer(int stop_tick);
 extern int get_nohz_load_balancer(void);
 #else
-static inline int select_nohz_load_balancer(int cpu)
-{
-	return 0;
-}
+static inline void select_nohz_load_balancer(int stop_tick) { }
 #endif
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c11ae0..aea2e32 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -536,7 +536,7 @@ struct rq {
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 #ifdef CONFIG_NO_HZ
 	unsigned long last_tick_seen;
-	unsigned char in_nohz_recently;
+	unsigned char nohz_balance_kick;
 #endif
 	/* capture load from *all* tasks on this cpu: */
 	struct load_weight load;
@@ -4507,12 +4507,45 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
 }
 
 #ifdef CONFIG_NO_HZ
+
+/*
+ * idle load balancing details
+ * - One of the idle CPUs nominates itself as idle load_balancer, while
+ *   entering idle.
+ * - With previous logic, this idle load balancer CPU will not go into
+ *   tickless mode when it is idle and does the idle load balancing for
+ *   all the idle CPUs.
+ * - With new logic, this idle load balancer CPU will also go into
+ *   tickless mode when it is idle, just like all other idle CPUs
+ * - When one of the busy CPUs notice that there may be an idle rebalancing
+ *   needed, they will kick the idle load balancer, which then does idle
+ *   load balancing for all the idle CPUs.
+ * - As idle load balancing looks at the load of all the CPUs, not all busy
+ *   CPUs need to do this idle load balancer kick.
+ * - first_pick_cpu is the one of the busy CPUs which will kick
+ *   idle load balancer when it has more than one process active. This
+ *   eliminates the need for idle load balancing altogether when we have
+ *   only one running process in the system (common case).
+ * - If there are more than one busy CPU, idle load balancer may have
+ *   to run for active_load_balance to happen (i.e., two busy CPUs are
+ *   SMT or core siblings and can run better if they move to different
+ *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
+ *   which will kick idle load balancer as soon as it has any load.
+ * - With previous logic, idle load balancer used to run at every tick.
+ *   With new logic, idle load balancer tracks the rq->next_balance for all
+ *   the idle CPUs and does idle load balancing only when needed.
+ */
 static struct {
 	atomic_t load_balancer;
-	cpumask_var_t cpu_mask;
-	cpumask_var_t ilb_grp_nohz_mask;
+	atomic_t first_pick_cpu;
+	atomic_t second_pick_cpu;
+	cpumask_var_t idle_cpus_mask;
+	cpumask_var_t tmp_nohz_mask;
+	unsigned long next_balance;	/* in jiffy units */
 } nohz ____cacheline_aligned = {
 	.load_balancer = ATOMIC_INIT(-1),
+	.first_pick_cpu = ATOMIC_INIT(-1),
+	.second_pick_cpu = ATOMIC_INIT(-1),
 };
 
 int get_nohz_load_balancer(void)
@@ -4567,17 +4600,17 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
  */
 static inline int is_semi_idle_group(struct sched_group *ilb_group)
 {
-	cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+	cpumask_and(nohz.tmp_nohz_mask, nohz.idle_cpus_mask,
 					sched_group_cpus(ilb_group));
 
 	/*
 	 * A sched_group is semi-idle when it has atleast one busy cpu
 	 * and atleast one idle cpu.
 	 */
-	if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+	if (cpumask_empty(nohz.tmp_nohz_mask))
 		return 0;
 
-	if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+	if (cpumask_equal(nohz.tmp_nohz_mask, sched_group_cpus(ilb_group)))
 		return 0;
 
 	return 1;
@@ -4610,7 +4643,7 @@ static int find_new_ilb(int cpu)
 	 * Optimize for the case when we have no idle CPUs or only one
 	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
 	 */
-	if (cpumask_weight(nohz.cpu_mask) < 2)
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
 		goto out_done;
 
 	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
@@ -4618,7 +4651,7 @@ static int find_new_ilb(int cpu)
 
 		do {
 			if (is_semi_idle_group(ilb_group))
-				return cpumask_first(nohz.ilb_grp_nohz_mask);
+				return cpumask_first(nohz.tmp_nohz_mask);
 
 			ilb_group = ilb_group->next;
 
@@ -4626,45 +4659,61 @@ static int find_new_ilb(int cpu)
 	}
 
 out_done:
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
 static inline int find_new_ilb(int call_cpu)
 {
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #endif
 
 /*
+ * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
+ * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
+ * CPU (if there is one).
+*/
+static void nohz_balancer_kick(int cpu)
+{
+	int ilb_cpu;
+
+	nohz.next_balance++;
+
+	ilb_cpu = get_nohz_load_balancer();
+	if (ilb_cpu < 0) {
+		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+		if (ilb_cpu >= nr_cpu_ids)
+			return;
+	}
+
+	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
+		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
+		resched_cpu(ilb_cpu);
+	}
+	return;
+}
+
+/*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
- * load balancing on behalf of all those cpus. If all the cpus in the system
- * go into this tickless mode, then there will be no ilb owner (as there is
- * no need for one) and all the cpus will sleep till the next wakeup event
- * arrives...
+ * load balancing on behalf of all those cpus.
  *
- * For the ilb owner, tick is not stopped. And this tick will be used
- * for idle load balancing. ilb owner will still be part of
- * nohz.cpu_mask..
+ * When the ilb owner becomes busy, we will not have new ilb owner until some
+ * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
+ * idle load balancing by kicking one of the idle CPUs.
  *
- * While stopping the tick, this cpu will become the ilb owner if there
- * is no other owner. And will be the owner till that cpu becomes busy
- * or if all cpus in the system stop their ticks at which point
- * there is no need for ilb owner.
- *
- * When the ilb owner becomes busy, it nominates another owner, during the
- * next busy scheduler_tick()
+ * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
+ * ilb owner CPU in future (when there is a need for idle load balancing on
+ * behalf of all idle CPUs).
  */
-int select_nohz_load_balancer(int stop_tick)
+void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
 	if (stop_tick) {
-		cpu_rq(cpu)->in_nohz_recently = 1;
-
 		if (!cpu_active(cpu)) {
 			if (atomic_read(&nohz.load_balancer) != cpu)
-				return 0;
+				return;
 
 			/*
 			 * If we are going offline and still the leader,
@@ -4673,28 +4722,20 @@ int select_nohz_load_balancer(int stop_tick)
 			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
 				BUG();
 
-			return 0;
+			return;
 		}
 
-		cpumask_set_cpu(cpu, nohz.cpu_mask);
-
-		/* time for ilb owner also to sleep */
-		if (cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
-			if (atomic_read(&nohz.load_balancer) == cpu)
-				atomic_set(&nohz.load_balancer, -1);
-			return 0;
-		}
+		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_cmpxchg(&nohz.first_pick_cpu, cpu, -1);
+		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
 
 		if (atomic_read(&nohz.load_balancer) == -1) {
-			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
-				return 1;
-		} else if (atomic_read(&nohz.load_balancer) == cpu) {
 			int new_ilb;
 
-			if (!(sched_smt_power_savings ||
-						sched_mc_power_savings))
-				return 1;
+			/* make me the ilb owner */
+			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1)
+				return;
+
 			/*
 			 * Check to see if there is a more power-efficient
 			 * ilb.
@@ -4703,21 +4744,18 @@ int select_nohz_load_balancer(int stop_tick)
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
 				atomic_set(&nohz.load_balancer, -1);
 				resched_cpu(new_ilb);
-				return 0;
 			}
-			return 1;
 		}
 	} else {
-		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
-			return 0;
+		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
+			return;
 
-		cpumask_clear_cpu(cpu, nohz.cpu_mask);
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
 
 		if (atomic_read(&nohz.load_balancer) == cpu)
 			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
 				BUG();
 	}
-	return 0;
 }
 #endif
 
@@ -4801,8 +4839,6 @@ out:
 
 /*
  * run_rebalance_domains is triggered when needed from the scheduler tick.
- * In CONFIG_NO_HZ case, the idle load balance owner will do the
- * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
 static void run_rebalance_domains(struct softirq_action *h)
 {
@@ -4812,19 +4848,27 @@ static void run_rebalance_domains(struct softirq_action *h)
 						CPU_IDLE : CPU_NOT_IDLE;
 
 	rebalance_domains(this_cpu, idle);
+}
 
 #ifdef CONFIG_NO_HZ
+/*
+ * In CONFIG_NO_HZ case, the idle balance kickee will do the
+ * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ */
+static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
+{
+	rebalance_domains(this_cpu, CPU_IDLE);
+
 	/*
 	 * If this cpu is the owner for idle load balancing, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	if (this_rq->idle_at_tick &&
-	    atomic_read(&nohz.load_balancer) == this_cpu) {
+	if (this_rq->nohz_balance_kick) {
 		struct rq *rq;
 		int balance_cpu;
 
-		for_each_cpu(balance_cpu, nohz.cpu_mask) {
+		for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 			if (balance_cpu == this_cpu)
 				continue;
 
@@ -4842,68 +4886,68 @@ static void run_rebalance_domains(struct softirq_action *h)
 			if (time_after(this_rq->next_balance, rq->next_balance))
 				this_rq->next_balance = rq->next_balance;
 		}
+		nohz.next_balance = this_rq->next_balance;
+		this_rq->nohz_balance_kick = 0;
 	}
-#endif
 }
+#endif
 
 static inline int on_null_domain(int cpu)
 {
 	return !rcu_dereference(cpu_rq(cpu)->sd);
 }
-
 /*
- * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
- *
- * In case of CONFIG_NO_HZ, this is the place where we nominate a new
- * idle load balancing owner or decide to stop the periodic load balancing,
- * if the whole system is idle.
+ * Current heuristic for kicking the idle load balancer
+ * - first_pick_cpu is the one of the busy CPUs. It will kick
+ *   idle load balancer when it has more than one process active. This
+ *   eliminates the need for idle load balancing altogether when we have
+ *   only one running process in the system (common case).
+ * - If there are more than one busy CPU, idle load balancer may have
+ *   to run for active_load_balance to happen (i.e., two busy CPUs are
+ *   SMT or core siblings and can run better if they move to different
+ *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
+ *   which will kick idle load balancer as soon as it has any load.
  */
-static inline void trigger_load_balance(struct rq *rq, int cpu)
+static inline int nohz_kick_needed(struct rq *rq, int cpu)
 {
-#ifdef CONFIG_NO_HZ
-	/*
-	 * If we were in the nohz mode recently and busy at the current
-	 * scheduler tick, then check if we need to nominate new idle
-	 * load balancer.
-	 */
-	if (rq->in_nohz_recently && !rq->idle_at_tick) {
-		rq->in_nohz_recently = 0;
+	unsigned long now = jiffies;
+	int ret;
 
-		if (atomic_read(&nohz.load_balancer) == cpu) {
-			cpumask_clear_cpu(cpu, nohz.cpu_mask);
-			atomic_set(&nohz.load_balancer, -1);
-		}
+	if (time_before(now, nohz.next_balance))
+		return 0;
 
-		if (atomic_read(&nohz.load_balancer) == -1) {
-			int ilb = find_new_ilb(cpu);
+	if (!rq->nr_running)
+		return 0;
 
-			if (ilb < nr_cpu_ids)
-				resched_cpu(ilb);
+	ret = atomic_cmpxchg(&nohz.first_pick_cpu, -1, cpu);
+	if (ret == -1 || ret == cpu) {
+		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
+		if (rq->nr_running > 1)
+			return 1;
+	} else {
+		ret = atomic_cmpxchg(&nohz.second_pick_cpu, -1, cpu);
+		if (ret == -1 || ret == cpu) {
+			if (rq->nr_running)
+				return 1;
 		}
 	}
 
-	/*
-	 * If this cpu is idle and doing idle load balancing for all the
-	 * cpus with ticks stopped, is it time for that to stop?
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) == cpu &&
-	    cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
-		resched_cpu(cpu);
-		return;
-	}
+	return 0;
+}
 
-	/*
-	 * If this cpu is idle and the idle load balancing is done by
-	 * someone else, then no need raise the SCHED_SOFTIRQ
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) != cpu &&
-	    cpumask_test_cpu(cpu, nohz.cpu_mask))
-		return;
-#endif
+/*
+ * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
+ */
+static inline void trigger_load_balance(struct rq *rq, int cpu)
+{
 	/* Don't need to rebalance while attached to NULL domain */
 	if (time_after_eq(jiffies, rq->next_balance) &&
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
+#ifdef CONFIG_NO_HZ
+	else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+		nohz_balancer_kick(cpu);
+#endif
 }
 
 #else	/* CONFIG_SMP */
@@ -5446,8 +5490,19 @@ need_resched_nonpreemptible:
 
 	pre_schedule(rq, prev);
 
-	if (unlikely(!rq->nr_running))
+	if (unlikely(!rq->nr_running)) {
+#ifdef CONFIG_NO_HZ
+		if (rq->nohz_balance_kick) {
+			spin_unlock_irq(&rq->lock);
+			nohz_idle_balance(cpu, rq);
+			spin_lock_irq(&rq->lock);
+		} else {
+			idle_balance(cpu, rq);
+		}
+#else
 		idle_balance(cpu, rq);
+#endif
+	}
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
@@ -9521,6 +9576,9 @@ void __init sched_init(void)
 		rq->push_cpu = 0;
 		rq->cpu = i;
 		rq->online = 0;
+#ifdef CONFIG_NO_HZ
+		rq->nohz_balance_kick = 0;
+#endif
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
 		rq_attach_root(rq, &def_root_domain);
@@ -9568,8 +9626,8 @@ void __init sched_init(void)
 	zalloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
 #ifdef CONFIG_SMP
 #ifdef CONFIG_NO_HZ
-	zalloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
-	alloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	alloc_cpumask_var(&nohz.tmp_nohz_mask, GFP_NOWAIT);
 #endif
 	zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
 #endif /* SMP */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 89aed59..bfb615b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -326,13 +326,7 @@ void tick_nohz_stop_sched_tick(int inidle)
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			if (select_nohz_load_balancer(1)) {
-				/*
-				 * sched tick not stopped!
-				 */
-				cpumask_clear_cpu(cpu, nohz_cpu_mask);
-				goto out;
-			}
+			select_nohz_load_balancer(1);
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
-- 
1.6.0.6

-- 



* [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-12-11  1:27 [patch 0/2] sched: Change nohz ilb logic from pull to push model venkatesh.pallipadi
  2009-12-11  1:27 ` [patch 1/2] sched: Change the " venkatesh.pallipadi
@ 2009-12-11  1:27 ` venkatesh.pallipadi
  2009-12-14 22:21   ` Peter Zijlstra
  2009-12-21 13:11   ` Peter Zijlstra
  1 sibling, 2 replies; 13+ messages in thread
From: venkatesh.pallipadi @ 2009-12-11  1:27 UTC (permalink / raw)
  To: Peter Zijlstra, Gautham R Shenoy, Vaidyanathan Srinivasan
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: 0002-sched-Scale-the-nohz_tracker-logic-by-making-it-per.patch --]
[-- Type: text/plain, Size: 12019 bytes --]

Having one idle CPU do the rebalancing for all the idle CPUs in nohz
mode does not scale well with an increasing number of cores and
sockets. Make the nohz_tracker per NUMA node. This results in multiple
instances of idle load balancing, one per NUMA node, with each idle
load balancer rebalancing only among the other nohz CPUs in its own
NUMA node.

This addresses the following problem with the current nohz ilb logic:
* The lone balancer may end up spending a lot of time doing the
  balancing on behalf of nohz CPUs, especially with an increasing
  number of sockets and cores in the platform.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c |  177 +++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 136 insertions(+), 41 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index aea2e32..1cc1485 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4535,22 +4535,90 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
  *   With new logic, idle load balancer tracks the rq->next_balance for all
  *   the idle CPUs and does idle load balancing only when needed.
  */
-static struct {
+struct nohz_tracker {
 	atomic_t load_balancer;
 	atomic_t first_pick_cpu;
 	atomic_t second_pick_cpu;
 	cpumask_var_t idle_cpus_mask;
 	cpumask_var_t tmp_nohz_mask;
 	unsigned long next_balance;	/* in jiffy units */
-} nohz ____cacheline_aligned = {
-	.load_balancer = ATOMIC_INIT(-1),
-	.first_pick_cpu = ATOMIC_INIT(-1),
-	.second_pick_cpu = ATOMIC_INIT(-1),
 };
 
+static DEFINE_PER_CPU(struct nohz_tracker *, cpu_node_nohz_ptr);
+static struct nohz_tracker **nohz_tracker_ptrs;
+
+static int alloc_node_nohz_tracker(void)
+{
+	int i, j;
+
+	/* Do all the allocations only once per boot */
+	if (nohz_tracker_ptrs)
+		return 0;
+
+	nohz_tracker_ptrs = kzalloc(nr_node_ids * sizeof(struct nohz_tracker *),
+				    GFP_KERNEL);
+	if (!nohz_tracker_ptrs) {
+		printk(KERN_WARNING "Can not alloc nohz trackers\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nr_node_ids; i++) {
+		nohz_tracker_ptrs[i] = kzalloc_node(sizeof(struct nohz_tracker),
+						    GFP_KERNEL, i);
+		if (!nohz_tracker_ptrs[i]) {
+			printk(KERN_WARNING "Can not alloc domain group for "
+				"node %d\n", i);
+			goto free_ret;
+		}
+
+		if (!zalloc_cpumask_var_node(&nohz_tracker_ptrs[i]->idle_cpus_mask,
+					    GFP_KERNEL, i)) {
+			kfree(nohz_tracker_ptrs[i]);
+			goto free_ret;
+		}
+
+		if (!zalloc_cpumask_var_node(&nohz_tracker_ptrs[i]->tmp_nohz_mask,
+					    GFP_KERNEL, i)) {
+			free_cpumask_var(nohz_tracker_ptrs[i]->idle_cpus_mask);
+			kfree(nohz_tracker_ptrs[i]);
+			goto free_ret;
+		}
+		atomic_set(&nohz_tracker_ptrs[i]->load_balancer, -1);
+		atomic_set(&nohz_tracker_ptrs[i]->first_pick_cpu, -1);
+		atomic_set(&nohz_tracker_ptrs[i]->second_pick_cpu, -1);
+	}
+
+	return 0;
+
+free_ret:
+	for (j = 0; j < i; j++) {
+		free_cpumask_var(nohz_tracker_ptrs[j]->tmp_nohz_mask);
+		free_cpumask_var(nohz_tracker_ptrs[j]->idle_cpus_mask);
+		kfree(nohz_tracker_ptrs[j]);
+	}
+
+	kfree(nohz_tracker_ptrs);
+
+	for_each_online_cpu(i)
+		per_cpu(cpu_node_nohz_ptr, i) = NULL;
+
+	nohz_tracker_ptrs = NULL;
+	return -ENOMEM;
+}
+
+static int get_nohz_load_balancer_node(struct nohz_tracker *node_nohz)
+{
+	if (!node_nohz)
+		return -1;
+
+	return atomic_read(&node_nohz->load_balancer);
+}
+
 int get_nohz_load_balancer(void)
 {
-	return atomic_read(&nohz.load_balancer);
+	int cpu = smp_processor_id();
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, cpu);
+	return get_nohz_load_balancer_node(node_nohz);
 }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -4591,6 +4659,7 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 /**
  * is_semi_idle_group - Checks if the given sched_group is semi-idle.
  * @ilb_group:	group to be checked for semi-idleness
+ * @node_nohz:	nohz_tracker for the node
  *
  * Returns:	1 if the group is semi-idle. 0 otherwise.
  *
@@ -4598,26 +4667,30 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
  * and atleast one non-idle CPU. This helper function checks if the given
  * sched_group is semi-idle or not.
  */
-static inline int is_semi_idle_group(struct sched_group *ilb_group)
+static inline int is_semi_idle_group(struct sched_group *ilb_group,
+				struct nohz_tracker *node_nohz)
 {
-	cpumask_and(nohz.tmp_nohz_mask, nohz.idle_cpus_mask,
+	cpumask_and(node_nohz->tmp_nohz_mask, node_nohz->idle_cpus_mask,
 					sched_group_cpus(ilb_group));
 
 	/*
 	 * A sched_group is semi-idle when it has atleast one busy cpu
 	 * and atleast one idle cpu.
 	 */
-	if (cpumask_empty(nohz.tmp_nohz_mask))
+	if (cpumask_empty(node_nohz->tmp_nohz_mask))
 		return 0;
 
-	if (cpumask_equal(nohz.tmp_nohz_mask, sched_group_cpus(ilb_group)))
+	if (cpumask_equal(node_nohz->tmp_nohz_mask,
+			sched_group_cpus(ilb_group))) {
 		return 0;
+	}
 
 	return 1;
 }
 /**
  * find_new_ilb - Finds the optimum idle load balancer for nomination.
  * @cpu:	The cpu which is nominating a new idle_load_balancer.
+ * @node_nohz:	nohz_tracker for the node
  *
  * Returns:	Returns the id of the idle load balancer if it exists,
  *		Else, returns >= nr_cpu_ids.
@@ -4627,7 +4700,7 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group)
  * completely idle packages/cores just for the purpose of idle load balancing
  * when there are other idle cpu's which are better suited for that job.
  */
-static int find_new_ilb(int cpu)
+static int find_new_ilb(int cpu, struct nohz_tracker *node_nohz)
 {
 	struct sched_domain *sd;
 	struct sched_group *ilb_group;
@@ -4643,15 +4716,15 @@ static int find_new_ilb(int cpu)
 	 * Optimize for the case when we have no idle CPUs or only one
 	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
 	 */
-	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
+	if (cpumask_weight(node_nohz->idle_cpus_mask) < 2)
 		goto out_done;
 
 	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
 		ilb_group = sd->groups;
 
 		do {
-			if (is_semi_idle_group(ilb_group))
-				return cpumask_first(nohz.tmp_nohz_mask);
+			if (is_semi_idle_group(ilb_group, node_nohz))
+				return cpumask_first(node_nohz->tmp_nohz_mask);
 
 			ilb_group = ilb_group->next;
 
@@ -4662,7 +4735,8 @@ out_done:
 	return nr_cpu_ids;
 }
 #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
-static inline int find_new_ilb(int call_cpu)
+static inline int find_new_ilb(int call_cpu,
+				struct nohz_tracker *node_nohz)
 {
 	return nr_cpu_ids;
 }
@@ -4676,12 +4750,16 @@ static inline int find_new_ilb(int call_cpu)
 static void nohz_balancer_kick(int cpu)
 {
 	int ilb_cpu;
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, cpu);
+
+	if (unlikely(!node_nohz))
+		return;
 
-	nohz.next_balance++;
+	node_nohz->next_balance++;
 
-	ilb_cpu = get_nohz_load_balancer();
+	ilb_cpu = get_nohz_load_balancer_node(node_nohz);
 	if (ilb_cpu < 0) {
-		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+		ilb_cpu = cpumask_first(node_nohz->idle_cpus_mask);
 		if (ilb_cpu >= nr_cpu_ids)
 			return;
 	}
@@ -4709,51 +4787,55 @@ static void nohz_balancer_kick(int cpu)
 void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, cpu);
+
+	if (unlikely(!node_nohz))
+		return;
 
 	if (stop_tick) {
 		if (!cpu_active(cpu)) {
-			if (atomic_read(&nohz.load_balancer) != cpu)
+			if (atomic_read(&node_nohz->load_balancer) != cpu)
 				return;
 
 			/*
 			 * If we are going offline and still the leader,
 			 * give up!
 			 */
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&node_nohz->load_balancer, cpu, -1) != cpu)
 				BUG();
 
 			return;
 		}
 
-		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-		atomic_cmpxchg(&nohz.first_pick_cpu, cpu, -1);
-		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
+		cpumask_set_cpu(cpu, node_nohz->idle_cpus_mask);
+		atomic_cmpxchg(&node_nohz->first_pick_cpu, cpu, -1);
+		atomic_cmpxchg(&node_nohz->second_pick_cpu, cpu, -1);
 
-		if (atomic_read(&nohz.load_balancer) == -1) {
+		if (atomic_read(&node_nohz->load_balancer) == -1) {
 			int new_ilb;
 
 			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1)
+			if (atomic_cmpxchg(&node_nohz->load_balancer, -1, cpu) != -1)
 				return;
 
 			/*
 			 * Check to see if there is a more power-efficient
 			 * ilb.
 			 */
-			new_ilb = find_new_ilb(cpu);
+			new_ilb = find_new_ilb(cpu, node_nohz);
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
-				atomic_set(&nohz.load_balancer, -1);
+				atomic_set(&node_nohz->load_balancer, -1);
 				resched_cpu(new_ilb);
 			}
 		}
 	} else {
-		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
+		if (!cpumask_test_cpu(cpu, node_nohz->idle_cpus_mask))
 			return;
 
-		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		cpumask_clear_cpu(cpu, node_nohz->idle_cpus_mask);
 
-		if (atomic_read(&nohz.load_balancer) == cpu)
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+		if (atomic_read(&node_nohz->load_balancer) == cpu)
+			if (atomic_cmpxchg(&node_nohz->load_balancer, cpu, -1) != cpu)
 				BUG();
 	}
 }
@@ -4857,8 +4939,13 @@ static void run_rebalance_domains(struct softirq_action *h)
  */
 static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
 {
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, this_cpu);
+
 	rebalance_domains(this_cpu, CPU_IDLE);
 
+	if (unlikely(!node_nohz))
+		return;
+
 	/*
 	 * If this cpu is the owner for idle load balancing, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
@@ -4868,7 +4955,7 @@ static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
 		struct rq *rq;
 		int balance_cpu;
 
-		for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+		for_each_cpu(balance_cpu, node_nohz->idle_cpus_mask) {
 			if (balance_cpu == this_cpu)
 				continue;
 
@@ -4886,7 +4973,7 @@ static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
 			if (time_after(this_rq->next_balance, rq->next_balance))
 				this_rq->next_balance = rq->next_balance;
 		}
-		nohz.next_balance = this_rq->next_balance;
+		node_nohz->next_balance = this_rq->next_balance;
 		this_rq->nohz_balance_kick = 0;
 	}
 }
@@ -4912,20 +4999,24 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 {
 	unsigned long now = jiffies;
 	int ret;
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, cpu);
 
-	if (time_before(now, nohz.next_balance))
+	if (unlikely(!node_nohz))
+		return 0;
+
+	if (time_before(now, node_nohz->next_balance))
 		return 0;
 
 	if (!rq->nr_running)
 		return 0;
 
-	ret = atomic_cmpxchg(&nohz.first_pick_cpu, -1, cpu);
+	ret = atomic_cmpxchg(&node_nohz->first_pick_cpu, -1, cpu);
 	if (ret == -1 || ret == cpu) {
-		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
+		atomic_cmpxchg(&node_nohz->second_pick_cpu, cpu, -1);
 		if (rq->nr_running > 1)
 			return 1;
 	} else {
-		ret = atomic_cmpxchg(&nohz.second_pick_cpu, -1, cpu);
+		ret = atomic_cmpxchg(&node_nohz->second_pick_cpu, -1, cpu);
 		if (ret == -1 || ret == cpu) {
 			if (rq->nr_running)
 				return 1;
@@ -8878,6 +8969,14 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 			goto error;
 #endif
 
+	if (alloc_node_nohz_tracker())
+		goto error;
+
+	for_each_cpu(i, cpu_map) {
+		per_cpu(cpu_node_nohz_ptr, i) =
+					nohz_tracker_ptrs[cpu_to_node(i)];
+	}
+
 	/* Calculate CPU power for physical packages and nodes */
 #ifdef CONFIG_SCHED_SMT
 	for_each_cpu(i, cpu_map) {
@@ -9625,10 +9724,6 @@ void __init sched_init(void)
 	/* Allocate the nohz_cpu_mask if CONFIG_CPUMASK_OFFSTACK */
 	zalloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
 #ifdef CONFIG_SMP
-#ifdef CONFIG_NO_HZ
-	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
-	alloc_cpumask_var(&nohz.tmp_nohz_mask, GFP_NOWAIT);
-#endif
 	zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
 #endif /* SMP */
 
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [patch 1/2] sched: Change the nohz ilb logic from pull to push model
  2009-12-11  1:27 ` [patch 1/2] sched: Change the " venkatesh.pallipadi
@ 2009-12-14 22:18   ` Peter Zijlstra
  2009-12-21 12:13   ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-12-14 22:18 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel, Suresh Siddha

On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@intel.com wrote:
> +/*
> + * idle load balancing details
> + * - One of the idle CPUs nominates itself as idle load_balancer, while
> + *   entering idle.
> + * - With previous logic, this idle load balancer CPU will not go into
> + *   tickless mode when it is idle and does the idle load balancing for
> + *   all the idle CPUs.
> + * - With new logic, this idle load balancer CPU will also go into
> + *   tickless mode when it is idle, just like all other idle CPUs
> + * - When one of the busy CPUs notice that there may be an idle rebalancing
> + *   needed, they will kick the idle load balancer, which then does idle
> + *   load balancing for all the idle CPUs.
> + * - As idle load balancing looks at the load of all the CPUs, not all busy
> + *   CPUs need to do this idle load balancer kick.
> + * - first_pick_cpu is the one of the busy CPUs which will kick
> + *   idle load balancer when it has more than one process active. This
> + *   eliminates the need for idle load balancing altogether when we have
> + *   only one running process in the system (common case).
> + * - If there are more than one busy CPU, idle load balancer may have
> + *   to run for active_load_balance to happen (i.e., two busy CPUs are
> + *   SMT or core siblings and can run better if they move to different
> + *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
> + *   which will kick idle load balancer as soon as it has any load.
> + * - With previous logic, idle load balancer used to run at every tick.
> + *   With new logic, idle load balancer tracks the rq->next_balance for all
> + *   the idle CPUs and does idle load balancing only when needed.
> + */ 

This isn't a very helpful comment for someone trying to understand the
current code who has never seen the previous bits.

Only describe the new behaviour; there is no point in describing
something that is no longer relevant.



* Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-12-11  1:27 ` [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
@ 2009-12-14 22:21   ` Peter Zijlstra
  2009-12-14 22:32     ` Pallipadi, Venkatesh
  2009-12-21 13:11   ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2009-12-14 22:21 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel, Suresh Siddha

On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@intel.com wrote:
> Having one idle CPU doing the rebalancing for all the idle CPUs in
> nohz mode does not scale well with increasing number of cores and
> sockets. Make the nohz_tracker per NUMA node. This results in multiple
> idle load balancing happening at NUMA node level and idle load balancer
> only does the rebalance domain among all the other nohz CPUs in that
> NUMA node.
> 
> This addresses the below problem with the current nohz ilb logic
> * The lone balancer may end up spending a lot of time doing the
> * balancing on
>   behalf of nohz CPUs, especially with increasing number of sockets and
>   cores in the platform.

If the purpose is to keep sockets idle, doing things per node doesn't
seem like a fine plan, since we're having nodes <= socket machines these
days.





* Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-12-14 22:21   ` Peter Zijlstra
@ 2009-12-14 22:32     ` Pallipadi, Venkatesh
  2009-12-14 22:58       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Pallipadi, Venkatesh @ 2009-12-14 22:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel@vger.kernel.org,
	Siddha, Suresh B

On Mon, 2009-12-14 at 14:21 -0800, Peter Zijlstra wrote:
> On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@intel.com wrote:
> > Having one idle CPU doing the rebalancing for all the idle CPUs in
> > nohz mode does not scale well with increasing number of cores and
> > sockets. Make the nohz_tracker per NUMA node. This results in multiple
> > idle load balancing happening at NUMA node level and idle load balancer
> > only does the rebalance domain among all the other nohz CPUs in that
> > NUMA node.
> > 
> > This addresses the below problem with the current nohz ilb logic
> > * The lone balancer may end up spending a lot of time doing the
> > * balancing on
> >   behalf of nohz CPUs, especially with increasing number of sockets and
> >   cores in the platform.
> 
> If the purpose is to keep sockets idle, doing things per node doesn't
> seem like a fine plan, since we're having nodes <= socket machines these
> days.

The idea is to do idle balance only within the nodes.
Eg: 4 node (and 4 socket) system with each socket having 4 cores.
If there is a single active thread on such a system, say on socket 3.
Without this change we have 1 idle load balancer (which may be in socket
0) which has periodic ticks and remaining 14 cores will be tickless.
But this one idle load balancer does load balance on behalf of itself +
14 other idle cores.

With the change proposed in this patch, we will have 3 completely idle
nodes/sockets. We will not do load balance on these cores at all.
Remaining one active socket will have one idle load balancer, which when
needed will do idle load balancing on behalf of itself + 2 other idle
cores in that socket.

If all sockets have at least one busy core, then we may have more
than one idle load balancer, but each will only do idle load balance on
behalf of idle processors in its own node, so total idle load balance
will be same as now.

Thanks,
Venki



* Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-12-14 22:32     ` Pallipadi, Venkatesh
@ 2009-12-14 22:58       ` Peter Zijlstra
  2009-12-15  1:00         ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2009-12-14 22:58 UTC (permalink / raw)
  To: Pallipadi, Venkatesh
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel@vger.kernel.org,
	Siddha, Suresh B

On Mon, 2009-12-14 at 14:32 -0800, Pallipadi, Venkatesh wrote:
> On Mon, 2009-12-14 at 14:21 -0800, Peter Zijlstra wrote:
> > On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@intel.com wrote:
> > > Having one idle CPU doing the rebalancing for all the idle CPUs in
> > > nohz mode does not scale well with increasing number of cores and
> > > sockets. Make the nohz_tracker per NUMA node. This results in multiple
> > > idle load balancing happening at NUMA node level and idle load balancer
> > > only does the rebalance domain among all the other nohz CPUs in that
> > > NUMA node.
> > > 
> > > This addresses the below problem with the current nohz ilb logic
> > > * The lone balancer may end up spending a lot of time doing the
> > > * balancing on
> > >   behalf of nohz CPUs, especially with increasing number of sockets and
> > >   cores in the platform.
> > 
> > If the purpose is to keep sockets idle, doing things per node doesn't
> > seem like a fine plan, since we're having nodes <= socket machines these
> > days.
> 
> The idea is to do idle balance only within the nodes.
> Eg: 4 node (and 4 socket) system with each socket having 4 cores.
> If there is a single active thread on such a system, say on socket 3.
> Without this change we have 1 idle load balancer (which may be in socket
> 0) which has periodic ticks and remaining 14 cores will be tickless.
> But this one idle load balancer does load balance on behalf of itself +
> 14 other idle cores.
> 
> With the change proposed in this patch, we will have 3 completely idle
> nodes/sockets. We will not do load balance on these cores at all.

That seems like a behavioural change, not balancing these 3 nodes at all
could lead to overload scenarios on the one active node, right?

> Remaining one active socket will have one idle load balancer, which when
> needed will do idle load balancing on behalf of itself + 2 other idle
> cores in that socket.

> If all sockets have at least one busy core, then we may have more
> than one idle load balancer, but each will only do idle load balance on
> behalf of idle processors in its own node, so total idle load balance
> will be same as now.

How about things like Magny-Cours which will have multiple nodes per
socket, wouldn't that be best served by having the total socket idle,
instead of just half of it?



* Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-12-14 22:58       ` Peter Zijlstra
@ 2009-12-15  1:00         ` Pallipadi, Venkatesh
  2009-12-15 10:21           ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Pallipadi, Venkatesh @ 2009-12-15  1:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel@vger.kernel.org,
	Siddha, Suresh B

On Mon, 2009-12-14 at 14:58 -0800, Peter Zijlstra wrote:
> On Mon, 2009-12-14 at 14:32 -0800, Pallipadi, Venkatesh wrote:
> > 
> > The idea is to do idle balance only within the nodes.
> > Eg: 4 node (and 4 socket) system with each socket having 4 cores.
> > If there is a single active thread on such a system, say on socket 3.
> > Without this change we have 1 idle load balancer (which may be in socket
> > 0) which has periodic ticks and remaining 14 cores will be tickless.
> > But this one idle load balancer does load balance on behalf of itself +
> > 14 other idle cores.
> > 
> > With the change proposed in this patch, we will have 3 completely idle
> > nodes/sockets. We will not do load balance on these cores at all.
> 
> That seems like a behavioural change, not balancing these 3 nodes at all
> could lead to overload scenarios on the one active node, right?
> 

Yes. You are right. This can result in some node-level imbalance. The
main problem that we were trying to solve is the over-aggressive attempt
to load balance idle CPUs. On a system with 64 logical CPUs with only
one active thread, we have seen one other CPU (the idle load balancer)
spending 3-5% of its time being non-idle, continuously trying to do load
balancing on behalf of 63 idle CPUs: attempting an idle rebalance every
jiffy across all nodes, even though cross-node balancing has an interval
of 8 or 16 jiffies. There are other forms of rebalancing, like fork and
exec, that will still balance across nodes. But if there are no
forks/execs, we will have the overload scenario you pointed out.

I guess we need to look at other alternatives to make this cross-node
idle load balancing more intelligent. However, the first patch in this
series has its share of advantages in avoiding unneeded idle balancing,
and with it, cross-node issues will be no worse than the current state.
So it is worthwhile as a standalone change as well.

> > Remaining one active socket will have one idle load balancer, which when
> > needed will do idle load balancing on behalf of itself + 2 other idle
> > cores in that socket.
> 
> > If all sockets have at least one busy core, then we may have more
> > than one idle load balancer, but each will only do idle load balance on
> > behalf of idle processors in its own node, so total idle load balance
> > will be same as now.
> 
> How about things like Magny-Cours which will have multiple nodes per
> socket, wouldn't that be best served by having the total socket idle,
> instead of just half of it?
> 

Yes. But that will be the same for general load balancing behavior, not
just idle load balancing. That would probably need another level in the
scheduler domain hierarchy?

Thanks,
Venki



* Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-12-15  1:00         ` Pallipadi, Venkatesh
@ 2009-12-15 10:21           ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-12-15 10:21 UTC (permalink / raw)
  To: Pallipadi, Venkatesh
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel@vger.kernel.org,
	Siddha, Suresh B, Andreas Herrmann

On Mon, 2009-12-14 at 17:00 -0800, Pallipadi, Venkatesh wrote:
> On Mon, 2009-12-14 at 14:58 -0800, Peter Zijlstra wrote:
> > On Mon, 2009-12-14 at 14:32 -0800, Pallipadi, Venkatesh wrote:
> > > 
> > > The idea is to do idle balance only within the nodes.
> > > Eg: 4 node (and 4 socket) system with each socket having 4 cores.
> > > If there is a single active thread on such a system, say on socket 3.
> > > Without this change we have 1 idle load balancer (which may be in socket
> > > 0) which has periodic ticks and remaining 14 cores will be tickless.
> > > But this one idle load balancer does load balance on behalf of itself +
> > > 14 other idle cores.
> > > 
> > > With the change proposed in this patch, we will have 3 completely idle
> > > nodes/sockets. We will not do load balance on these cores at all.
> > 
> > That seems like a behavioural change, not balancing these 3 nodes at all
> > could lead to overload scenarios on the one active node, right?
> > 
> 
> Yes. You are right. This can result in some node-level imbalance. The
> main problem that we were trying to solve is the over-aggressive attempt
> to load balance idle CPUs. On a system with 64 logical CPUs with only
> one active thread, we have seen one other CPU (the idle load balancer)
> spending 3-5% of its time being non-idle, continuously trying to do load
> balancing on behalf of 63 idle CPUs: attempting an idle rebalance every
> jiffy across all nodes, even though cross-node balancing has an interval
> of 8 or 16 jiffies. There are other forms of rebalancing, like fork and
> exec, that will still balance across nodes. But if there are no
> forks/execs, we will have the overload scenario you pointed out.
> 
> I guess we need to look at other alternatives to make this cross-node
> idle load balancing more intelligent. However, the first patch in this
> series has its share of advantages in avoiding unneeded idle balancing,
> and with it, cross-node issues will be no worse than the current state.
> So it is worthwhile as a standalone change as well.

OK, I'll actually have a look at the patch now that I understand what
we're trying to do here ;-)

Thanks!

> > > Remaining one active socket will have one idle load balancer, which when
> > > needed will do idle load balancing on behalf of itself + 2 other idle
> > > cores in that socket.
> > 
> > > If all sockets have at least one busy core, then we may have more
> > > than one idle load balancer, but each will only do idle load balance on
> > > behalf of idle processors in its own node, so total idle load balance
> > > will be same as now.
> > 
> > How about things like Magny-Cours which will have multiple nodes per
> > socket, wouldn't that be best served by having the total socket idle,
> > instead of just half of it?
> > 
> 
> Yes. But that will be the same for general load balancing behavior, not
> just idle load balancing. That would probably need another level in the
> scheduler domain hierarchy?

Right, Andreas was supposed to look at doing that, not sure if he ever
got around to it though.


* Re: [patch 1/2] sched: Change the nohz ilb logic from pull to push model
  2009-12-11  1:27 ` [patch 1/2] sched: Change the " venkatesh.pallipadi
  2009-12-14 22:18   ` Peter Zijlstra
@ 2009-12-21 12:13   ` Peter Zijlstra
  2009-12-21 13:00     ` Peter Zijlstra
  2009-12-23  0:15     ` Pallipadi, Venkatesh
  1 sibling, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-12-21 12:13 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel, Suresh Siddha

On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@intel.com wrote:

> @@ -4507,12 +4507,45 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
>  }
>  
>  #ifdef CONFIG_NO_HZ
> +
> +/*
> + * idle load balancing details
> + * - One of the idle CPUs nominates itself as idle load_balancer, while
> + *   entering idle.
> + * - With previous logic, this idle load balancer CPU will not go into
> + *   tickless mode when it is idle and does the idle load balancing for
> + *   all the idle CPUs.
> + * - With new logic, this idle load balancer CPU will also go into
> + *   tickless mode when it is idle, just like all other idle CPUs
> + * - When one of the busy CPUs notice that there may be an idle rebalancing
> + *   needed, they will kick the idle load balancer, which then does idle
> + *   load balancing for all the idle CPUs.
> + * - As idle load balancing looks at the load of all the CPUs, not all busy
> + *   CPUs need to do this idle load balancer kick.
> + * - first_pick_cpu is the one of the busy CPUs which will kick
> + *   idle load balancer when it has more than one process active. This
> + *   eliminates the need for idle load balancing altogether when we have
> + *   only one running process in the system (common case).
> + * - If there are more than one busy CPU, idle load balancer may have
> + *   to run for active_load_balance to happen (i.e., two busy CPUs are
> + *   SMT or core siblings and can run better if they move to different
> + *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
> + *   which will kick idle load balancer as soon as it has any load.
> + * - With previous logic, idle load balancer used to run at every tick.
> + *   With new logic, idle load balancer tracks the rq->next_balance for all
> + *   the idle CPUs and does idle load balancing only when needed.
> + */

Right so like said before, this comments needs a rewrite.

>  static struct {
>  	atomic_t load_balancer;
> -	cpumask_var_t cpu_mask;
> -	cpumask_var_t ilb_grp_nohz_mask;
> +	atomic_t first_pick_cpu;
> +	atomic_t second_pick_cpu;
> +	cpumask_var_t idle_cpus_mask;
> +	cpumask_var_t tmp_nohz_mask;

I don't mind the rename, but tmp_nohz_mask is a really bad name.

> +	unsigned long next_balance;	/* in jiffy units */
>  } nohz ____cacheline_aligned = {
>  	.load_balancer = ATOMIC_INIT(-1),
> +	.first_pick_cpu = ATOMIC_INIT(-1),
> +	.second_pick_cpu = ATOMIC_INIT(-1),
>  };
>  
>  int get_nohz_load_balancer(void)

>  /*
> + * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
> + * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
> + * CPU (if there is one).
> +*/
> +static void nohz_balancer_kick(int cpu)
> +{
> +	int ilb_cpu;
> +
> +	nohz.next_balance++;
> +
> +	ilb_cpu = get_nohz_load_balancer();
> +	if (ilb_cpu < 0) {
> +		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
> +		if (ilb_cpu >= nr_cpu_ids)
> +			return;
> +	}
> +
> +	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
> +		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
> +		resched_cpu(ilb_cpu);
> +	}
> +	return;
> +}

So here you simply send a resched IPI, which requires the below hack in
schedule()?


> @@ -4673,28 +4722,20 @@ int select_nohz_load_balancer(int stop_tick)
>  			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
>  				BUG();
>  
> +			return;
>  		}
>  
> +		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> +		atomic_cmpxchg(&nohz.first_pick_cpu, cpu, -1);
> +		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);

If you were to use nr_cpu_ids here instead of -1, you get more
consistent code in nohz_balancer_kick().


> +	ret = atomic_cmpxchg(&nohz.first_pick_cpu, -1, cpu);
> +	if (ret == -1 || ret == cpu) {
> +		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
> +		if (rq->nr_running > 1)
> +			return 1;
> +	} else {
> +		ret = atomic_cmpxchg(&nohz.second_pick_cpu, -1, cpu);
> +		if (ret == -1 || ret == cpu) {
> +			if (rq->nr_running)
> +				return 1;
>  		}
>  	}

Looked very funny, and took a while to understand why you're doing that,
but yeah, I can't see a better way of doing it either.

The comments confused me more than helped me understand it.

> @@ -5446,8 +5490,19 @@ need_resched_nonpreemptible:
>  
>  	pre_schedule(rq, prev);
>  
> -	if (unlikely(!rq->nr_running))
> +	if (unlikely(!rq->nr_running)) {
> +#ifdef CONFIG_NO_HZ
> +		if (rq->nohz_balance_kick) {
> +			spin_unlock_irq(&rq->lock);
> +			nohz_idle_balance(cpu, rq);
> +			spin_lock_irq(&rq->lock);
> +		} else {
> +			idle_balance(cpu, rq);
> +		}
> +#else
>  		idle_balance(cpu, rq);
> +#endif
> +	}

And I think this is the wrong kind of trade-off, complicating the
schedule()/newidle path for nohz idle balancing.

nohz_balancer_kick() seems like the perfect place to use something like
send_remote_softirq().



* Re: [patch 1/2] sched: Change the nohz ilb logic from pull to push model
  2009-12-21 12:13   ` Peter Zijlstra
@ 2009-12-21 13:00     ` Peter Zijlstra
  2009-12-23  0:15     ` Pallipadi, Venkatesh
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-12-21 13:00 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel, Suresh Siddha

On Mon, 2009-12-21 at 13:13 +0100, Peter Zijlstra wrote:
> 
> > +     ret = atomic_cmpxchg(&nohz.first_pick_cpu, -1, cpu);
> > +     if (ret == -1 || ret == cpu) {
> > +             atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
> > +             if (rq->nr_running > 1)
> > +                     return 1;
> > +     } else {
> > +             ret = atomic_cmpxchg(&nohz.second_pick_cpu, -1, cpu);
> > +             if (ret == -1 || ret == cpu) {
> > +                     if (rq->nr_running)
> > +                             return 1;
> >               }
> >       }
> 
> Looked very funny, and took a while to understand why you're doing that,
> but yeah, I can't see a better way of doing it either. 

That is, the sanest way to write that is to do something like:

 weight(~nohz & online) == 1 && nr_running == 1

except that with the recent cpumask blowout that's a very expensive op.



* Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-12-11  1:27 ` [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
  2009-12-14 22:21   ` Peter Zijlstra
@ 2009-12-21 13:11   ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-12-21 13:11 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel, Suresh Siddha

On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@intel.com wrote:
> plain text document attachment
> (0002-sched-Scale-the-nohz_tracker-logic-by-making-it-per.patch)
> Having one idle CPU doing the rebalancing for all the idle CPUs in
> nohz mode does not scale well with increasing number of cores and
> sockets. Make the nohz_tracker per NUMA node. This results in multiple
> idle load balancing happening at NUMA node level and idle load balancer
> only does the rebalance domain among all the other nohz CPUs in that
> NUMA node.
> 
> This addresses the below problem with the current nohz ilb logic
> * The lone balancer may end up spending a lot of time doing the
> * balancing on
>   behalf of nohz CPUs, especially with increasing number of sockets and
>   cores in the platform.

Right, so I think the whole NODE idea here is wrong; it all seems to
work out properly if you simply pick a sched domain that is larger than
the one containing the current socket and that contains an idle unit.

Except that the sched domain stuff is not properly aware of bigger
topology things atm.

The sched domain tree should not view node as the largest structure and
we should remove that current random node split crap we have.

Instead the sched domains should continue to express the topology, like
nodes within 1 hop, nodes within 2 hops, etc.

Then this nohz idle balancing should pick the socket level (which might
be larger than the node level), and walks up the domain tree, until we
reach a level where it has a whole idle group.

This means that we'll always span at least 2 sockets, which means we'll
gracefully deal with the overload scenario.




* Re: [patch 1/2] sched: Change the nohz ilb logic from pull to push model
  2009-12-21 12:13   ` Peter Zijlstra
  2009-12-21 13:00     ` Peter Zijlstra
@ 2009-12-23  0:15     ` Pallipadi, Venkatesh
  1 sibling, 0 replies; 13+ messages in thread
From: Pallipadi, Venkatesh @ 2009-12-23  0:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, linux-kernel@vger.kernel.org,
	Siddha, Suresh B

On Mon, 2009-12-21 at 04:13 -0800, Peter Zijlstra wrote:
> On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@intel.com wrote:
> 
> > @@ -4507,12 +4507,45 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
> >  }
> >  
> >  #ifdef CONFIG_NO_HZ
> > +
> > +/*
> > + * idle load balancing details
> > + * - One of the idle CPUs nominates itself as idle load_balancer, while
> > + *   entering idle.
> > + * - With previous logic, this idle load balancer CPU will not go into
> > + *   tickless mode when it is idle and does the idle load balancing for
> > + *   all the idle CPUs.
> > + * - With new logic, this idle load balancer CPU will also go into
> > + *   tickless mode when it is idle, just like all other idle CPUs
> > + * - When one of the busy CPUs notice that there may be an idle rebalancing
> > + *   needed, they will kick the idle load balancer, which then does idle
> > + *   load balancing for all the idle CPUs.
> > + * - As idle load balancing looks at the load of all the CPUs, not all busy
> > + *   CPUs need to do this idle load balancer kick.
> > + * - first_pick_cpu is the one of the busy CPUs which will kick
> > + *   idle load balancer when it has more than one process active. This
> > + *   eliminates the need for idle load balancing altogether when we have
> > + *   only one running process in the system (common case).
> > + * - If there are more than one busy CPU, idle load balancer may have
> > + *   to run for active_load_balance to happen (i.e., two busy CPUs are
> > + *   SMT or core siblings and can run better if they move to different
> > + *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
> > + *   which will kick idle load balancer as soon as it has any load.
> > + * - With previous logic, idle load balancer used to run at every tick.
> > + *   With new logic, idle load balancer tracks the rq->next_balance for all
> > + *   the idle CPUs and does idle load balancing only when needed.
> > + */
> 
> Right so like said before, this comments needs a rewrite.

Agreed. Will change this with patch refresh.

> 
> >  static struct {
> >  	atomic_t load_balancer;
> > -	cpumask_var_t cpu_mask;
> > -	cpumask_var_t ilb_grp_nohz_mask;
> > +	atomic_t first_pick_cpu;
> > +	atomic_t second_pick_cpu;
> > +	cpumask_var_t idle_cpus_mask;
> > +	cpumask_var_t tmp_nohz_mask;
> 
> I don't mind the rename, but tmp_nohz_mask is a really bad name.
> 
> > +	unsigned long next_balance;	/* in jiffy units */
> >  } nohz ____cacheline_aligned = {
> >  	.load_balancer = ATOMIC_INIT(-1),
> > +	.first_pick_cpu = ATOMIC_INIT(-1),
> > +	.second_pick_cpu = ATOMIC_INIT(-1),
> >  };
> >  
> >  int get_nohz_load_balancer(void)
> 
> >  /*
> > + * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
> > + * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
> > + * CPU (if there is one).
> > +*/
> > +static void nohz_balancer_kick(int cpu)
> > +{
> > +	int ilb_cpu;
> > +
> > +	nohz.next_balance++;
> > +
> > +	ilb_cpu = get_nohz_load_balancer();
> > +	if (ilb_cpu < 0) {
> > +		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
> > +		if (ilb_cpu >= nr_cpu_ids)
> > +			return;
> > +	}
> > +
> > +	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
> > +		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
> > +		resched_cpu(ilb_cpu);
> > +	}
> > +	return;
> > +}
> 
> So here you simply send a resched IPI, which requires the below hack in
> schedule()?
> 
> 
> > @@ -4673,28 +4722,20 @@ int select_nohz_load_balancer(int stop_tick)
> >  			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
> >  				BUG();
> >  
> > +			return;
> >  		}
> >  
> > +		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > +		atomic_cmpxchg(&nohz.first_pick_cpu, cpu, -1);
> > +		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
> 
> If you were to use nr_cpu_ids here instead of -1, you get more
> consistent code in nohz_balancer_kick().

Yes. Will change.

> 
> 
> > +	ret = atomic_cmpxchg(&nohz.first_pick_cpu, -1, cpu);
> > +	if (ret == -1 || ret == cpu) {
> > +		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, -1);
> > +		if (rq->nr_running > 1)
> > +			return 1;
> > +	} else {
> > +		ret = atomic_cmpxchg(&nohz.second_pick_cpu, -1, cpu);
> > +		if (ret == -1 || ret == cpu) {
> > +			if (rq->nr_running)
> > +				return 1;
> >  		}
> >  	}
> 
> Looked very funny, and took a while to understand why you're doing that,
> but yeah, I can't see a better way of doing it either.
> 
> The comments confused me more than helped me understand it.

This is the least expensive way I could think of without dealing with
cpumasks. I knew this was not very clean. That's the reason I had it in
a separate function, so that we can change it locally if we find a
better way to deal with it.

> 
> > @@ -5446,8 +5490,19 @@ need_resched_nonpreemptible:
> >  
> >  	pre_schedule(rq, prev);
> >  
> > -	if (unlikely(!rq->nr_running))
> > +	if (unlikely(!rq->nr_running)) {
> > +#ifdef CONFIG_NO_HZ
> > +		if (rq->nohz_balance_kick) {
> > +			spin_unlock_irq(&rq->lock);
> > +			nohz_idle_balance(cpu, rq);
> > +			spin_lock_irq(&rq->lock);
> > +		} else {
> > +			idle_balance(cpu, rq);
> > +		}
> > +#else
> >  		idle_balance(cpu, rq);
> > +#endif
> > +	}
> 
> And I think this is the wrong kind of trade-off, complicating the
> schedule()/newidle path for nohz idle balancing.
> 
> nohz_balancer_kick() seems like the perfect place to use something like
> send_remote_softirq().

Hmmm. I didn't know send_remote_softirq existed in mainline. I agree
that doing this outside the common path will be better. Let me try using
send_remote_softirq and follow up on this.

Thanks,
Venki




Thread overview: 13+ messages
2009-12-11  1:27 [patch 0/2] sched: Change nohz ilb logic from pull to push model venkatesh.pallipadi
2009-12-11  1:27 ` [patch 1/2] sched: Change the " venkatesh.pallipadi
2009-12-14 22:18   ` Peter Zijlstra
2009-12-21 12:13   ` Peter Zijlstra
2009-12-21 13:00     ` Peter Zijlstra
2009-12-23  0:15     ` Pallipadi, Venkatesh
2009-12-11  1:27 ` [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
2009-12-14 22:21   ` Peter Zijlstra
2009-12-14 22:32     ` Pallipadi, Venkatesh
2009-12-14 22:58       ` Peter Zijlstra
2009-12-15  1:00         ` Pallipadi, Venkatesh
2009-12-15 10:21           ` Peter Zijlstra
2009-12-21 13:11   ` Peter Zijlstra
