* [PATCH 1/5] Disable interrupts for locking in load_balance()
2006-10-24 18:31 [PATCH 0/5] On demand sched_domain balancing in tasklet Christoph Lameter
@ 2006-10-24 18:31 ` Christoph Lameter
2006-10-24 18:31 ` [PATCH 2/5] Extract load calculation from rebalance_tick Christoph Lameter
` (3 subsequent siblings)
4 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-10-24 18:31 UTC (permalink / raw)
To: akpm
Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
KAMEZAWA Hiroyuki, Dave Chinner, Ingo Molnar, Siddha, Suresh B
scheduler: Disable interrupts for locking in load_balance()
Interrupts must be disabled when taking the runqueue locks if we want
to be able to call load_balance() with interrupts enabled.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.19-rc2-mm2/kernel/sched.c
===================================================================
--- linux-2.6.19-rc2-mm2.orig/kernel/sched.c 2006-10-23 19:35:19.615025838 -0500
+++ linux-2.6.19-rc2-mm2/kernel/sched.c 2006-10-23 19:36:26.208865512 -0500
@@ -2530,8 +2530,6 @@ static inline unsigned long minus_1_or_z
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
- *
- * Called with this_rq unlocked.
*/
static int load_balance(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum idle_type idle)
@@ -2541,6 +2539,7 @@ static int load_balance(int this_cpu, st
unsigned long imbalance;
struct rq *busiest;
cpumask_t cpus = CPU_MASK_ALL;
+ unsigned long flags;
/*
* When power savings policy is enabled for the parent domain, idle
@@ -2580,11 +2579,13 @@ redo:
* still unbalanced. nr_moved simply stays zero, so it is
* correctly treated as an imbalance.
*/
+ local_irq_save(flags);
double_rq_lock(this_rq, busiest);
nr_moved = move_tasks(this_rq, this_cpu, busiest,
minus_1_or_zero(busiest->nr_running),
imbalance, sd, idle, &all_pinned);
double_rq_unlock(this_rq, busiest);
+ local_irq_restore(flags);
/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(all_pinned)) {
@@ -2601,13 +2602,13 @@ redo:
if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
- spin_lock(&busiest->lock);
+ spin_lock_irqsave(&busiest->lock, flags);
/* don't kick the migration_thread, if the curr
* task on busiest cpu can't be moved to this_cpu
*/
if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
- spin_unlock(&busiest->lock);
+ spin_unlock_irqrestore(&busiest->lock, flags);
all_pinned = 1;
goto out_one_pinned;
}
@@ -2617,7 +2618,7 @@ redo:
busiest->push_cpu = this_cpu;
active_balance = 1;
}
- spin_unlock(&busiest->lock);
+ spin_unlock_irqrestore(&busiest->lock, flags);
if (active_balance)
wake_up_process(busiest->migration_thread);
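The locking pattern this patch introduces can be sketched in isolation. The following is a hypothetical standalone model, not the kernel primitives themselves (the real `local_irq_save()` is a macro operating on per-cpu interrupt state); it only illustrates the save/disable/restore discipline around the double-runqueue lock:

```c
#include <assert.h>

/* Toy model: a single flag stands in for the cpu's interrupt state. */
static int irqs_enabled = 1;

#define local_irq_save(flags)    ((flags) = (unsigned long)irqs_enabled, irqs_enabled = 0)
#define local_irq_restore(flags) (irqs_enabled = (int)(flags))

/* The pattern around the move_tasks() section: runqueue locks may also
 * be taken from interrupt context, so interrupts must be off while both
 * locks are held. */
int balanced_section(void)
{
	unsigned long flags;

	local_irq_save(flags);
	assert(!irqs_enabled);	/* safe to take double_rq_lock() here */
	/* double_rq_lock(); move_tasks(); double_rq_unlock(); */
	local_irq_restore(flags);
	return irqs_enabled;	/* interrupts enabled again */
}
```

The point of `local_irq_save()`/`local_irq_restore()` (rather than plain disable/enable) is that the section can be entered with interrupts in either state and leaves that state unchanged.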
* [PATCH 2/5] Extract load calculation from rebalance_tick
2006-10-24 18:31 [PATCH 0/5] On demand sched_domain balancing in tasklet Christoph Lameter
2006-10-24 18:31 ` [PATCH 1/5] Disable interrupts for locking in load_balance() Christoph Lameter
@ 2006-10-24 18:31 ` Christoph Lameter
2006-10-26 12:03 ` Nick Piggin
2006-10-24 18:31 ` [PATCH 3/5] Use next_balance instead of last_balance Christoph Lameter
` (2 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-10-24 18:31 UTC (permalink / raw)
To: akpm
Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
Siddha, Suresh B, Dave Chinner, Ingo Molnar, KAMEZAWA Hiroyuki
Extract load calculation from rebalance_tick
A load calculation is always done in rebalance_tick() in addition
to the real load balancing activities that only take place when certain
jiffy counts have been reached. Move that processing into a separate
function and call it directly from scheduler_tick().
Also extract the time slice handling from scheduler_tick() and
put it into a separate function. Then we can clean up scheduler_tick()
significantly; it will no longer have any gotos.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.19-rc2-mm2/kernel/sched.c
===================================================================
--- linux-2.6.19-rc2-mm2.orig/kernel/sched.c 2006-10-23 20:13:02.393266962 -0500
+++ linux-2.6.19-rc2-mm2/kernel/sched.c 2006-10-24 10:39:07.158552011 -0500
@@ -2817,27 +2817,10 @@ static void active_load_balance(struct r
spin_unlock(&target_rq->lock);
}
-/*
- * rebalance_tick will get called every timer tick, on every CPU.
- *
- * It checks each scheduling domain to see if it is due to be balanced,
- * and initiates a balancing operation if so.
- *
- * Balancing parameters are set up in arch_init_sched_domains.
- */
-
-/* Don't have all balancing operations going off at once: */
-static inline unsigned long cpu_offset(int cpu)
+static void update_load(struct rq *this_rq)
{
- return jiffies + cpu * HZ / NR_CPUS;
-}
-
-static void
-rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
-{
- unsigned long this_load, interval, j = cpu_offset(this_cpu);
- struct sched_domain *sd;
int i, scale;
+ unsigned long this_load;
this_load = this_rq->raw_weighted_load;
@@ -2856,6 +2839,28 @@ rebalance_tick(int this_cpu, struct rq *
new_load += scale-1;
this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) / scale;
}
+}
+
+/*
+ * rebalance_tick will get called every timer tick, on every CPU.
+ *
+ * It checks each scheduling domain to see if it is due to be balanced,
+ * and initiates a balancing operation if so.
+ *
+ * Balancing parameters are set up in arch_init_sched_domains.
+ */
+
+/* Don't have all balancing operations going off at once: */
+static inline unsigned long cpu_offset(int cpu)
+{
+ return jiffies + cpu * HZ / NR_CPUS;
+}
+
+static void
+rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
+{
+ unsigned long interval, j = cpu_offset(this_cpu);
+ struct sched_domain *sd;
for_each_domain(this_cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
@@ -2893,6 +2898,9 @@ static inline void rebalance_tick(int cp
static inline void idle_balance(int cpu, struct rq *rq)
{
}
+static inline void update_load(struct rq *this_rq)
+{
+}
#endif
static inline int wake_priority_sleeper(struct rq *rq)
@@ -3039,36 +3047,8 @@ void account_steal_time(struct task_stru
cpustat->steal = cputime64_add(cpustat->steal, tmp);
}
-/*
- * This function gets called by the timer code, with HZ frequency.
- * We call it with interrupts disabled.
- *
- * It also gets called by the fork code, when changing the parent's
- * timeslices.
- */
-void scheduler_tick(void)
+void time_slice(struct rq *rq, struct task_struct *p)
{
- unsigned long long now = sched_clock();
- struct task_struct *p = current;
- int cpu = smp_processor_id();
- struct rq *rq = cpu_rq(cpu);
-
- update_cpu_clock(p, rq, now);
-
- rq->timestamp_last_tick = now;
-
- if (p == rq->idle) {
- if (wake_priority_sleeper(rq))
- goto out;
- rebalance_tick(cpu, rq, SCHED_IDLE);
- return;
- }
-
- /* Task might have expired already, but not scheduled off yet */
- if (p->array != rq->active) {
- set_tsk_need_resched(p);
- goto out;
- }
spin_lock(&rq->lock);
/*
* The task was running during this tick - update the
@@ -3135,8 +3115,41 @@ void scheduler_tick(void)
}
out_unlock:
spin_unlock(&rq->lock);
-out:
- rebalance_tick(cpu, rq, NOT_IDLE);
+}
+
+/*
+ * This function gets called by the timer code, with HZ frequency.
+ * We call it with interrupts disabled.
+ *
+ * It also gets called by the fork code, when changing the parent's
+ * timeslices.
+ */
+void scheduler_tick(void)
+{
+ unsigned long long now = sched_clock();
+ struct task_struct *p = current;
+ int cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ enum idle_type idle = NOT_IDLE;
+
+ update_cpu_clock(p, rq, now);
+
+ rq->timestamp_last_tick = now;
+
+ if (p == rq->idle) {
+ /* Task on the idle queue */
+ if (!wake_priority_sleeper(rq))
+ idle = SCHED_IDLE;
+ } else {
+ /* Task on cpu queue */
+ if (p->array != rq->active)
+ /* Task has expired but was not scheduled yet */
+ set_tsk_need_resched(p);
+ else
+ time_slice(rq, p);
+ }
+ update_load(rq);
+ rebalance_tick(cpu, rq, idle);
}
#ifdef CONFIG_SCHED_SMT
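The restructured control flow of scheduler_tick() after this patch can be modeled as a single straight-line decision, with no gotos. This is a simplified sketch using hypothetical names; the real function then calls update_load() and rebalance_tick() unconditionally at the end:

```c
#include <assert.h>

/* Which action the tick takes for the current task (illustration only). */
enum tick_action { WOKE_SLEEPER, RESCHED_EXPIRED, TIME_SLICE };

enum tick_action tick_action(int is_idle_task, int task_expired)
{
	if (is_idle_task)
		return WOKE_SLEEPER;	/* wake_priority_sleeper(rq) */
	if (task_expired)
		return RESCHED_EXPIRED;	/* set_tsk_need_resched(p) */
	return TIME_SLICE;		/* time_slice(rq, p) */
}
```

Each branch simply returns, so load update and rebalancing can follow as common tail code instead of being reached through `goto out`.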
* Re: [PATCH 2/5] Extract load calculation from rebalance_tick
2006-10-24 18:31 ` [PATCH 2/5] Extract load calculation from rebalance_tick Christoph Lameter
@ 2006-10-26 12:03 ` Nick Piggin
2006-10-26 16:12 ` Christoph Lameter
0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-10-26 12:03 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Peter Williams, linux-kernel, Siddha, Suresh B,
Dave Chinner, Ingo Molnar, KAMEZAWA Hiroyuki
Christoph Lameter wrote:
> Extract load calculation from rebalance_tick
>
> A load calculation is always done in rebalance_tick() in addition
> to the real load balancing activities that only take place when certain
> jiffy counts have been reached. Move that processing into a separate
> function and call it directly from scheduler_tick().
Ack for this one.
>
> Also extract the time slice handling from scheduler_tick and
> put it into a separate function. Then we can clean up scheduler_tick
> significantly. It will no longer have any gotos.
'time_slice' should be static, and it should be named better, and you
may as well also put the "task has expired but not rescheduled" part
in there too. That is part of the same logical op (which is to resched
the task when it finishes timeslice).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [PATCH 2/5] Extract load calculation from rebalance_tick
2006-10-26 12:03 ` Nick Piggin
@ 2006-10-26 16:12 ` Christoph Lameter
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-10-26 16:12 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Peter Williams, linux-kernel, Siddha, Suresh B,
Dave Chinner, Ingo Molnar, KAMEZAWA Hiroyuki
On Thu, 26 Oct 2006, Nick Piggin wrote:
> 'time_slice' should be static, and it should be named better, and you
> may as well also put the "task has expired but not rescheduled" part
> in there too. That is part of the same logical op (which is to resched
> the task when it finishes timeslice).
This way?
Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c 2006-10-26 11:04:43.000000000 -0500
+++ linux-2.6.19-rc3/kernel/sched.c 2006-10-26 11:11:20.458202627 -0500
@@ -3044,8 +3044,13 @@ void account_steal_time(struct task_stru
cpustat->steal = cputime64_add(cpustat->steal, tmp);
}
-void time_slice(struct rq *rq, struct task_struct *p)
+static void task_running_tick(struct rq *rq, struct task_struct *p)
{
+ if (p->array != rq->active) {
+ /* Task has expired but was not scheduled yet */
+ set_tsk_need_resched(p);
+ return;
+ }
spin_lock(&rq->lock);
/*
* The task was running during this tick - update the
@@ -3135,14 +3140,8 @@ void scheduler_tick(void)
if (p == rq->idle)
/* Task on the idle queue */
wake_priority_sleeper(rq);
- else {
- /* Task on cpu queue */
- if (p->array != rq->active)
- /* Task has expired but was not scheduled yet */
- set_tsk_need_resched(p);
- else
- time_slice(rq, p);
- }
+ else
+ task_running_tick(rq, p);
#ifdef CONFIG_SMP
update_load(rq);
if (jiffies >= __get_cpu_var(next_balance))
* [PATCH 3/5] Use next_balance instead of last_balance
2006-10-24 18:31 [PATCH 0/5] On demand sched_domain balancing in tasklet Christoph Lameter
2006-10-24 18:31 ` [PATCH 1/5] Disable interrupts for locking in load_balance() Christoph Lameter
2006-10-24 18:31 ` [PATCH 2/5] Extract load calculation from rebalance_tick Christoph Lameter
@ 2006-10-24 18:31 ` Christoph Lameter
2006-10-26 12:13 ` Nick Piggin
2006-10-24 18:31 ` [PATCH 4/5] Create rebalance_domains from rebalance_tick Christoph Lameter
2006-10-24 18:31 ` [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick Christoph Lameter
4 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-10-24 18:31 UTC (permalink / raw)
To: akpm
Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
KAMEZAWA Hiroyuki, Dave Chinner, Ingo Molnar, Siddha, Suresh B
Use next_balance instead of last_balance ...
The cpu offset calculation in the sched_domains code makes it difficult to
figure out when the next balancing event is supposed to happen, since we
only keep track of the time of the last balancing. We want to know when
the next load balance is supposed to occur.
Move the cpu offset calculation into build_sched_domains(). Do the
setup of the staggered load balance scheduling when the sched domains
are initialized. That way we don't have to worry about it anymore later.
This also in turn simplifies the load balancing time checks.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.19-rc2-mm2/include/asm-ia64/topology.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/asm-ia64/topology.h 2006-10-24 10:37:49.925081728 -0500
+++ linux-2.6.19-rc2-mm2/include/asm-ia64/topology.h 2006-10-24 10:39:13.728037801 -0500
@@ -76,7 +76,6 @@ void build_cpu_to_node_map(void);
| SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE, \
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
@@ -102,7 +101,6 @@ void build_cpu_to_node_map(void);
| SD_BALANCE_EXEC \
| SD_BALANCE_FORK \
| SD_WAKE_BALANCE, \
- .last_balance = jiffies, \
.balance_interval = 64, \
.nr_balance_failed = 0, \
}
Index: linux-2.6.19-rc2-mm2/include/linux/sched.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/linux/sched.h 2006-10-24 10:37:49.971955705 -0500
+++ linux-2.6.19-rc2-mm2/include/linux/sched.h 2006-10-24 10:39:13.743663180 -0500
@@ -692,7 +692,7 @@ struct sched_domain {
int flags; /* See SD_* */
/* Runtime fields. */
- unsigned long last_balance; /* init to jiffies. units in jiffies */
+ unsigned long next_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
Index: linux-2.6.19-rc2-mm2/include/linux/topology.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/linux/topology.h 2006-10-24 10:37:49.979768034 -0500
+++ linux-2.6.19-rc2-mm2/include/linux/topology.h 2006-10-24 10:39:13.754405628 -0500
@@ -108,7 +108,6 @@
| SD_WAKE_AFFINE \
| SD_WAKE_IDLE \
| SD_SHARE_CPUPOWER, \
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
@@ -140,7 +139,6 @@
| SD_WAKE_AFFINE \
| SD_SHARE_PKG_RESOURCES\
| BALANCE_FOR_MC_POWER, \
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
@@ -170,7 +168,6 @@
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| BALANCE_FOR_PKG_POWER,\
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
@@ -195,7 +192,6 @@
.forkexec_idx = 0, /* unused */ \
.per_cpu_gain = 100, \
.flags = SD_LOAD_BALANCE, \
- .last_balance = jiffies, \
.balance_interval = 64, \
.nr_balance_failed = 0, \
}
Index: linux-2.6.19-rc2-mm2/kernel/sched.c
===================================================================
--- linux-2.6.19-rc2-mm2.orig/kernel/sched.c 2006-10-24 10:39:07.158552011 -0500
+++ linux-2.6.19-rc2-mm2/kernel/sched.c 2006-10-24 10:39:13.782726627 -0500
@@ -2849,17 +2849,10 @@ static void update_load(struct rq *this_
*
* Balancing parameters are set up in arch_init_sched_domains.
*/
-
-/* Don't have all balancing operations going off at once: */
-static inline unsigned long cpu_offset(int cpu)
-{
- return jiffies + cpu * HZ / NR_CPUS;
-}
-
static void
rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
{
- unsigned long interval, j = cpu_offset(this_cpu);
+ unsigned long interval;
struct sched_domain *sd;
for_each_domain(this_cpu, sd) {
@@ -2875,7 +2868,7 @@ rebalance_tick(int this_cpu, struct rq *
if (unlikely(!interval))
interval = 1;
- if (j - sd->last_balance >= interval) {
+ if (jiffies >= sd->next_balance) {
if (load_balance(this_cpu, this_rq, sd, idle)) {
/*
* We've pulled tasks over so either we're no
@@ -2884,7 +2877,7 @@ rebalance_tick(int this_cpu, struct rq *
*/
idle = NOT_IDLE;
}
- sd->last_balance += interval;
+ sd->next_balance += interval;
}
}
}
@@ -6445,6 +6438,16 @@ static void init_sched_groups_power(int
}
/*
+ * Calculate the starting jiffies value to use for each cpu. On sched domain
+ * initialization this jiffy value is used to stagger the load balancing
+ * of the cpus so that they do not all load balance at the same time.
+ */
+static inline unsigned long cpu_offset(int cpu)
+{
+ return jiffies + cpu * HZ / NR_CPUS;
+}
+
+/*
* Build sched domains for a given set of cpus and attach the sched domains
* to the individual cpus
*/
@@ -6500,6 +6503,7 @@ static int build_sched_domains(const cpu
sd->span = *cpu_map;
group = cpu_to_allnodes_group(i, cpu_map);
sd->groups = &sched_group_allnodes[group];
+ sd->next_balance = cpu_offset(i);
p = sd;
} else
p = NULL;
@@ -6508,6 +6512,7 @@ static int build_sched_domains(const cpu
*sd = SD_NODE_INIT;
sd->span = sched_domain_node_span(cpu_to_node(i));
sd->parent = p;
+ sd->next_balance = cpu_offset(i);
if (p)
p->child = sd;
cpus_and(sd->span, sd->span, *cpu_map);
@@ -6519,6 +6524,7 @@ static int build_sched_domains(const cpu
*sd = SD_CPU_INIT;
sd->span = nodemask;
sd->parent = p;
+ sd->next_balance = cpu_offset(i);
if (p)
p->child = sd;
sd->groups = &sched_group_phys[group];
@@ -6531,6 +6537,7 @@ static int build_sched_domains(const cpu
sd->span = cpu_coregroup_map(i);
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
+ sd->next_balance = cpu_offset(i);
p->child = sd;
sd->groups = &sched_group_core[group];
#endif
@@ -6543,6 +6550,7 @@ static int build_sched_domains(const cpu
sd->span = cpu_sibling_map[i];
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
+ sd->next_balance = cpu_offset(i);
p->child = sd;
sd->groups = &sched_group_cpus[group];
#endif
Index: linux-2.6.19-rc2-mm2/include/asm-i386/topology.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/asm-i386/topology.h 2006-10-24 10:37:49.988556905 -0500
+++ linux-2.6.19-rc2-mm2/include/asm-i386/topology.h 2006-10-24 10:39:13.799328592 -0500
@@ -90,7 +90,6 @@ static inline int node_to_first_cpu(int
| SD_BALANCE_EXEC \
| SD_BALANCE_FORK \
| SD_WAKE_BALANCE, \
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
Index: linux-2.6.19-rc2-mm2/include/asm-powerpc/topology.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/asm-powerpc/topology.h 2006-10-24 10:37:49.998322317 -0500
+++ linux-2.6.19-rc2-mm2/include/asm-powerpc/topology.h 2006-10-24 10:39:13.816907143 -0500
@@ -60,7 +60,6 @@ extern int pcibus_to_node(struct pci_bus
| SD_BALANCE_NEWIDLE \
| SD_WAKE_IDLE \
| SD_WAKE_BALANCE, \
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
Index: linux-2.6.19-rc2-mm2/include/asm-x86_64/topology.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/asm-x86_64/topology.h 2006-10-24 10:37:50.007111192 -0500
+++ linux-2.6.19-rc2-mm2/include/asm-x86_64/topology.h 2006-10-24 10:39:13.826673004 -0500
@@ -48,7 +48,6 @@ extern int __node_distance(int, int);
| SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_WAKE_BALANCE, \
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
Index: linux-2.6.19-rc2-mm2/include/asm-mips/mach-ip27/topology.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/asm-mips/mach-ip27/topology.h 2006-10-24 10:37:50.018829694 -0500
+++ linux-2.6.19-rc2-mm2/include/asm-mips/mach-ip27/topology.h 2006-10-24 10:39:13.841321797 -0500
@@ -33,7 +33,6 @@ extern unsigned char __node_distances[MA
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_EXEC \
| SD_WAKE_BALANCE, \
- .last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
* Re: [PATCH 3/5] Use next_balance instead of last_balance
2006-10-24 18:31 ` [PATCH 3/5] Use next_balance instead of last_balance Christoph Lameter
@ 2006-10-26 12:13 ` Nick Piggin
2006-10-26 12:32 ` Nick Piggin
0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-10-26 12:13 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
Christoph Lameter wrote:
> Use next_balance instead of last_balance ...
>
> The cpu offset calculation in the sched_domains code makes it difficult to
> figure out when the next event is supposed to happen since we only keep
> track of the last_balancing. We want to know when the next load balance
> is supposed to occur.
>
> Move the cpu offset calculation into build_sched_domains(). Do the
> setup of the staggered load balance scheduling when the sched domains
> are initialized. That way we don't have to worry about it anymore later.
>
> This also in turn simplifies the load balancing time checks.
OK. I think I made this overcomplex in order to cope with issues where
offset can get skewed so if we're unlucky they might all get into synch
... but this new code isn't any worse than the old, and it is cheaper.
So, Ack.
--
SUSE Labs, Novell Inc.
* Re: [PATCH 3/5] Use next_balance instead of last_balance
2006-10-26 12:13 ` Nick Piggin
@ 2006-10-26 12:32 ` Nick Piggin
2006-10-26 16:44 ` Christoph Lameter
0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-10-26 12:32 UTC (permalink / raw)
Cc: Christoph Lameter, akpm, Peter Williams, linux-kernel,
KAMEZAWA Hiroyuki, Dave Chinner, Ingo Molnar, Siddha, Suresh B
Nick Piggin wrote:
> Christoph Lameter wrote:
>
>> Use next_balance instead of last_balance ...
>>
>> The cpu offset calculation in the sched_domains code makes it
>> difficult to
>> figure out when the next event is supposed to happen since we only keep
>> track of the last_balancing. We want to know when the next load balance
>> is supposed to occur.
>>
>> Move the cpu offset calculation into build_sched_domains(). Do the
>> setup of the staggered load balance scheduling when the sched domains
>> are initialized. That way we don't have to worry about it anymore later.
>>
>> This also in turn simplifies the load balancing time checks.
>
>
> OK. I think I made this overcomplex in order to cope with issues where
> offset can get skewed so if we're unlucky they might all get into synch
> ... but this new code isn't any worse than the old, and it is cheaper.
>
> So, Ack.
Actually, it is wrong, so nack.
You didn't take into account that balance_interval may have changed,
and so might the idle status.
--
SUSE Labs, Novell Inc.
* Re: [PATCH 3/5] Use next_balance instead of last_balance
2006-10-26 12:32 ` Nick Piggin
@ 2006-10-26 16:44 ` Christoph Lameter
2006-10-26 17:13 ` Nick Piggin
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-10-26 16:44 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
On Thu, 26 Oct 2006, Nick Piggin wrote:
> Actually, it is wrong, so nack.
>
> You didn't take into account that balance_interval may have changed,
> and so might the idle status.
Hmmmm... We change the point at which we calculate the interval relative
to load balancing. So move it after the load balance. This also avoids
having to do the calculation if the sched_domain has not expired.
Want a new rollup/testing cycle for all of this?
Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c 2006-10-26 11:31:04.000000000 -0500
+++ linux-2.6.19-rc3/kernel/sched.c 2006-10-26 11:41:07.129561438 -0500
@@ -2867,15 +2867,6 @@ static void rebalance_domains(unsigned l
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
- interval = sd->balance_interval;
- if (idle != SCHED_IDLE)
- interval *= sd->busy_factor;
-
- /* scale ms to jiffies */
- interval = msecs_to_jiffies(interval);
- if (unlikely(!interval))
- interval = 1;
-
if (jiffies >= sd->next_balance) {
if (load_balance(this_cpu, this_rq, sd, idle)) {
/*
@@ -2885,6 +2876,14 @@ static void rebalance_domains(unsigned l
*/
idle = NOT_IDLE;
}
+ interval = sd->balance_interval;
+ if (idle != SCHED_IDLE)
+ interval *= sd->busy_factor;
+
+ /* scale ms to jiffies */
+ interval = msecs_to_jiffies(interval);
+ if (unlikely(!interval))
+ interval = 1;
sd->next_balance += interval;
}
next_balance = min(next_balance, sd->next_balance);
* Re: [PATCH 3/5] Use next_balance instead of last_balance
2006-10-26 16:44 ` Christoph Lameter
@ 2006-10-26 17:13 ` Nick Piggin
2006-10-26 18:17 ` Christoph Lameter
0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-10-26 17:13 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
Christoph Lameter wrote:
> On Thu, 26 Oct 2006, Nick Piggin wrote:
>
>
>>Actually, it is wrong, so nack.
>>
>>You didn't take into account that balance_interval may have changed,
>>and so might the idle status.
>
>
> Hmmmm... We change the point at which we calculate the interval relative
> to load balancing. So move it after the load balance. This also avoids
> having to do the calculation if the sched_domain has not expired.
That still doesn't take into account if the CPU goes idle/busy during
the interval.
>
> Want a new rollup/testing cycle for all of this?
>
> Index: linux-2.6.19-rc3/kernel/sched.c
> ===================================================================
> --- linux-2.6.19-rc3.orig/kernel/sched.c 2006-10-26 11:31:04.000000000 -0500
> +++ linux-2.6.19-rc3/kernel/sched.c 2006-10-26 11:41:07.129561438 -0500
> @@ -2867,15 +2867,6 @@ static void rebalance_domains(unsigned l
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
>
> - interval = sd->balance_interval;
> - if (idle != SCHED_IDLE)
> - interval *= sd->busy_factor;
> -
> - /* scale ms to jiffies */
> - interval = msecs_to_jiffies(interval);
> - if (unlikely(!interval))
> - interval = 1;
> -
> if (jiffies >= sd->next_balance) {
> if (load_balance(this_cpu, this_rq, sd, idle)) {
> /*
> @@ -2885,6 +2876,14 @@ static void rebalance_domains(unsigned l
> */
> idle = NOT_IDLE;
> }
> + interval = sd->balance_interval;
> + if (idle != SCHED_IDLE)
> + interval *= sd->busy_factor;
> +
> + /* scale ms to jiffies */
> + interval = msecs_to_jiffies(interval);
> + if (unlikely(!interval))
> + interval = 1;
> sd->next_balance += interval;
> }
> next_balance = min(next_balance, sd->next_balance);
>
--
SUSE Labs, Novell Inc.
* Re: [PATCH 3/5] Use next_balance instead of last_balance
2006-10-26 17:13 ` Nick Piggin
@ 2006-10-26 18:17 ` Christoph Lameter
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-10-26 18:17 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
On Fri, 27 Oct 2006, Nick Piggin wrote:
> > Hmmmm... We change the point at which we calculate the interval relative to
> > load balancing. So move it after the load balance. This also avoids having
> > to do the calculation if the sched_domain has not expired.
>
> That still doesn't take into account if the CPU goes idle/busy during
> the interval.
How does the current version take that into account? As far as I can tell
we take the busy / idle situation at the point in time when
rebalance_tick() is called. We do not track whether the cpu goes idle/busy
in the interval.
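The revised timing logic under discussion (interval recomputed after the balance runs, using the idle state seen at that tick) can be sketched as follows. This is a simplified model with hypothetical names, and it assumes balance_interval is already in jiffies, skipping the msecs_to_jiffies() conversion the real code performs:

```c
/* Simplified sched_domain: just the fields the timing logic touches. */
struct dom {
	unsigned long next_balance;	/* jiffies of next due balance */
	unsigned int interval;		/* base interval (jiffies here) */
	unsigned int busy_factor;	/* stretch factor when busy */
};

/* Sketch of the rebalance check: only when the domain is due do we
 * balance and then recompute the interval, so an idle/busy change
 * observed during the balance affects the next deadline. */
void maybe_balance(struct dom *sd, unsigned long jiffies_now, int idle)
{
	unsigned long interval;

	if (jiffies_now < sd->next_balance)
		return;			/* domain not due yet */
	/* load_balance() would run here and may clear `idle` */
	interval = sd->interval;
	if (!idle)
		interval *= sd->busy_factor;	/* busy cpus balance less often */
	if (!interval)
		interval = 1;
	sd->next_balance += interval;
}
```

Note that `idle` is still only sampled at the tick that performs the balance, which is the behavior the reply above points out: state changes between ticks are not tracked.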
* [PATCH 4/5] Create rebalance_domains from rebalance_tick
2006-10-24 18:31 [PATCH 0/5] On demand sched_domain balancing in tasklet Christoph Lameter
` (2 preceding siblings ...)
2006-10-24 18:31 ` [PATCH 3/5] Use next_balance instead of last_balance Christoph Lameter
@ 2006-10-24 18:31 ` Christoph Lameter
2006-10-26 12:19 ` Nick Piggin
2006-10-24 18:31 ` [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick Christoph Lameter
4 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-10-24 18:31 UTC (permalink / raw)
To: akpm
Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
Siddha, Suresh B, Dave Chinner, Ingo Molnar, KAMEZAWA Hiroyuki
Create rebalance_domains() from rebalance_tick().
Essentially, rebalance_domains() is rebalance_tick(); however, it
does the idle calculation on its own. This moves some processing
out of scheduler_tick() into rebalance_domains().
While we are at it, take the opportunity to avoid taking
the runqueue lock in wake_priority_sleeper() if
there are no running processes.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.19-rc2-mm2/kernel/sched.c
===================================================================
--- linux-2.6.19-rc2-mm2.orig/kernel/sched.c 2006-10-24 10:39:13.782726627 -0500
+++ linux-2.6.19-rc2-mm2/kernel/sched.c 2006-10-24 10:40:32.928543506 -0500
@@ -2842,18 +2842,22 @@ static void update_load(struct rq *this_
}
/*
- * rebalance_tick will get called every timer tick, on every CPU.
+ * rebalance_domains is called from the scheduler_tick.
*
* It checks each scheduling domain to see if it is due to be balanced,
* and initiates a balancing operation if so.
*
* Balancing parameters are set up in arch_init_sched_domains.
*/
-static void
-rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
+static void rebalance_domains(unsigned long dummy)
{
+ int this_cpu = smp_processor_id();
+ struct rq *this_rq = cpu_rq(this_cpu);
unsigned long interval;
struct sched_domain *sd;
+ /* Idle means on the idle queue without a runnable task */
+ enum idle_type idle = (this_rq->idle && !this_rq->nr_running) ?
+ SCHED_IDLE : NOT_IDLE;
for_each_domain(this_cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
@@ -2885,34 +2889,26 @@ rebalance_tick(int this_cpu, struct rq *
/*
* on UP we do not need to balance between CPUs:
*/
-static inline void rebalance_tick(int cpu, struct rq *rq, enum idle_type idle)
-{
-}
static inline void idle_balance(int cpu, struct rq *rq)
{
}
-static inline void update_load(struct rq *this_rq)
-{
-}
#endif
-static inline int wake_priority_sleeper(struct rq *rq)
+static inline void wake_priority_sleeper(struct rq *rq)
{
- int ret = 0;
-
#ifdef CONFIG_SCHED_SMT
+ if (!rq->nr_running)
+ return;
+
spin_lock(&rq->lock);
/*
* If an SMT sibling task has been put to sleep for priority
* reasons reschedule the idle task to see if it can now run.
*/
- if (rq->nr_running) {
+ if (rq->nr_running)
resched_task(rq->idle);
- ret = 1;
- }
spin_unlock(&rq->lock);
#endif
- return ret;
}
DEFINE_PER_CPU(struct kernel_stat, kstat);
@@ -3123,17 +3119,15 @@ void scheduler_tick(void)
struct task_struct *p = current;
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
- enum idle_type idle = NOT_IDLE;
update_cpu_clock(p, rq, now);
rq->timestamp_last_tick = now;
- if (p == rq->idle) {
+ if (p == rq->idle)
/* Task on the idle queue */
- if (!wake_priority_sleeper(rq))
- idle = SCHED_IDLE;
- } else {
+ wake_priority_sleeper(rq);
+ else {
/* Task on cpu queue */
if (p->array != rq->active)
/* Task has expired but was not scheduled yet */
@@ -3141,8 +3135,10 @@ void scheduler_tick(void)
else
time_slice(rq, p);
}
+#ifdef CONFIG_SMP
update_load(rq);
- rebalance_tick(cpu, rq, idle);
+ rebalance_domains(0L);
+#endif
}
#ifdef CONFIG_SCHED_SMT
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 4/5] Create rebalance_domains from rebalance_tick
2006-10-24 18:31 ` [PATCH 4/5] Create rebalance_domains from rebalance_tick Christoph Lameter
@ 2006-10-26 12:19 ` Nick Piggin
2006-10-26 16:19 ` Christoph Lameter
0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-10-26 12:19 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Peter Williams, linux-kernel, Siddha, Suresh B,
Dave Chinner, Ingo Molnar, KAMEZAWA Hiroyuki
Christoph Lameter wrote:
> Create rebalance_domains() from rebalance_tick().
>
> Essentially rebalance_domains = rebalance_tick. However, we
> do the idle calculation on our own. This moves some processing
> from scheduler_tick() into rebalance_domains().
>
> While we are at it: Take the opportunity to avoid taking
> the request queue lock in wake_priority_sleeper if
> there are no running processes.
Can you split this out? It is good without the tasklet based
rebalancing.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 4/5] Create rebalance_domains from rebalance_tick
2006-10-26 12:19 ` Nick Piggin
@ 2006-10-26 16:19 ` Christoph Lameter
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-10-26 16:19 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Peter Williams, linux-kernel, Siddha, Suresh B,
Dave Chinner, Ingo Molnar, KAMEZAWA Hiroyuki
On Thu, 26 Oct 2006, Nick Piggin wrote:
> > While we are at it: Take the opportunity to avoid taking
> > the request queue lock in wake_priority_sleeper if
> > there are no running processes.
>
> Can you split this out? It is good without the tasklet based
> rebalancing.
Sure, next rollup will have this:
Avoid taking the rq lock in wake_priority_sleeper
Avoid taking the request queue lock in wake_priority_sleeper if
there are no running processes.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c 2006-10-26 11:13:29.000000000 -0500
+++ linux-2.6.19-rc3/kernel/sched.c 2006-10-26 11:16:44.896476659 -0500
@@ -2900,6 +2900,9 @@ static inline int wake_priority_sleeper(
int ret = 0;
#ifdef CONFIG_SCHED_SMT
+ if (!rq->nr_running)
+ return 0;
+
spin_lock(&rq->lock);
/*
* If an SMT sibling task has been put to sleep for priority
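The check added here is an instance of a common pattern: test a cheap condition before taking the lock, then re-test under the lock before acting, since the state may have changed in between. A user-space sketch of the same pattern with pthreads (the struct layout and the resched_count stand-in for resched_task() are illustrative, not kernel code):

```c
#include <assert.h>
#include <pthread.h>

struct runqueue {
	pthread_mutex_t lock;
	int nr_running;
	int resched_count;	/* stands in for resched_task(rq->idle) */
};

static void wake_priority_sleeper(struct runqueue *rq)
{
	if (!rq->nr_running)	/* cheap unlocked check: nothing to do */
		return;

	pthread_mutex_lock(&rq->lock);
	if (rq->nr_running)	/* re-check under the lock before acting */
		rq->resched_count++;
	pthread_mutex_unlock(&rq->lock);
}
```

The unlocked read may race with a concurrent update, but that is harmless here: a stale result only means one skipped or one extra locked re-check, never an incorrect wakeup, which is why the early exit is safe without the lock.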
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick
2006-10-24 18:31 [PATCH 0/5] On demand sched_domain balancing in tasklet Christoph Lameter
` (3 preceding siblings ...)
2006-10-24 18:31 ` [PATCH 4/5] Create rebalance_domains from rebalance_tick Christoph Lameter
@ 2006-10-24 18:31 ` Christoph Lameter
2006-10-26 12:26 ` Nick Piggin
4 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-10-24 18:31 UTC (permalink / raw)
To: akpm
Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
KAMEZAWA Hiroyuki, Dave Chinner, Ingo Molnar, Siddha, Suresh B
Only call rebalance_domains when needed from scheduler_tick.
Call rebalance_domains from a tasklet with interrupts enabled.
Only call it when one of the sched domains is due to be rebalanced.
The jiffies value at which the next balancing action is to take
place is kept in a per cpu variable next_balance.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.19-rc2-mm2/kernel/sched.c
===================================================================
--- linux-2.6.19-rc2-mm2.orig/kernel/sched.c 2006-10-24 10:40:32.000000000 -0500
+++ linux-2.6.19-rc2-mm2/kernel/sched.c 2006-10-24 10:42:02.135978934 -0500
@@ -2841,8 +2841,11 @@ static void update_load(struct rq *this_
}
}
+static DEFINE_PER_CPU(unsigned long, next_balance);
+
/*
- * rebalance_domains is called from the scheduler_tick.
+ * rebalance_domains is triggered when needed via a tasklet from the
+ * scheduler_tick.
*
* It checks each scheduling domain to see if it is due to be balanced,
* and initiates a balancing operation if so.
@@ -2858,6 +2861,8 @@ static void rebalance_domains(unsigned l
/* Idle means on the idle queue without a runnable task */
enum idle_type idle = (this_rq->idle && !this_rq->nr_running) ?
SCHED_IDLE : NOT_IDLE;
+ /* Maximum time between calls to rebalance_domains */
+ unsigned long next_balance = jiffies + 60*HZ;
for_each_domain(this_cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
@@ -2883,8 +2888,12 @@ static void rebalance_domains(unsigned l
}
sd->next_balance += interval;
}
+ next_balance = min(next_balance, sd->next_balance);
}
+ __get_cpu_var(next_balance) = next_balance;
}
+
+DECLARE_TASKLET(rebalance, &rebalance_domains, 0L);
#else
/*
* on UP we do not need to balance between CPUs:
@@ -3137,7 +3146,8 @@ void scheduler_tick(void)
}
#ifdef CONFIG_SMP
update_load(rq);
- rebalance_domains(0L);
+ if (jiffies >= __get_cpu_var(next_balance))
+ tasklet_schedule(&rebalance);
#endif
}
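The interplay between the per-cpu next_balance value and the per-domain intervals can be modeled in plain user-space C. This is an illustrative sketch of the patch's logic, not kernel code (the struct and function names are made up): rebalance_domains() re-arms each due domain and returns the earliest jiffy at which any domain is next due, bounded by a maximum wait that mirrors the patch's jiffies + 60*HZ, and the tick side only triggers the tasklet once that jiffy has been reached.

```c
#include <assert.h>

struct dom {
	unsigned long interval;		/* ticks between balance attempts */
	unsigned long next_balance;	/* jiffy at which this domain is due */
};

/* Upper bound between calls, mirroring "jiffies + 60*HZ" in the patch. */
#define MAX_BALANCE_WAIT 60UL

/* Balance every due domain, re-arm it, and return the earliest next due
 * jiffy across all domains. */
static unsigned long rebalance_domains(struct dom *doms, int n,
				       unsigned long jiffies)
{
	unsigned long next = jiffies + MAX_BALANCE_WAIT;

	for (int i = 0; i < n; i++) {
		if (jiffies >= doms[i].next_balance)
			/* due: balance (elided) and re-arm */
			doms[i].next_balance = jiffies + doms[i].interval;
		if (doms[i].next_balance < next)
			next = doms[i].next_balance;
	}
	return next;	/* caller stores this in the per-cpu variable */
}

/* scheduler_tick() side: schedule the tasklet only when a domain is due. */
static int should_rebalance(unsigned long jiffies, unsigned long next_balance)
{
	return jiffies >= next_balance;
}
```

The point of returning the minimum is that most ticks then do a single compare against the per-cpu value instead of walking the domain hierarchy, and the tasklet runs with interrupts enabled only when there is actual balancing work.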
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick
2006-10-24 18:31 ` [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick Christoph Lameter
@ 2006-10-26 12:26 ` Nick Piggin
2006-10-26 16:24 ` Christoph Lameter
0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-10-26 12:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
Christoph Lameter wrote:
> Only call rebalance_domains when needed from scheduler_tick.
>
> Call rebalance_domains from a tasklet with interrupts enabled.
> Only call it when one of the sched domains is to be rebalanced.
> The jiffies when the next balancing action is to take place is
> kept in a per cpu variable next_balance.
sched-domains was supposed to be able to build a whacky topology
so you didn't have to take the occasional big latency hit when
scanning 512 CPUs...
Ideas were: overlapping, non-covering top level domains, or a
SD_BALANCE_ROTOR, which scans only N (< all) groups on each
balance attempt, but more frequently.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick
2006-10-26 12:26 ` Nick Piggin
@ 2006-10-26 16:24 ` Christoph Lameter
2006-10-26 17:12 ` Nick Piggin
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-10-26 16:24 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
On Thu, 26 Oct 2006, Nick Piggin wrote:
> > Call rebalance_domains from a tasklet with interrupts enabled.
> > Only call it when one of the sched domains is to be rebalanced.
> > The jiffies when the next balancing action is to take place is
> > kept in a per cpu variable next_balance.
>
> sched-domains was supposed to be able to build a whacky topology
> so you didn't have to take the occasional big latency hit when
> scanning 512 CPUs...
How is that supposed to work? The load calculations will be off
in that case and also the load balancing algorithm won't work anymore.
This is going to be a pretty significant rework of how the scheduler
works but given the problems with pinned tasks... maybe that is
necessary?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick
2006-10-26 16:24 ` Christoph Lameter
@ 2006-10-26 17:12 ` Nick Piggin
2006-10-26 18:13 ` Christoph Lameter
0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-10-26 17:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
Christoph Lameter wrote:
> On Thu, 26 Oct 2006, Nick Piggin wrote:
>
>
>>>Call rebalance_domains from a tasklet with interrupts enabled.
>>>Only call it when one of the sched domains is to be rebalanced.
>>>The jiffies when the next balancing action is to take place is
>>>kept in a per cpu variable next_balance.
>>
>>sched-domains was supposed to be able to build a whacky topology
>>so you didn't have to take the occasional big latency hit when
>>scanning 512 CPUs...
>
>
> How is that supposed to work? The load calculations will be off
> in that case and also the load balancing algorithm won't work anymore.
> This is going to be a pretty significant rework of how the scheduler
> works but given the problems with pinned tasks... maybe that is
> necessary?
What will the problem be? Sure it may pull tasks from one group to
another when both could actually be pulling from a third, but the
load balancing algorithm should work fine and not require any
rework.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/5] Only call rebalance_domains when needed from scheduler_tick
2006-10-26 17:12 ` Nick Piggin
@ 2006-10-26 18:13 ` Christoph Lameter
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-10-26 18:13 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Peter Williams, linux-kernel, KAMEZAWA Hiroyuki,
Dave Chinner, Ingo Molnar, Siddha, Suresh B
On Fri, 27 Oct 2006, Nick Piggin wrote:
> > > sched-domains was supposed to be able to build a whacky topology
> > > so you didn't have to take the occasional big latency hit when
> > > scanning 512 CPUs...
> >
> >
> > How is that supposed to work? The load calculations will be off
> > in that case and also the load balancing algorithm won't work anymore. This
> > is going to be a pretty significant rework of how the scheduler works but
> > given the problems with pinned tasks... maybe that is necessary?
>
> What will the problem be? Sure it may pull tasks from one group to
> another when both could actually be pulling from a third, but the
> load balancing algorithm should work fine and not require any
> rework.
Hmmm....
I think we already have what you want if we would disable the allnodes
domain. The next sched domain layer contains 16 surrounding nodes
from which it can pull processes. If there would be severe overload on one
node then processes would be gradually migrated away from it but it would
require multiple migration steps.
Nevertheless, I still think we need this patchset because of the general
interrupt holdoff issue. The livelock is an extreme case, but in general we
do not want long interrupt holdoffs; we want to keep the period in which we
execute with interrupts off minimal.
^ permalink raw reply [flat|nested] 19+ messages in thread