public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/7] sched_domain balancing via tasklet V2
@ 2006-10-28  2:41 Christoph Lameter
  2006-10-28  2:41 ` [PATCH 1/7] Avoid taking rq lock in wake_priority_sleeper Christoph Lameter
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	Siddha, Suresh B, Ingo Molnar, KAMEZAWA Hiroyuki

This patchset moves potentially expensive load balancing out of the
scheduler tick (where we run with interrupts disabled) into a tasklet
that is triggered if necessary from scheduler_tick(). Load balancing
then runs with interrupts enabled. This eliminates interrupt holdoff
times and avoids a potential machine livelock if, e.g., load balancing
is performed over a large number of processors and many of the nodes
experience heavy load, which may lead to delays in fetching cachelines.
We currently have up to 1024 processors and may go up to 4096 soon.
Similar issues were seen on a Fujitsu system in the past.

However, this issue also highlights the general problem of interrupt
holdoff during scheduler load balancing.

Moving the load balancing into a tasklet also allows some cleanup in
scheduler_tick(). It becomes easier to read, and the determination of
the state for load balancing can be moved out of scheduler_tick().

scheduler_tick() processing is further optimized because we no longer
check all the sched domains on each tick. Instead we determine the time
of the next load balancing during each load balancing run and check
against that single value in scheduler_tick().

Another optimization is that the staggering of the individual load
balance operations is no longer done during load balancing but is
shifted to the setup of the sched domains.

For the earlier discussion see:
http://marc.theaimsgroup.com/?t=116119187800002&r=1&w=2
V1: http://marc.theaimsgroup.com/?l=linux-kernel&m=116171494001548&w=2


V1-V2:
- Keep last_balance and calculate the next balancing from that start
  point.
- Move more code into time_slice calculation and rename time_slice()
  to task_running_tick().
- Separate out the wake_priority_sleeper optimization as a first patch.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/7] Avoid taking rq lock in wake_priority_sleeper
  2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
@ 2006-10-28  2:41 ` Christoph Lameter
  2006-10-28  2:41 ` [PATCH 2/7] Disable interrupts for locking in load_balance() Christoph Lameter
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	KAMEZAWA Hiroyuki, Ingo Molnar, Siddha, Suresh B

Avoid taking the rq lock in wake_priority_sleeper

Avoid taking the runqueue lock in wake_priority_sleeper() if
there are no running processes.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-26 21:30:11.328325096 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-27 11:58:02.142767971 -0500
@@ -2898,6 +2898,9 @@ static inline int wake_priority_sleeper(
 	int ret = 0;
 
 #ifdef CONFIG_SCHED_SMT
+	if (!rq->nr_running)
+		return 0;
+
 	spin_lock(&rq->lock);
 	/*
 	 * If an SMT sibling task has been put to sleep for priority

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/7] Disable interrupts for locking in load_balance()
  2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
  2006-10-28  2:41 ` [PATCH 1/7] Avoid taking rq lock in wake_priority_sleeper Christoph Lameter
@ 2006-10-28  2:41 ` Christoph Lameter
  2006-10-28  2:41 ` [PATCH 3/7] Extract load calculation from rebalance_tick Christoph Lameter
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	Siddha, Suresh B, Ingo Molnar, KAMEZAWA Hiroyuki

scheduler: Disable interrupts for locking in load_balance()

Interrupts must be disabled for the runqueue locks if we want
to run load_balance() with interrupts enabled.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-23 18:02:02.000000000 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-25 13:28:12.653874252 -0500
@@ -2529,8 +2529,6 @@ static inline unsigned long minus_1_or_z
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
- *
- * Called with this_rq unlocked.
  */
 static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum idle_type idle)
@@ -2540,6 +2538,7 @@ static int load_balance(int this_cpu, st
 	unsigned long imbalance;
 	struct rq *busiest;
 	cpumask_t cpus = CPU_MASK_ALL;
+	unsigned long flags;
 
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -2579,11 +2578,13 @@ redo:
 		 * still unbalanced. nr_moved simply stays zero, so it is
 		 * correctly treated as an imbalance.
 		 */
+		local_irq_save(flags);
 		double_rq_lock(this_rq, busiest);
 		nr_moved = move_tasks(this_rq, this_cpu, busiest,
 				      minus_1_or_zero(busiest->nr_running),
 				      imbalance, sd, idle, &all_pinned);
 		double_rq_unlock(this_rq, busiest);
+		local_irq_restore(flags);
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(all_pinned)) {
@@ -2600,13 +2601,13 @@ redo:
 
 		if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
 
-			spin_lock(&busiest->lock);
+			spin_lock_irqsave(&busiest->lock, flags);
 
 			/* don't kick the migration_thread, if the curr
 			 * task on busiest cpu can't be moved to this_cpu
 			 */
 			if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
-				spin_unlock(&busiest->lock);
+				spin_unlock_irqrestore(&busiest->lock, flags);
 				all_pinned = 1;
 				goto out_one_pinned;
 			}
@@ -2616,7 +2617,7 @@ redo:
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			spin_unlock(&busiest->lock);
+			spin_unlock_irqrestore(&busiest->lock, flags);
 			if (active_balance)
 				wake_up_process(busiest->migration_thread);
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/7] Extract load calculation from rebalance_tick
  2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
  2006-10-28  2:41 ` [PATCH 1/7] Avoid taking rq lock in wake_priority_sleeper Christoph Lameter
  2006-10-28  2:41 ` [PATCH 2/7] Disable interrupts for locking in load_balance() Christoph Lameter
@ 2006-10-28  2:41 ` Christoph Lameter
  2006-10-28  2:41 ` [PATCH 4/7] Stagger load balancing in build_sched_domains Christoph Lameter
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	KAMEZAWA Hiroyuki, Ingo Molnar, Siddha, Suresh B

Extract load calculation from rebalance_tick

A load calculation is always done in rebalance_tick(), in addition
to the real load balancing activities that only take place when certain
jiffies counts have been reached. Move that processing into a separate
function and call it directly from scheduler_tick().

Also extract the time slice handling from scheduler_tick() and
put it into a separate function. Then we can clean up scheduler_tick()
significantly; it will no longer have any gotos.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-27 13:24:29.354896541 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-27 13:40:39.256955830 -0500
@@ -2816,27 +2816,10 @@ static void active_load_balance(struct r
 	spin_unlock(&target_rq->lock);
 }
 
-/*
- * rebalance_tick will get called every timer tick, on every CPU.
- *
- * It checks each scheduling domain to see if it is due to be balanced,
- * and initiates a balancing operation if so.
- *
- * Balancing parameters are set up in arch_init_sched_domains.
- */
-
-/* Don't have all balancing operations going off at once: */
-static inline unsigned long cpu_offset(int cpu)
+static void update_load(struct rq *this_rq)
 {
-	return jiffies + cpu * HZ / NR_CPUS;
-}
-
-static void
-rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
-{
-	unsigned long this_load, interval, j = cpu_offset(this_cpu);
-	struct sched_domain *sd;
 	int i, scale;
+	unsigned long this_load;
 
 	this_load = this_rq->raw_weighted_load;
 
@@ -2855,6 +2838,28 @@ rebalance_tick(int this_cpu, struct rq *
 			new_load += scale-1;
 		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) / scale;
 	}
+}
+
+/*
+ * rebalance_tick will get called every timer tick, on every CPU.
+ *
+ * It checks each scheduling domain to see if it is due to be balanced,
+ * and initiates a balancing operation if so.
+ *
+ * Balancing parameters are set up in arch_init_sched_domains.
+ */
+
+/* Don't have all balancing operations going off at once: */
+static inline unsigned long cpu_offset(int cpu)
+{
+	return jiffies + cpu * HZ / NR_CPUS;
+}
+
+static void
+rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
+{
+	unsigned long interval, j = cpu_offset(this_cpu);
+	struct sched_domain *sd;
 
 	for_each_domain(this_cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
@@ -2886,12 +2891,15 @@ rebalance_tick(int this_cpu, struct rq *
 /*
  * on UP we do not need to balance between CPUs:
  */
-static inline void rebalance_tick(int cpu, struct rq *rq, enum idle_type idle)
+static inline void rebalance_tick(int cpu, struct rq *rq)
 {
 }
 static inline void idle_balance(int cpu, struct rq *rq)
 {
 }
+static inline void update_load(struct rq *this_rq)
+{
+}
 #endif
 
 static inline int wake_priority_sleeper(struct rq *rq)
@@ -3041,35 +3049,12 @@ void account_steal_time(struct task_stru
 		cpustat->steal = cputime64_add(cpustat->steal, tmp);
 }
 
-/*
- * This function gets called by the timer code, with HZ frequency.
- * We call it with interrupts disabled.
- *
- * It also gets called by the fork code, when changing the parent's
- * timeslices.
- */
-void scheduler_tick(void)
+static void task_running_tick(struct rq *rq, struct task_struct *p)
 {
-	unsigned long long now = sched_clock();
-	struct task_struct *p = current;
-	int cpu = smp_processor_id();
-	struct rq *rq = cpu_rq(cpu);
-
-	update_cpu_clock(p, rq, now);
-
-	rq->timestamp_last_tick = now;
-
-	if (p == rq->idle) {
-		if (wake_priority_sleeper(rq))
-			goto out;
-		rebalance_tick(cpu, rq, SCHED_IDLE);
-		return;
-	}
-
-	/* Task might have expired already, but not scheduled off yet */
 	if (p->array != rq->active) {
+		/* Task has expired but was not scheduled yet */
 		set_tsk_need_resched(p);
-		goto out;
+		return;
 	}
 	spin_lock(&rq->lock);
 	/*
@@ -3137,8 +3122,35 @@ void scheduler_tick(void)
 	}
 out_unlock:
 	spin_unlock(&rq->lock);
-out:
-	rebalance_tick(cpu, rq, NOT_IDLE);
+}
+
+/*
+ * This function gets called by the timer code, with HZ frequency.
+ * We call it with interrupts disabled.
+ *
+ * It also gets called by the fork code, when changing the parent's
+ * timeslices.
+ */
+void scheduler_tick(void)
+{
+	unsigned long long now = sched_clock();
+	struct task_struct *p = current;
+	int cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	enum idle_type idle = NOT_IDLE;
+
+	update_cpu_clock(p, rq, now);
+
+	rq->timestamp_last_tick = now;
+
+	if (p == rq->idle) {
+		/* Task on the idle queue */
+		if (!wake_priority_sleeper(rq))
+			idle = SCHED_IDLE;
+	} else
+		task_running_tick(rq, p);
+	update_load(rq);
+	rebalance_tick(cpu, rq, idle);
 }
 
 #ifdef CONFIG_SCHED_SMT

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 4/7] Stagger load balancing in build_sched_domains
  2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
                   ` (2 preceding siblings ...)
  2006-10-28  2:41 ` [PATCH 3/7] Extract load calculation from rebalance_tick Christoph Lameter
@ 2006-10-28  2:41 ` Christoph Lameter
  2006-10-28  2:41 ` [PATCH 5/7] Move idle stat calculation into rebalance_tick() Christoph Lameter
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	Siddha, Suresh B, Ingo Molnar, KAMEZAWA Hiroyuki

Stagger load balancing in build_sched_domains

Instead of dealing with the staggering of load balancing during
actual load balancing, we do it once when the sched domains are
set up.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-27 15:35:40.221104772 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-27 15:37:36.526511796 -0500
@@ -2848,17 +2848,10 @@ static void update_load(struct rq *this_
  *
  * Balancing parameters are set up in arch_init_sched_domains.
  */
-
-/* Don't have all balancing operations going off at once: */
-static inline unsigned long cpu_offset(int cpu)
-{
-	return jiffies + cpu * HZ / NR_CPUS;
-}
-
 static void
 rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
 {
-	unsigned long interval, j = cpu_offset(this_cpu);
+	unsigned long interval;
 	struct sched_domain *sd;
 
 	for_each_domain(this_cpu, sd) {
@@ -2874,7 +2867,7 @@ rebalance_tick(int this_cpu, struct rq *
 		if (unlikely(!interval))
 			interval = 1;
 
-		if (j - sd->last_balance >= interval) {
+		if (jiffies - sd->last_balance >= interval) {
 			if (load_balance(this_cpu, this_rq, sd, idle)) {
 				/*
 				 * We've pulled tasks over so either we're no
@@ -6327,6 +6320,16 @@ static void init_sched_groups_power(int 
 }
 
 /*
+ * Calculate jiffies start to use for each cpu. On sched domain
+ * initialization this jiffy value is used to stagger the load balancing
+ * of the cpus so that they do not load balance all at the same time.
+ */
+static inline unsigned long cpu_offset(int cpu)
+{
+	return jiffies + cpu * HZ / NR_CPUS;
+}
+
+/*
  * Build sched domains for a given set of cpus and attach the sched domains
  * to the individual cpus
  */
@@ -6382,6 +6385,7 @@ static int build_sched_domains(const cpu
 			sd->span = *cpu_map;
 			group = cpu_to_allnodes_group(i, cpu_map);
 			sd->groups = &sched_group_allnodes[group];
+			sd->last_balance = cpu_offset(i);
 			p = sd;
 		} else
 			p = NULL;
@@ -6390,6 +6394,7 @@ static int build_sched_domains(const cpu
 		*sd = SD_NODE_INIT;
 		sd->span = sched_domain_node_span(cpu_to_node(i));
 		sd->parent = p;
+		sd->last_balance = cpu_offset(i);
 		if (p)
 			p->child = sd;
 		cpus_and(sd->span, sd->span, *cpu_map);
@@ -6401,6 +6406,7 @@ static int build_sched_domains(const cpu
 		*sd = SD_CPU_INIT;
 		sd->span = nodemask;
 		sd->parent = p;
+		sd->last_balance = cpu_offset(i);
 		if (p)
 			p->child = sd;
 		sd->groups = &sched_group_phys[group];
@@ -6413,6 +6419,7 @@ static int build_sched_domains(const cpu
 		sd->span = cpu_coregroup_map(i);
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
+		sd->last_balance = cpu_offset(i);
 		p->child = sd;
 		sd->groups = &sched_group_core[group];
 #endif
@@ -6425,6 +6432,7 @@ static int build_sched_domains(const cpu
 		sd->span = cpu_sibling_map[i];
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
+		sd->last_balance = cpu_offset(i);
 		p->child = sd;
 		sd->groups = &sched_group_cpus[group];
 #endif

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 5/7] Move idle stat calculation into rebalance_tick()
  2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
                   ` (3 preceding siblings ...)
  2006-10-28  2:41 ` [PATCH 4/7] Stagger load balancing in build_sched_domains Christoph Lameter
@ 2006-10-28  2:41 ` Christoph Lameter
  2006-10-28 17:57   ` Siddha, Suresh B
  2006-10-28  2:41 ` [PATCH 6/7] Use tasklet to call balancing Christoph Lameter
  2006-10-28  2:41 ` [PATCH 7/7] Call tasklet less frequently Christoph Lameter
  6 siblings, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	KAMEZAWA Hiroyuki, Ingo Molnar, Siddha, Suresh B

Move the idle state calculation into rebalance_tick() itself instead
of passing it in from scheduler_tick().

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-27 15:43:45.467245352 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-27 15:45:30.794096498 -0500
@@ -2849,10 +2849,16 @@ static void update_load(struct rq *this_
  * Balancing parameters are set up in arch_init_sched_domains.
  */
 static void
-rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
+rebalance_tick(int this_cpu, struct rq *this_rq)
 {
 	unsigned long interval;
 	struct sched_domain *sd;
+	/*
+	 * A task is idle if this is the idle queue
+	 * and we have no runnable task
+	 */
+	enum idle_type idle = (this_rq->idle && !this_rq->nr_running) ?
+				SCHED_IDLE : NOT_IDLE;
 
 	for_each_domain(this_cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
@@ -2884,37 +2890,26 @@ rebalance_tick(int this_cpu, struct rq *
 /*
  * on UP we do not need to balance between CPUs:
  */
-static inline void rebalance_tick(int cpu, struct rq *rq)
-{
-}
 static inline void idle_balance(int cpu, struct rq *rq)
 {
 }
-static inline void update_load(struct rq *this_rq)
-{
-}
 #endif
 
-static inline int wake_priority_sleeper(struct rq *rq)
+static inline void wake_priority_sleeper(struct rq *rq)
 {
-	int ret = 0;
-
 #ifdef CONFIG_SCHED_SMT
 	if (!rq->nr_running)
-		return 0;
+		return;
 
 	spin_lock(&rq->lock);
 	/*
 	 * If an SMT sibling task has been put to sleep for priority
 	 * reasons reschedule the idle task to see if it can now run.
 	 */
-	if (rq->nr_running) {
+	if (rq->nr_running)
 		resched_task(rq->idle);
-		ret = 1;
-	}
 	spin_unlock(&rq->lock);
 #endif
-	return ret;
 }
 
 DEFINE_PER_CPU(struct kernel_stat, kstat);
@@ -3130,20 +3125,20 @@ void scheduler_tick(void)
 	struct task_struct *p = current;
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
-	enum idle_type idle = NOT_IDLE;
 
 	update_cpu_clock(p, rq, now);
 
 	rq->timestamp_last_tick = now;
 
-	if (p == rq->idle) {
+	if (p == rq->idle)
 		/* Task on the idle queue */
-		if (!wake_priority_sleeper(rq))
-			idle = SCHED_IDLE;
-	} else
+		wake_priority_sleeper(rq);
+	else
 		task_running_tick(rq, p);
+#ifdef CONFIG_SMP
 	update_load(rq);
-	rebalance_tick(cpu, rq, idle);
+	rebalance_tick(cpu, rq);
+#endif
 }
 
 #ifdef CONFIG_SCHED_SMT

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 6/7] Use tasklet to call balancing
  2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
                   ` (4 preceding siblings ...)
  2006-10-28  2:41 ` [PATCH 5/7] Move idle stat calculation into rebalance_tick() Christoph Lameter
@ 2006-10-28  2:41 ` Christoph Lameter
  2006-10-28  2:41 ` [PATCH 7/7] Call tasklet less frequently Christoph Lameter
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	Siddha, Suresh B, Ingo Molnar, KAMEZAWA Hiroyuki

Use a tasklet to balance sched domains.

Call rebalance_tick() (renamed to rebalance_domains()) from a tasklet.

We calculate the earliest time at which each layer of sched domains
has to be rescanned (this is the rescan time for the idle case) and
use the earliest of those times to schedule the tasklet again, via a
new field "next_balance" added to struct rq.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-27 15:45:30.000000000 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-27 20:12:42.225666940 -0500
@@ -227,6 +227,7 @@ struct rq {
 	unsigned long expired_timestamp;
 	unsigned long long timestamp_last_tick;
 	struct task_struct *curr, *idle;
+	unsigned long next_balance;
 	struct mm_struct *prev_mm;
 	struct prio_array *active, *expired, arrays[2];
 	int best_expired_prio;
@@ -2841,16 +2842,18 @@ static void update_load(struct rq *this_
 }
 
 /*
- * rebalance_tick will get called every timer tick, on every CPU.
+ * rebalance_domains is triggered when needed via a tasklet from the
+ * scheduler tick.
  *
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
  *
  * Balancing parameters are set up in arch_init_sched_domains.
  */
-static void
-rebalance_tick(int this_cpu, struct rq *this_rq)
+static void rebalance_domains(unsigned long dummy)
 {
+	int this_cpu = smp_processor_id();
+	struct rq *this_rq = cpu_rq(this_cpu);
 	unsigned long interval;
 	struct sched_domain *sd;
 	/*
@@ -2859,6 +2862,8 @@ rebalance_tick(int this_cpu, struct rq *
 	 */
 	enum idle_type idle = (this_rq->idle && !this_rq->nr_running) ?
 				SCHED_IDLE : NOT_IDLE;
+	/* Earliest time when we have to call rebalance_domains again */
+	unsigned long next_balance = jiffies + 60*HZ;
 
 	for_each_domain(this_cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
@@ -2884,8 +2889,13 @@ rebalance_tick(int this_cpu, struct rq *
 			}
 			sd->last_balance += interval;
 		}
+		next_balance = min(next_balance,
+				sd->last_balance + sd->balance_interval);
 	}
+	this_rq->next_balance = next_balance;
 }
+
+DECLARE_TASKLET(rebalance, &rebalance_domains, 0L);
 #else
 /*
  * on UP we do not need to balance between CPUs:
@@ -3137,7 +3147,8 @@ void scheduler_tick(void)
 		task_running_tick(rq, p);
 #ifdef CONFIG_SMP
 	update_load(rq);
-	rebalance_tick(cpu, rq);
+	if (jiffies >= rq->next_balance)
+		tasklet_schedule(&rebalance);
 #endif
 }
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 7/7] Call tasklet less frequently
  2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
                   ` (5 preceding siblings ...)
  2006-10-28  2:41 ` [PATCH 6/7] Use tasklet to call balancing Christoph Lameter
@ 2006-10-28  2:41 ` Christoph Lameter
  6 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-28  2:41 UTC (permalink / raw)
  To: akpm
  Cc: Peter Williams, linux-kernel, Nick Piggin, Christoph Lameter,
	KAMEZAWA Hiroyuki, Ingo Molnar, Siddha, Suresh B

Schedule the load balance tasklet less frequently

Before this patch we always schedule the tasklet based on
sd->balance_interval. However, if the queue is busy then it is
sufficient to schedule the tasklet with sd->balance_interval *
busy_factor.

So we modify the calculation of the next balance time by again adding
the interval actually used to last_balance. This is only the right
value if the idle/busy situation continues as is.

There are two potential trouble spots:
- If the queue was idle and now becomes busy then we call rebalance
  early. However, that is not a problem because we will then use
  the longer interval for the next period.

- If the queue was busy and becomes idle then we potentially
  wait too long before rebalancing. However, when the cpu
  goes idle, idle_balance() is called. We add another calculation
  of the next balance time based on sd->balance_interval in
  idle_balance() so that we will rebalance soon.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-27 20:36:37.269918493 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-27 20:41:10.765080822 -0500
@@ -2757,14 +2757,26 @@ out_balanced:
 static void idle_balance(int this_cpu, struct rq *this_rq)
 {
 	struct sched_domain *sd;
+	int pulled_task = 0;
+	unsigned long next_balance = jiffies + 60 *  HZ;
 
 	for_each_domain(this_cpu, sd) {
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 			/* If we've pulled tasks over stop searching: */
-			if (load_balance_newidle(this_cpu, this_rq, sd))
+			pulled_task = load_balance_newidle(this_cpu,
+							this_rq, sd);
+			next_balance = min(next_balance,
+				sd->last_balance + sd->balance_interval);
+			if (pulled_task)
 				break;
 		}
 	}
+	if (!pulled_task)
+		/*
+		 * We are going idle. next_balance may be set based on
+		 * a busy processor. So reset next_balance.
+		 */
+		this_rq->next_balance = next_balance;
 }
 
 /*
@@ -2889,8 +2901,16 @@ static void rebalance_domains(unsigned l
 			}
 			sd->last_balance += interval;
 		}
+		/*
+		 * Calculate the next balancing point assuming that
+		 * the idle state does not change. If we are idle and then
+		 * start running a process then this will be recalculated.
+		 * If we are running a process and then become idle
+		 * then idle_balance will reset next_balance so that we
+		 * rebalance earlier.
+		 */
 		next_balance = min(next_balance,
-				sd->last_balance + sd->balance_interval);
+				sd->last_balance + interval);
 	}
 	this_rq->next_balance = next_balance;
 }

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5/7] Move idle stat calculation into rebalance_tick()
  2006-10-28  2:41 ` [PATCH 5/7] Move idle stat calculation into rebalance_tick() Christoph Lameter
@ 2006-10-28 17:57   ` Siddha, Suresh B
  2006-10-29  1:05     ` Christoph Lameter
  0 siblings, 1 reply; 10+ messages in thread
From: Siddha, Suresh B @ 2006-10-28 17:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Peter Williams, linux-kernel, Nick Piggin,
	KAMEZAWA Hiroyuki, Ingo Molnar, Siddha, Suresh B

On Fri, Oct 27, 2006 at 07:41:38PM -0700, Christoph Lameter wrote:
> Index: linux-2.6.19-rc3/kernel/sched.c
> ===================================================================
> --- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-27 15:43:45.467245352 -0500
> +++ linux-2.6.19-rc3/kernel/sched.c	2006-10-27 15:45:30.794096498 -0500
> @@ -2849,10 +2849,16 @@ static void update_load(struct rq *this_
>   * Balancing parameters are set up in arch_init_sched_domains.
>   */
>  static void
> -rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
> +rebalance_tick(int this_cpu, struct rq *this_rq)
>  {
>  	unsigned long interval;
>  	struct sched_domain *sd;
> +	/*
> +	 * A task is idle if this is the idle queue
> +	 * and we have no runnable task
> +	 */
> +	enum idle_type idle = (this_rq->idle && !this_rq->nr_running) ?
> +				SCHED_IDLE : NOT_IDLE;

this_rq->idle will always be set to the idle task. You wanted to check
if the current task is idle or not, right? Perhaps we can skip that and
just check nr_running..

comment needs to be fixed, and please also mention that in the case of
SMT nice, nr_running determines whether the processor is idle or not
(rather than checking whether the current task is idle)

thanks,
suresh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5/7] Move idle stat calculation into rebalance_tick()
  2006-10-28 17:57   ` Siddha, Suresh B
@ 2006-10-29  1:05     ` Christoph Lameter
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2006-10-29  1:05 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: akpm, Peter Williams, linux-kernel, Nick Piggin,
	KAMEZAWA Hiroyuki, Ingo Molnar

On Sat, 28 Oct 2006, Siddha, Suresh B wrote:

> comment needs to be fixed, and please also mention that in the case of
> SMT nice, nr_running determines whether the processor is idle or not
> (rather than checking whether the current task is idle)

Ah.. Thanks! Would this be okay?

Index: linux-2.6.19-rc3/kernel/sched.c
===================================================================
--- linux-2.6.19-rc3.orig/kernel/sched.c	2006-10-28 20:00:07.000000000 -0500
+++ linux-2.6.19-rc3/kernel/sched.c	2006-10-28 20:04:08.721364884 -0500
@@ -2869,10 +2869,10 @@ static void rebalance_domains(unsigned l
 	unsigned long interval;
 	struct sched_domain *sd;
 	/*
-	 * A task is idle if this is the idle queue
-	 * and we have no runnable task
+	 * We are idle if there are no processes running. This
+	 * is valid even if we are the idle process (SMT).
 	 */
-	enum idle_type idle = (this_rq->idle && !this_rq->nr_running) ?
+	enum idle_type idle = !this_rq->nr_running ?
 				SCHED_IDLE : NOT_IDLE;
 	/* Earliest time when we have to call rebalance_domains again */
 	unsigned long next_balance = jiffies + 60*HZ;


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2006-10-29  1:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-10-28  2:41 [PATCH 0/7] sched_domain balancing via tasklet V2 Christoph Lameter
2006-10-28  2:41 ` [PATCH 1/7] Avoid taking rq lock in wake_priority_sleeper Christoph Lameter
2006-10-28  2:41 ` [PATCH 2/7] Disable interrupts for locking in load_balance() Christoph Lameter
2006-10-28  2:41 ` [PATCH 3/7] Extract load calculation from rebalance_tick Christoph Lameter
2006-10-28  2:41 ` [PATCH 4/7] Stagger load balancing in build_sched_domains Christoph Lameter
2006-10-28  2:41 ` [PATCH 5/7] Move idle stat calculation into rebalance_tick() Christoph Lameter
2006-10-28 17:57   ` Siddha, Suresh B
2006-10-29  1:05     ` Christoph Lameter
2006-10-28  2:41 ` [PATCH 6/7] Use tasklet to call balancing Christoph Lameter
2006-10-28  2:41 ` [PATCH 7/7] Call tasklet less frequently Christoph Lameter
