* [PATCH 0/8] RT: scheduler migration/wakeup enhancements
@ 2007-11-05 23:45 Gregory Haskins
2007-11-05 23:45 ` [PATCH 1/8] RT: Consistency cleanup for this_rq usage Gregory Haskins
` (7 more replies)
0 siblings, 8 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
Ingo, Steven, Thomas,
Please consider this series for inclusion in 23-rt6, as it has shown a
substantial improvement in our local testing. Independent verification
and/or comments/review are more than welcome.
-Greg
---------
RT: scheduler migration/wakeup enhancements
This series applies to 23.1-rt5 and includes numerous tweaks to the
scheduler. The primary goal of this set of patches is to further improve
wake-up latencies (particularly on larger SMP systems) and decrease migration
overhead. This is accomplished by making improvements on several fronts:
1) We factor in CPU topology in the routing decision to pick the best
migration target according to cache hierarchy.
2) We move some CFS load-calculation code into a member function of the CFS
sched_class. This removes unnecessary work from the RT fast path for
tasks in the RT sched_class.
3) We make further improvements against non-migratable tasks by factoring in
the RQ overload state, instead of just the RQ depth.
4) We replace the linear priority search with a 2-d algorithm.
In past -rt releases, latencies could quickly become abysmal on larger SMP (8+
cpu) systems, on the order of 350us+. The recent work in -rt2 and -rt4 dropped
this figure by a large margin, bringing things down to approximately 120us.
This new series improves upon that work even further, bringing latencies down
to the sub-80us mark on our reference 8-way Intel C2D 5335 Xeon.
These figures were obtained by simultaneous execution of:
# ./cyclictest -n -p 90 -t 8 -d 100 -i 100
# while true; do make mrproper; make alldefconfig; make -j 128; done
for long durations. The following are the results from one particular run,
though the results are similar across various short and long term trials in
our labs.
23.1-rt5-baseline
--------------------------
138.60 110.62 70.96 23/808 10246
T: 0 ( 5179) P:90 I:100 C:9011636 Min: 2 Act: 4 Avg: 4 Max: 117
T: 1 ( 5180) P:89 I:200 C:4505819 Min: 2 Act: 5 Avg: 4 Max: 95
T: 2 ( 5181) P:88 I:300 C:3003879 Min: 2 Act: 9 Avg: 5 Max: 85
T: 3 ( 5182) P:87 I:400 C:2252910 Min: 2 Act: 3 Avg: 4 Max: 75
T: 4 ( 5183) P:86 I:500 C:1802328 Min: 2 Act: 7 Avg: 5 Max: 71
T: 5 ( 5184) P:85 I:600 C:1501940 Min: 2 Act: 4 Avg: 5 Max: 74
T: 6 ( 5185) P:84 I:700 C:1287377 Min: 2 Act: 6 Avg: 6 Max: 85
T: 7 ( 5186) P:83 I:800 C:1126455 Min: 2 Act: 4 Avg: 5 Max: 75
23.1-rt5-gh
--------------------------
146.47 127.99 85.35 30/815 32289
T: 0 ( 5027) P:90 I:100 C:10856538 Min: 2 Act: 4 Avg: 4 Max: 60
T: 1 ( 5028) P:89 I:200 C:5428270 Min: 2 Act: 7 Avg: 5 Max: 57
T: 2 ( 5029) P:88 I:300 C:3618846 Min: 2 Act: 5 Avg: 5 Max: 48
T: 3 ( 5030) P:87 I:400 C:2714135 Min: 2 Act: 7 Avg: 5 Max: 61
T: 4 ( 5031) P:86 I:500 C:2171308 Min: 2 Act: 6 Avg: 6 Max: 51
T: 5 ( 5032) P:85 I:600 C:1809424 Min: 2 Act: 5 Avg: 7 Max: 59
T: 6 ( 5033) P:84 I:700 C:1550935 Min: 2 Act: 6 Avg: 6 Max: 54
T: 7 ( 5034) P:83 I:800 C:1357068 Min: 2 Act: 7 Avg: 6 Max: 62
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/8] RT: Consistency cleanup for this_rq usage
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
2007-11-05 23:45 ` [PATCH 2/8] RT: Remove some CFS specific code from the wakeup path of RT tasks Gregory Haskins
` (6 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
"this_rq" is normally used to denote the RQ on the current cpu
(i.e. "cpu_rq(this_cpu)"). So clean up the usage of this_rq to be
more consistent with the rest of the code.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 46 +++++++++++++++++++++++-----------------------
1 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 9b06d7c..0348423 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -307,7 +307,7 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
/* Will lock the rq it finds */
static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *this_rq)
+ struct rq *rq)
{
struct rq *lowest_rq = NULL;
cpumask_t cpu_mask;
@@ -321,21 +321,21 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
* Scan each rq for the lowest prio.
*/
for_each_cpu_mask(cpu, cpu_mask) {
- struct rq *rq = &per_cpu(runqueues, cpu);
+ struct rq *curr_rq = &per_cpu(runqueues, cpu);
- if (cpu == this_rq->cpu)
+ if (cpu == rq->cpu)
continue;
/* We look for lowest RT prio or non-rt CPU */
- if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- lowest_rq = rq;
+ if (curr_rq->rt.highest_prio >= MAX_RT_PRIO) {
+ lowest_rq = curr_rq;
break;
}
/* no locking for now */
- if (rq->rt.highest_prio > task->prio &&
- (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
- lowest_rq = rq;
+ if (curr_rq->rt.highest_prio > task->prio &&
+ (!lowest_rq || curr_rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+ lowest_rq = curr_rq;
}
}
@@ -343,16 +343,16 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
break;
/* if the prio of this runqueue changed, try again */
- if (double_lock_balance(this_rq, lowest_rq)) {
+ if (double_lock_balance(rq, lowest_rq)) {
/*
* We had to unlock the run queue. In
* the mean time, task could have
* migrated already or had its affinity changed.
* Also make sure that it wasn't scheduled on its rq.
*/
- if (unlikely(task_rq(task) != this_rq ||
+ if (unlikely(task_rq(task) != rq ||
!cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
- task_running(this_rq, task) ||
+ task_running(rq, task) ||
!task->se.on_rq)) {
spin_unlock(&lowest_rq->lock);
lowest_rq = NULL;
@@ -377,21 +377,21 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
* running task can migrate over to a CPU that is running a task
* of lesser priority.
*/
-static int push_rt_task(struct rq *this_rq)
+static int push_rt_task(struct rq *rq)
{
struct task_struct *next_task;
struct rq *lowest_rq;
int ret = 0;
int paranoid = RT_MAX_TRIES;
- assert_spin_locked(&this_rq->lock);
+ assert_spin_locked(&rq->lock);
- next_task = pick_next_highest_task_rt(this_rq, -1);
+ next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
retry:
- if (unlikely(next_task == this_rq->curr)) {
+ if (unlikely(next_task == rq->curr)) {
WARN_ON(1);
return 0;
}
@@ -401,24 +401,24 @@ static int push_rt_task(struct rq *this_rq)
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < this_rq->curr->prio)) {
- resched_task(this_rq->curr);
+ if (unlikely(next_task->prio < rq->curr->prio)) {
+ resched_task(rq->curr);
return 0;
}
- /* We might release this_rq lock */
+ /* We might release rq lock */
get_task_struct(next_task);
/* find_lock_lowest_rq locks the rq if found */
- lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+ lowest_rq = find_lock_lowest_rq(next_task, rq);
if (!lowest_rq) {
struct task_struct *task;
/*
- * find lock_lowest_rq releases this_rq->lock
+ * find lock_lowest_rq releases rq->lock
* so it is possible that next_task has changed.
* If it has, then try again.
*/
- task = pick_next_highest_task_rt(this_rq, -1);
+ task = pick_next_highest_task_rt(rq, -1);
if (unlikely(task != next_task) && task && paranoid--) {
put_task_struct(next_task);
next_task = task;
@@ -429,13 +429,13 @@ static int push_rt_task(struct rq *this_rq)
assert_spin_locked(&lowest_rq->lock);
- deactivate_task(this_rq, next_task, 0);
+ deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
resched_task(lowest_rq->curr);
- schedstat_inc(this_rq, rto_pushed);
+ schedstat_inc(rq, rto_pushed);
spin_unlock(&lowest_rq->lock);
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/8] RT: Remove some CFS specific code from the wakeup path of RT tasks
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
2007-11-05 23:45 ` [PATCH 1/8] RT: Consistency cleanup for this_rq usage Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
2007-11-05 23:45 ` [PATCH 3/8] RT: Break out the search function Gregory Haskins
` (5 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
The current wake-up code path tries to determine whether it can optimize the
wake-up to "this_cpu" by performing load calculations. The problem is that
these calculations are only relevant to CFS tasks, where load is king. For RT
tasks, priority is king, so the load calculation is completely wasted
bandwidth.
Therefore, we create a new sched_class interface to help with pre-wakeup
routing decisions and move the load calculation into the CFS class as a
sched_class method.
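For illustration, here is a minimal, self-contained user-space sketch of the
shape of this hook. The names and types are simplified stand-ins of this
sketch, not the actual kernel symbols; the point is that the wakeup path
dispatches the routing question through the task's class, so the RT class can
answer without any load math.

#include <stdio.h>

struct task;

struct sched_class_ops {
        int (*select_task_rq)(struct task *p, int sync);
};

struct task {
        const struct sched_class_ops *sched_class;
        int cpu;                        /* cpu the task last ran on */
};

/* CFS flavor: would consult load/imbalance heuristics (elided here) */
static int select_task_rq_fair_sketch(struct task *p, int sync)
{
        (void)sync;
        return p->cpu;
}

/* RT flavor: priority-driven, no load calculation at all */
static int select_task_rq_rt_sketch(struct task *p, int sync)
{
        (void)sync;
        return p->cpu;
}

static const struct sched_class_ops fair_ops = {
        .select_task_rq = select_task_rq_fair_sketch,
};
static const struct sched_class_ops rt_ops = {
        .select_task_rq = select_task_rq_rt_sketch,
};

int main(void)
{
        struct task cfs_task = { .sched_class = &fair_ops, .cpu = 1 };
        struct task rt_task  = { .sched_class = &rt_ops,   .cpu = 3 };

        /* try_to_wake_up()-style call site: one indirect call per wakeup */
        printf("cfs task -> cpu %d\n",
               cfs_task.sched_class->select_task_rq(&cfs_task, 0));
        printf("rt task  -> cpu %d\n",
               rt_task.sched_class->select_task_rq(&rt_task, 0));
        return 0;
}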
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
include/linux/sched.h | 1
kernel/sched.c | 135 ++++-------------------------------------------
kernel/sched_fair.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_idletask.c | 9 +++
kernel/sched_rt.c | 10 +++
5 files changed, 165 insertions(+), 124 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe3fd1d..cfef8fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1033,6 +1033,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq, struct task_struct *p);
+ int (*select_task_rq)(struct task_struct *p, int sync);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 49e68be..d16c686 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -859,6 +859,10 @@ static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
int *all_pinned, unsigned long *load_moved,
int *this_best_prio, struct rq_iterator *iterator);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_avg_load_per_task(int cpu);
+
#include "sched_stats.h"
#include "sched_rt.c"
#include "sched_fair.c"
@@ -1266,7 +1270,7 @@ void kick_process(struct task_struct *p)
* We want to under-estimate the load of migration sources, to
* balance conservatively.
*/
-static inline unsigned long source_load(int cpu, int type)
+static unsigned long source_load(int cpu, int type)
{
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1281,7 +1285,7 @@ static inline unsigned long source_load(int cpu, int type)
* Return a high guess at the load of a migration-target cpu weighted
* according to the scheduling class and "nice" value.
*/
-static inline unsigned long target_load(int cpu, int type)
+static unsigned long target_load(int cpu, int type)
{
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1295,7 +1299,7 @@ static inline unsigned long target_load(int cpu, int type)
/*
* Return the average load per task on the cpu's run queue
*/
-static inline unsigned long cpu_avg_load_per_task(int cpu)
+static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1454,53 +1458,6 @@ static int sched_balance_self(int cpu, int flag)
#endif /* CONFIG_SMP */
-/*
- * wake_idle() will wake a task on an idle cpu if task->cpu is
- * not idle and an idle cpu is available. The span of cpus to
- * search starts with cpus closest then further out as needed,
- * so we always favor a closer, idle cpu.
- *
- * Returns the CPU we should wake onto.
- */
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-static int wake_idle(int cpu, struct task_struct *p)
-{
- cpumask_t tmp;
- struct sched_domain *sd;
- int i;
-
- /*
- * If it is idle, then it is the best cpu to run this task.
- *
- * This cpu is also the best, if it has more than one task already.
- * Siblings must be also busy(in most cases) as they didn't already
- * pickup the extra load from this cpu and hence we need not check
- * sibling runqueue info. This will avoid the checks and cache miss
- * penalities associated with that.
- */
- if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
- return cpu;
-
- for_each_domain(cpu, sd) {
- if (sd->flags & SD_WAKE_IDLE) {
- cpus_and(tmp, sd->span, p->cpus_allowed);
- for_each_cpu_mask(i, tmp) {
- if (idle_cpu(i))
- return i;
- }
- } else {
- break;
- }
- }
- return cpu;
-}
-#else
-static inline int wake_idle(int cpu, struct task_struct *p)
-{
- return cpu;
-}
-#endif
-
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -1523,8 +1480,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int sync, int mutex)
long old_state;
struct rq *rq;
#ifdef CONFIG_SMP
- struct sched_domain *sd, *this_sd = NULL;
- unsigned long load, this_load;
int new_cpu;
#endif
@@ -1551,81 +1506,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int sync, int mutex)
if (unlikely(task_running(rq, p)))
goto out_activate;
- new_cpu = cpu;
-
schedstat_inc(rq, ttwu_cnt);
- if (cpu == this_cpu) {
+ if (cpu == this_cpu)
schedstat_inc(rq, ttwu_local);
- goto out_set_cpu;
- }
-
- for_each_domain(this_cpu, sd) {
- if (cpu_isset(cpu, sd->span)) {
- schedstat_inc(sd, ttwu_wake_remote);
- this_sd = sd;
- break;
- }
- }
-
- if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
- goto out_set_cpu;
-
- /*
- * Check for affine wakeup and passive balancing possibilities.
- */
- if (this_sd) {
- int idx = this_sd->wake_idx;
- unsigned int imbalance;
-
- imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
-
- load = source_load(cpu, idx);
- this_load = target_load(this_cpu, idx);
-
- new_cpu = this_cpu; /* Wake to this CPU if we can */
-
- if (this_sd->flags & SD_WAKE_AFFINE) {
- unsigned long tl = this_load;
- unsigned long tl_per_task;
-
- tl_per_task = cpu_avg_load_per_task(this_cpu);
-
- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync)
- tl -= current->se.load.weight;
-
- if ((tl <= load &&
- tl + target_load(cpu, idx) <= tl_per_task) ||
- 100*(tl + p->se.load.weight) <= imbalance*load) {
- /*
- * This domain has SD_WAKE_AFFINE and
- * p is cache cold in this domain, and
- * there is no bad imbalance.
- */
- schedstat_inc(this_sd, ttwu_move_affine);
- goto out_set_cpu;
- }
- }
-
- /*
- * Start passive balancing when half the imbalance_pct
- * limit is reached.
- */
- if (this_sd->flags & SD_WAKE_BALANCE) {
- if (imbalance*this_load <= 100*load) {
- schedstat_inc(this_sd, ttwu_move_balance);
- goto out_set_cpu;
- }
- }
- }
+ else
+ schedstat_inc(rq->sd, ttwu_wake_remote);
- new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
-out_set_cpu:
- new_cpu = wake_idle(new_cpu, p);
+ new_cpu = p->sched_class->select_task_rq(p, sync);
if (new_cpu != cpu) {
set_task_cpu(p, new_cpu);
task_rq_unlock(rq, &flags);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 67c67a8..4a5af80 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -974,6 +974,137 @@ static void yield_task_fair(struct rq *rq, struct task_struct *p)
}
/*
+ * wake_idle() will wake a task on an idle cpu if task->cpu is
+ * not idle and an idle cpu is available. The span of cpus to
+ * search starts with cpus closest then further out as needed,
+ * so we always favor a closer, idle cpu.
+ *
+ * Returns the CPU we should wake onto.
+ */
+#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
+static int wake_idle(int cpu, struct task_struct *p)
+{
+ cpumask_t tmp;
+ struct sched_domain *sd;
+ int i;
+
+ /*
+ * If it is idle, then it is the best cpu to run this task.
+ *
+ * This cpu is also the best, if it has more than one task already.
+ * Siblings must be also busy(in most cases) as they didn't already
+ * pickup the extra load from this cpu and hence we need not check
+ * sibling runqueue info. This will avoid the checks and cache miss
+ * penalities associated with that.
+ */
+ if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
+ return cpu;
+
+ for_each_domain(cpu, sd) {
+ if (sd->flags & SD_WAKE_IDLE) {
+ cpus_and(tmp, sd->span, p->cpus_allowed);
+ for_each_cpu_mask(i, tmp) {
+ if (idle_cpu(i))
+ return i;
+ }
+ } else {
+ break;
+ }
+ }
+ return cpu;
+}
+#else
+static inline int wake_idle(int cpu, struct task_struct *p)
+{
+ return cpu;
+}
+#endif
+
+#ifdef CONFIG_SMP
+static int select_task_rq_fair(struct task_struct *p, int sync)
+{
+ int cpu, this_cpu;
+ struct rq *rq;
+ struct sched_domain *sd, *this_sd = NULL;
+ int new_cpu;
+
+ cpu = task_cpu(p);
+ rq = task_rq(p);
+ this_cpu = smp_processor_id();
+ new_cpu = cpu;
+
+ for_each_domain(this_cpu, sd) {
+ if (cpu_isset(cpu, sd->span)) {
+ this_sd = sd;
+ break;
+ }
+ }
+
+ if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
+ goto out_set_cpu;
+
+ /*
+ * Check for affine wakeup and passive balancing possibilities.
+ */
+ if (this_sd) {
+ int idx = this_sd->wake_idx;
+ unsigned int imbalance;
+ unsigned long load, this_load;
+
+ imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
+
+ load = source_load(cpu, idx);
+ this_load = target_load(this_cpu, idx);
+
+ new_cpu = this_cpu; /* Wake to this CPU if we can */
+
+ if (this_sd->flags & SD_WAKE_AFFINE) {
+ unsigned long tl = this_load;
+ unsigned long tl_per_task;
+
+ tl_per_task = cpu_avg_load_per_task(this_cpu);
+
+ /*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync)
+ tl -= current->se.load.weight;
+
+ if ((tl <= load &&
+ tl + target_load(cpu, idx) <= tl_per_task) ||
+ 100*(tl + p->se.load.weight) <= imbalance*load) {
+ /*
+ * This domain has SD_WAKE_AFFINE and
+ * p is cache cold in this domain, and
+ * there is no bad imbalance.
+ */
+ schedstat_inc(this_sd, ttwu_move_affine);
+ goto out_set_cpu;
+ }
+ }
+
+ /*
+ * Start passive balancing when half the imbalance_pct
+ * limit is reached.
+ */
+ if (this_sd->flags & SD_WAKE_BALANCE) {
+ if (imbalance*this_load <= 100*load) {
+ schedstat_inc(this_sd, ttwu_move_balance);
+ goto out_set_cpu;
+ }
+ }
+ }
+
+ new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
+out_set_cpu:
+ return wake_idle(new_cpu, p);
+}
+#endif /* CONFIG_SMP */
+
+
+/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
@@ -1219,6 +1350,9 @@ struct sched_class fair_sched_class __read_mostly = {
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_fair,
+#endif /* CONFIG_SMP */
.check_preempt_curr = check_preempt_curr_fair,
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 3503fb2..6fc6ead 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -5,6 +5,12 @@
* handled in sched_fair.c)
*/
+#ifdef CONFIG_SMP
+static int select_task_rq_idle(struct task_struct *p, int sync)
+{
+ return task_cpu(p); /* IDLE tasks as never migrated */
+}
+#endif /* CONFIG_SMP */
/*
* Idle tasks are unconditionally rescheduled:
*/
@@ -58,6 +64,9 @@ static struct sched_class idle_sched_class __read_mostly = {
/* dequeue is not valid, we print a debug message there: */
.dequeue_task = dequeue_task_idle,
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_idle,
+#endif /* CONFIG_SMP */
.check_preempt_curr = check_preempt_curr_idle,
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0348423..f1fc1b4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -202,6 +202,13 @@ yield_task_rt(struct rq *rq, struct task_struct *p)
requeue_task_rt(rq, p);
}
+#ifdef CONFIG_SMP
+static int select_task_rq_rt(struct task_struct *p, int sync)
+{
+ return task_cpu(p);
+}
+#endif /* CONFIG_SMP */
+
/*
* Preempt the current task with a newly woken task if needed:
*/
@@ -718,6 +725,9 @@ static struct sched_class rt_sched_class __read_mostly = {
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_rt,
+#endif /* CONFIG_SMP */
.check_preempt_curr = check_preempt_curr_rt,
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/8] RT: Break out the search function
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
2007-11-05 23:45 ` [PATCH 1/8] RT: Consistency cleanup for this_rq usage Gregory Haskins
2007-11-05 23:45 ` [PATCH 2/8] RT: Remove some CFS specific code from the wakeup path of RT tasks Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
2007-11-05 23:45 ` [PATCH 4/8] RT: Allow current_cpu to be included in search Gregory Haskins
` (4 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
Isolate the search logic into a function so that it can be used later
in places other than find_lock_lowest_rq().
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 62 ++++++++++++++++++++++++++++++++---------------------
1 files changed, 37 insertions(+), 25 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f1fc1b4..fbe7b8a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -312,43 +312,55 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
return next;
}
-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *rq)
+static int find_lowest_rq(struct task_struct *task)
{
- struct rq *lowest_rq = NULL;
- cpumask_t cpu_mask;
int cpu;
- int tries;
+ cpumask_t cpu_mask;
+ struct rq *lowest_rq = NULL;
cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
- for (tries = 0; tries < RT_MAX_TRIES; tries++) {
- /*
- * Scan each rq for the lowest prio.
- */
- for_each_cpu_mask(cpu, cpu_mask) {
- struct rq *curr_rq = &per_cpu(runqueues, cpu);
+ /*
+ * Scan each rq for the lowest prio.
+ */
+ for_each_cpu_mask(cpu, cpu_mask) {
+ struct rq *rq = cpu_rq(cpu);
- if (cpu == rq->cpu)
- continue;
+ if (cpu == rq->cpu)
+ continue;
- /* We look for lowest RT prio or non-rt CPU */
- if (curr_rq->rt.highest_prio >= MAX_RT_PRIO) {
- lowest_rq = curr_rq;
- break;
- }
+ /* We look for lowest RT prio or non-rt CPU */
+ if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+ lowest_rq = rq;
+ break;
+ }
- /* no locking for now */
- if (curr_rq->rt.highest_prio > task->prio &&
- (!lowest_rq || curr_rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
- lowest_rq = curr_rq;
- }
+ /* no locking for now */
+ if (rq->rt.highest_prio > task->prio &&
+ (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+ lowest_rq = rq;
}
+ }
+
+ return lowest_rq ? lowest_rq->cpu : -1;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+ struct rq *rq)
+{
+ struct rq *lowest_rq = NULL;
+ int cpu;
+ int tries;
+
+ for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+ cpu = find_lowest_rq(task);
- if (!lowest_rq)
+ if (cpu == -1)
break;
+ lowest_rq = cpu_rq(cpu);
+
/* if the prio of this runqueue changed, try again */
if (double_lock_balance(rq, lowest_rq)) {
/*
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 4/8] RT: Allow current_cpu to be included in search
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
` (2 preceding siblings ...)
2007-11-05 23:45 ` [PATCH 3/8] RT: Break out the search function Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
2007-11-05 23:45 ` [PATCH 5/8] RT: Pre-route RT tasks on wakeup Gregory Haskins
` (3 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
It doesn't hurt if we allow the current CPU to be included in the
search. We will simply skip it later if the current CPU turns out
to be the lowest.
We will use this capability later in the series.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index fbe7b8a..7dd67db 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -326,9 +326,6 @@ static int find_lowest_rq(struct task_struct *task)
for_each_cpu_mask(cpu, cpu_mask) {
struct rq *rq = cpu_rq(cpu);
- if (cpu == rq->cpu)
- continue;
-
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
lowest_rq = rq;
@@ -356,7 +353,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
for (tries = 0; tries < RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);
- if (cpu == -1)
+ if ((cpu == -1) || (cpu == rq->cpu))
break;
lowest_rq = cpu_rq(cpu);
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 5/8] RT: Pre-route RT tasks on wakeup
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
` (3 preceding siblings ...)
2007-11-05 23:45 ` [PATCH 4/8] RT: Allow current_cpu to be included in search Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
2007-11-05 23:45 ` [PATCH 6/8] RT: Optimize our cpu selection based on topology Gregory Haskins
` (2 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
In the original patch series that Steven Rostedt and I worked on together,
we each took a different approach to the low-priority wakeup path. I used a
"pre-routing" approach (push the task away to a less important RQ before
activating it), while Steve used a "post-routing" approach. The advantage of
my approach is that it avoids the overhead of a wasted activate/deactivate
cycle and its peripherally related burdens. The advantage of Steve's method
is that it neatly solves an issue that prevented a "pull" optimization from
being deployed.
We ultimately deployed Steve's idea. But it later dawned on me that we could
get the best of both worlds by deploying both ideas together, albeit slightly
modified.
The idea is simple: use a "light-weight" lookup for pre-routing, since we
only need to approximate a good home for the task, and retain the
post-routing push logic to clean up any "priority mistargeting" introduced
by the lightweight lookup. Most of the time, the pre-routing should work and
yield lower overhead. In the cases where it doesn't, the post-router will
bat cleanup.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 7dd67db..43d4ea6 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -202,9 +202,28 @@ yield_task_rt(struct rq *rq, struct task_struct *p)
requeue_task_rt(rq, p);
}
+static int find_lowest_rq(struct task_struct *task);
+
#ifdef CONFIG_SMP
static int select_task_rq_rt(struct task_struct *p, int sync)
{
+ struct rq *rq = task_rq(p);
+
+ /*
+ * If the task will not preempt the RQ, try to find a better RQ
+ * before we even activate the task
+ */
+ if ((p->prio >= rq->rt.highest_prio)
+ && (p->nr_cpus_allowed > 1)) {
+ int cpu = find_lowest_rq(p);
+
+ return (cpu == -1) ? task_cpu(p) : cpu;
+ }
+
+ /*
+ * Otherwise, just let it ride on the affined RQ and the
+ * post-schedule router will push the preempted task away
+ */
return task_cpu(p);
}
#endif /* CONFIG_SMP */
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 6/8] RT: Optimize our cpu selection based on topology
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
` (4 preceding siblings ...)
2007-11-05 23:45 ` [PATCH 5/8] RT: Pre-route RT tasks on wakeup Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
2007-11-05 23:45 ` [PATCH 7/8] RT: Optimize rebalancing Gregory Haskins
2007-11-05 23:45 ` [PATCH 8/8] RT: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally. In the real world, there are
various topologies and affinities that govern where a task is best suited to
run with the smallest amount of overhead. NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.
Fortunately, Linux is already structured to represent these topologies via
the sched_domains interface. So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.
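To make the selection order concrete, here is a small stand-alone sketch.
Plain 64-bit bitmask cpumasks and an array of domain spans ordered from
closest to widest are assumptions of the sketch, not the kernel
representation: prefer the task's last CPU if it is among the candidates,
otherwise walk outward through the topology, preferring the current CPU
within each span.

#include <stdio.h>

typedef unsigned long long cpumask;     /* one bit per cpu, up to 64 cpus */

static int first_cpu(cpumask mask)
{
        for (int cpu = 0; cpu < 64; cpu++)
                if (mask & (1ULL << cpu))
                        return cpu;
        return -1;
}

static int pick_optimal_cpu(int this_cpu, cpumask mask)
{
        /* preempting locally is cheaper than kicking a remote cpu */
        if (this_cpu != -1 && (mask & (1ULL << this_cpu)))
                return this_cpu;
        return first_cpu(mask);
}

static int find_best_cpu(cpumask lowest_mask, int task_cpu, int this_cpu,
                         const cpumask *domain_span, int nr_domains)
{
        if (lowest_mask & (1ULL << task_cpu))
                return task_cpu;        /* likely still cache-hot here */

        if (this_cpu == task_cpu)
                this_cpu = -1;          /* no local advantage in that case */

        /* walk the spans from closest to widest, staying near warm caches */
        for (int d = 0; d < nr_domains; d++) {
                int cpu = pick_optimal_cpu(this_cpu,
                                           domain_span[d] & lowest_mask);
                if (cpu != -1)
                        return cpu;
        }
        return pick_optimal_cpu(this_cpu, lowest_mask);
}

int main(void)
{
        cpumask spans[] = { 0x3ULL, 0xfULL };   /* core pair, then socket */

        /* lowest-prio cpus are 2 and 3; task last ran on 0, we are on 1 */
        printf("push target: cpu %d\n",
               find_best_cpu(0xcULL, 0, 1, spans, 2));
        return 0;
}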
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched.c | 1 +
kernel/sched_rt.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 88 insertions(+), 12 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index d16c686..8a27f09 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -26,6 +26,7 @@
* 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri
* 2007-10-22 RT overload balancing by Steven Rostedt
* (with thanks to Gregory Haskins)
+ * 2007-11-05 RT migration/wakeup tuning by Gregory Haskins
*/
#include <linux/mm.h>
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 43d4ea6..6ba5921 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -331,34 +331,109 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
return next;
}
-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
- int cpu;
- cpumask_t cpu_mask;
- struct rq *lowest_rq = NULL;
+ int cpu;
+ cpumask_t valid_mask;
+ int lowest_prio = -1;
+ int ret = 0;
- cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
+ cpus_clear(*lowest_mask);
+ cpus_and(valid_mask, cpu_online_map, task->cpus_allowed);
/*
* Scan each rq for the lowest prio.
*/
- for_each_cpu_mask(cpu, cpu_mask) {
+ for_each_cpu_mask(cpu, valid_mask) {
struct rq *rq = cpu_rq(cpu);
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- lowest_rq = rq;
- break;
+ if (ret)
+ cpus_clear(*lowest_mask);
+ cpu_set(rq->cpu, *lowest_mask);
+ return 1;
}
/* no locking for now */
- if (rq->rt.highest_prio > task->prio &&
- (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
- lowest_rq = rq;
+ if ((rq->rt.highest_prio > task->prio)
+ && (rq->rt.highest_prio >= lowest_prio)) {
+ if (rq->rt.highest_prio > lowest_prio) {
+ /* new low - clear old data */
+ lowest_prio = rq->rt.highest_prio;
+ cpus_clear(*lowest_mask);
+ }
+ cpu_set(rq->cpu, *lowest_mask);
+ ret = 1;
+ }
+ }
+
+ return ret;
+}
+
+static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
+{
+ int first;
+
+ /* "this_cpu" is cheaper to preempt than a remote processor */
+ if ((this_cpu != -1) && cpu_isset(this_cpu, *mask))
+ return this_cpu;
+
+ first = first_cpu(*mask);
+ if (first != NR_CPUS)
+ return first;
+
+ return -1;
+}
+
+static int find_lowest_rq(struct task_struct *task)
+{
+ struct sched_domain *sd;
+ cpumask_t lowest_mask;
+ int this_cpu = smp_processor_id();
+ int cpu = task_cpu(task);
+
+ if (!find_lowest_cpus(task, &lowest_mask))
+ return -1;
+
+ /*
+ * At this point we have built a mask of cpus representing the
+ * lowest priority tasks in the system. Now we want to elect
+ * the best one based on our affinity and topology.
+ *
+ * We prioritize the last cpu that the task executed on since
+ * it is most likely cache-hot in that location.
+ */
+ if (cpu_isset(cpu, lowest_mask))
+ return cpu;
+
+ /*
+ * Otherwise, we consult the sched_domains span maps to figure
+ * out which cpu is logically closest to our hot cache data.
+ */
+ if (this_cpu == cpu)
+ this_cpu = -1; /* Skip this_cpu opt if the same */
+
+ for_each_domain(cpu, sd) {
+ if (sd->flags & SD_WAKE_AFFINE) {
+ cpumask_t domain_mask;
+ int best_cpu;
+
+ cpus_and(domain_mask, sd->span, lowest_mask);
+
+ best_cpu = pick_optimal_cpu(this_cpu,
+ &domain_mask);
+ if (best_cpu != -1)
+ return best_cpu;
}
}
- return lowest_rq ? lowest_rq->cpu : -1;
+ /*
+ * And finally, if there were no matches within the domains
+ * just give the caller *something* to work with from the compatible
+ * locations.
+ */
+ return pick_optimal_cpu(this_cpu, &lowest_mask);
}
/* Will lock the rq it finds */
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 7/8] RT: Optimize rebalancing
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
` (5 preceding siblings ...)
2007-11-05 23:45 ` [PATCH 6/8] RT: Optimize our cpu selection based on topology Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
2007-11-05 23:45 ` [PATCH 8/8] RT: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away. So we add support
for considering this information.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched.c | 2 ++
kernel/sched_rt.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 8a27f09..0eced8c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -268,6 +268,7 @@ struct rt_rq {
unsigned long rt_nr_uninterruptible;
/* highest queued rt task prio */
int highest_prio;
+ int overloaded;
};
/*
@@ -6869,6 +6870,7 @@ void __init sched_init(void)
rq->migration_thread = NULL;
INIT_LIST_HEAD(&rq->migration_queue);
rq->rt.highest_prio = MAX_RT_PRIO;
+ rq->rt.overloaded = 0;
#endif
atomic_set(&rq->nr_iowait, 0);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 6ba5921..698f4d9 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -28,10 +28,12 @@ static inline cpumask_t *rt_overload(struct rq *rq)
static inline void rt_set_overload(struct rq *rq)
{
cpu_set(rq->cpu, *rt_overload_mask(rq->cpu));
+ rq->rt.overloaded = 1;
}
static inline void rt_clear_overload(struct rq *rq)
{
cpu_clear(rq->cpu, *rt_overload_mask(rq->cpu));
+ rq->rt.overloaded = 0;
}
static void update_rt_migration(struct task_struct *p, struct rq *rq)
@@ -496,6 +498,9 @@ static int push_rt_task(struct rq *rq)
assert_spin_locked(&rq->lock);
+ if (!rq->rt.overloaded)
+ return 0;
+
next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
@@ -737,7 +742,7 @@ static void schedule_tail_balance_rt(struct rq *rq)
* the lock was owned by prev, we need to release it
* first via finish_lock_switch and then reaquire it here.
*/
- if (unlikely(rq->rt.rt_nr_running > 1)) {
+ if (unlikely(rq->rt.overloaded)) {
spin_lock(&rq->lock);
push_rt_tasks(rq);
schedstat_inc(rq, rto_schedule_tail);
@@ -749,7 +754,8 @@ static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
{
if (unlikely(rt_task(p)) &&
!task_running(rq, p) &&
- (p->prio >= rq->curr->prio)) {
+ (p->prio >= rq->rt.highest_prio) &&
+ rq->rt.overloaded) {
push_rt_tasks(rq);
schedstat_inc(rq, rto_wakeup);
}
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 8/8] RT: Use a 2-d bitmap for searching lowest-pri CPU
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
` (6 preceding siblings ...)
2007-11-05 23:45 ` [PATCH 7/8] RT: Optimize rebalancing Gregory Haskins
@ 2007-11-05 23:45 ` Gregory Haskins
7 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2007-11-05 23:45 UTC (permalink / raw)
To: rostedt
Cc: mingo, dvhltc, zijlstra, linux-kernel, linux-rt-users, ghaskins
The current code uses a linear algorithm, which causes scaling issues
on larger SMP machines. This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.
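A minimal user-space sketch of the idea (the names, the 64-cpu masks, and the
byte-array "active" flags below are simplifications of this sketch, not the
actual cpupri layout, which also keeps per-vector locks): each priority level
carries a mask of the CPUs currently sitting at that level plus an "active"
flag, so a search touches at most one populated level per priority class
instead of scanning every runqueue.

#include <stdio.h>
#include <string.h>

#define NR_LEVELS 102                   /* idle, normal, rt0..rt99 */

struct cpupri {
        unsigned long long level_cpus[NR_LEVELS];  /* cpus at each level */
        unsigned char      level_active[NR_LEVELS];/* "some cpu is here" */
        int                cpu_level[64];
};

static void cpupri_set(struct cpupri *cp, int cpu, int newlevel)
{
        int old = cp->cpu_level[cpu];

        cp->level_cpus[old] &= ~(1ULL << cpu);
        if (!cp->level_cpus[old])
                cp->level_active[old] = 0;

        cp->level_cpus[newlevel] |= 1ULL << cpu;
        cp->level_active[newlevel] = 1;
        cp->cpu_level[cpu] = newlevel;
}

/*
 * Return a mask of candidate cpus sitting at a lower level than the waking
 * task, restricted to the task's affinity.
 */
static unsigned long long
cpupri_find(const struct cpupri *cp, int task_level, unsigned long long allowed)
{
        for (int lvl = 0; lvl < task_level; lvl++) {
                if (!cp->level_active[lvl])
                        continue;
                unsigned long long mask = cp->level_cpus[lvl] & allowed;
                if (mask)
                        return mask;
        }
        return 0;
}

int main(void)
{
        struct cpupri cp;

        memset(&cp, 0, sizeof(cp));
        for (int cpu = 0; cpu < 4; cpu++)
                cpupri_set(&cp, cpu, 1);        /* all cpus run non-RT work */
        cpupri_set(&cp, 2, 50);                 /* cpu 2 now runs an RT task */

        /* a waking task at level 60 may displace cpus at levels below 60 */
        printf("candidates: 0x%llx\n", cpupri_find(&cp, 60, 0xfULL));
        return 0;
}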
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/Makefile | 1
kernel/sched.c | 4 +
kernel/sched_cpupri.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_cpupri.h | 10 +++
kernel/sched_rt.c | 52 ++------------
5 files changed, 210 insertions(+), 43 deletions(-)
diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..a822706 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_SMP) += sched_cpupri.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 0eced8c..6f24aa0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
#include <asm/tlb.h>
+#include "sched_cpupri.h"
+
/*
* Scheduler clock - returns current time in nanosec units.
* This is default implementation.
@@ -6842,6 +6844,8 @@ void __init sched_init(void)
fair_sched_class.next = &idle_sched_class;
idle_sched_class.next = NULL;
+ cpupri_init();
+
for_each_possible_cpu(i) {
struct rt_prio_array *array;
struct rq *rq;
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 0000000..e6280b1
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,186 @@
+/*
+ * kernel/sched_cpupri.c
+ *
+ * CPU priority management
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Author: Gregory Haskins <ghaskins@novell.com>
+ *
+ * This code tracks the priority of each CPU so that global migration
+ * decisions are easy to calculate. Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ * going from the lowest priority to the highest. CPUs in the INVALID state
+ * are not eligible for routing. The system maintains this state with
+ * a 2 dimensional bitmap (the first for priority class, the second for cpus
+ * in that class). Therefore a typical application without affinity
+ * restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ * searches). For tasks with affinity restrictions, the algorithm has a
+ * worst case complexity of O(min(102, NR_CPUS)), though the scenario that
+ * yields the worst case search is fairly contrived.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include "sched_cpupri.h"
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+#define CPUPRI_NR_PRI_WORDS CPUPRI_NR_PRIORITIES/BITS_PER_LONG
+
+#define CPUPRI_INVALID -1
+#define CPUPRI_IDLE 0
+#define CPUPRI_NORMAL 1
+/* values 2-101 are RT priorities 0-99 */
+
+struct pri_vec
+{
+ raw_spinlock_t lock;
+ cpumask_t mask;
+};
+
+struct cpu_priority {
+ struct pri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
+ long pri_active[CPUPRI_NR_PRI_WORDS];
+ int cpu_to_pri[NR_CPUS];
+};
+
+static __cacheline_aligned_in_smp struct cpu_priority cpu_priority;
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+ int cpupri;
+
+ if (prio == MAX_PRIO)
+ cpupri = CPUPRI_IDLE;
+ else if (prio >= MAX_RT_PRIO)
+ cpupri = CPUPRI_NORMAL;
+ else
+ cpupri = MAX_RT_PRIO - prio + 1;
+
+ return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx) \
+ for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES); \
+ idx < CPUPRI_NR_PRIORITIES; \
+ idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invokation. By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times. While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)bool - CPUs were found
+ */
+int cpupri_find(struct task_struct *p, cpumask_t *lowest_mask)
+{
+ int idx = 0;
+ struct cpu_priority *cp = &cpu_priority;
+ int task_pri = convert_prio(p->prio);
+
+ for_each_cpupri_active(cp->pri_active, idx) {
+ struct pri_vec *vec = &cp->pri_to_cpu[idx];
+ cpumask_t mask;
+
+ if (idx >= task_pri)
+ break;
+
+ cpus_and(mask, p->cpus_allowed, vec->mask);
+
+ if (cpus_empty(mask))
+ continue;
+
+ *lowest_mask = mask;
+ return 1;
+ }
+
+ return 0;
+}
+
+/**
+ * cpupri_set - update the cpu priority setting
+ * @cpu: The target cpu
+ * @pri: The priority (INVALID-RT99) to assign to this CPU
+ *
+ * Note: Assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpupri_set(int cpu, int newpri)
+{
+ struct cpu_priority *cp = &cpu_priority;
+ int *currpri = &cp->cpu_to_pri[cpu];
+ int oldpri = *currpri;
+ unsigned long flags;
+
+ newpri = convert_prio(newpri);
+
+ if (newpri == oldpri)
+ return;
+
+ /*
+ * If the cpu was currently mapped to a different value, we
+ * first need to unmap the old value
+ */
+ if (likely(oldpri != CPUPRI_INVALID)) {
+ struct pri_vec *vec = &cp->pri_to_cpu[oldpri];
+
+ spin_lock_irqsave(&vec->lock, flags);
+
+ cpu_clear(cpu, vec->mask);
+ if (cpus_empty(vec->mask))
+ clear_bit(oldpri, cp->pri_active);
+
+ spin_unlock_irqrestore(&vec->lock, flags);
+ }
+
+ if (likely(newpri != CPUPRI_INVALID)) {
+ struct pri_vec *vec = &cp->pri_to_cpu[newpri];
+
+ spin_lock_irqsave(&vec->lock, flags);
+
+ cpu_set(cpu, vec->mask);
+ set_bit(newpri, cp->pri_active);
+
+ spin_unlock_irqrestore(&vec->lock, flags);
+ }
+
+ *currpri = newpri;
+}
+
+/**
+ * cpupri_init - initialize the cpupri subsystem
+ *
+ * This must be called during the scheduler initialization before the
+ * other methods may be used.
+ *
+ * Returns: (void)
+ */
+void cpupri_init(void)
+{
+ struct cpu_priority *cp = &cpu_priority;
+ int i;
+
+ memset(cp, 0, sizeof(*cp));
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
+ spin_lock_init(&cp->pri_to_cpu[i].lock);
+
+ for_each_possible_cpu(i)
+ cp->cpu_to_pri[i] = CPUPRI_INVALID;
+}
+
+
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
new file mode 100644
index 0000000..8cdd15d
--- /dev/null
+++ b/kernel/sched_cpupri.h
@@ -0,0 +1,10 @@
+#ifndef _LINUX_CPUPRI_H
+#define _LINUX_CPUPRI_H
+
+#include <linux/sched.h>
+
+int cpupri_find(struct task_struct *p, cpumask_t *lowest_mask);
+void cpupri_set(int cpu, int pri);
+void cpupri_init(void);
+
+#endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 698f4d9..71ae9e6 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -72,8 +72,10 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
WARN_ON(!rt_task(p));
rq->rt.rt_nr_running++;
#ifdef CONFIG_SMP
- if (p->prio < rq->rt.highest_prio)
+ if (p->prio < rq->rt.highest_prio) {
rq->rt.highest_prio = p->prio;
+ cpupri_set(rq->cpu, p->prio);
+ }
if (p->nr_cpus_allowed > 1)
rq->rt.rt_nr_migratory++;
@@ -96,6 +98,7 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
array = &rq->rt.active;
rq->rt.highest_prio =
sched_find_first_bit(array->bitmap);
+ cpupri_set(rq->cpu, rq->rt.highest_prio);
} /* otherwise leave rq->highest prio alone */
} else
rq->rt.highest_prio = MAX_RT_PRIO;
@@ -333,46 +336,6 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
return next;
}
-static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
-{
- int cpu;
- cpumask_t valid_mask;
- int lowest_prio = -1;
- int ret = 0;
-
- cpus_clear(*lowest_mask);
- cpus_and(valid_mask, cpu_online_map, task->cpus_allowed);
-
- /*
- * Scan each rq for the lowest prio.
- */
- for_each_cpu_mask(cpu, valid_mask) {
- struct rq *rq = cpu_rq(cpu);
-
- /* We look for lowest RT prio or non-rt CPU */
- if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- if (ret)
- cpus_clear(*lowest_mask);
- cpu_set(rq->cpu, *lowest_mask);
- return 1;
- }
-
- /* no locking for now */
- if ((rq->rt.highest_prio > task->prio)
- && (rq->rt.highest_prio >= lowest_prio)) {
- if (rq->rt.highest_prio > lowest_prio) {
- /* new low - clear old data */
- lowest_prio = rq->rt.highest_prio;
- cpus_clear(*lowest_mask);
- }
- cpu_set(rq->cpu, *lowest_mask);
- ret = 1;
- }
- }
-
- return ret;
-}
-
static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
{
int first;
@@ -394,9 +357,12 @@ static int find_lowest_rq(struct task_struct *task)
cpumask_t lowest_mask;
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);
+
+ if (task->nr_cpus_allowed == 1)
+ return -1; /* No other targets possible */
- if (!find_lowest_cpus(task, &lowest_mask))
- return -1;
+ if (!cpupri_find(task, &lowest_mask))
+ return -1; /* No better targets found */
/*
* At this point we have built a mask of cpus representing the
^ permalink raw reply related [flat|nested] 10+ messages in thread
Thread overview: 10+ messages
2007-11-05 23:45 [PATCH 0/8] RT: scheduler migration/wakeup enhancements Gregory Haskins
2007-11-05 23:45 ` [PATCH 1/8] RT: Consistency cleanup for this_rq usage Gregory Haskins
2007-11-05 23:45 ` [PATCH 2/8] RT: Remove some CFS specific code from the wakeup path of RT tasks Gregory Haskins
2007-11-05 23:45 ` [PATCH 3/8] RT: Break out the search function Gregory Haskins
2007-11-05 23:45 ` [PATCH 4/8] RT: Allow current_cpu to be included in search Gregory Haskins
2007-11-05 23:45 ` [PATCH 5/8] RT: Pre-route RT tasks on wakeup Gregory Haskins
2007-11-05 23:45 ` [PATCH 6/8] RT: Optimize our cpu selection based on topology Gregory Haskins
2007-11-05 23:45 ` [PATCH 7/8] RT: Optimize rebalancing Gregory Haskins
2007-11-05 23:45 ` [PATCH 8/8] RT: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins