* [PATCH v4 00/20] New RT Balancing version 4
From: Steven Rostedt @ 2007-11-21 1:00 UTC (permalink / raw)
To: LKML; +Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter
[
Changes since V3:
Updated to git tree 2ffbb8377c7a0713baf6644e285adc27a5654582
Removed cpumask_t from stacks (using per_cpu masks).
Optimized the searching for overloaded queues a bit.
(a lot of work in this area)
Run RT balance logic on waking of new tasks.
The tarball of these patches is also available at
http://rostedt.homelinux.com/rt/rt-balance-patches-v4.tar.bz2
]
Currently in mainline the balancing of multiple RT threads is quite broken.
That is to say, a high priority thread that is scheduled on a CPU already
running an even higher priority thread may end up waiting there unnecessarily,
even though it could easily run on another CPU that is only running a
lower priority thread.
Balancing (or migrating) tasks in general is an art. Lots of considerations
must be taken into account: cache lines, NUMA, and more. That is true for
general processes, which expect high throughput and whose migration can be
done in batch. But when it comes to RT tasks, we really need to move them
to a CPU they can run on as soon as possible, even if it means a bit of
cache line flushing.
Right now an RT task can wait several milliseconds, and perhaps even longer,
before it gets scheduled to run. The migration thread is simply not fast
enough to take care of RT tasks.
To demonstrate this, I wrote a simple test.
http://rostedt.homelinux.com/rt/rt-migrate-test.c
(gcc -o rt-migrate-test rt-migrate-test.c -lpthread)
This test takes a parameter specifying the number of threads to create.
If you add the '-c' (check) option, it will terminate as soon as one of
the iterations fails. When using '-c', pass in one more thread than the
number of CPUs.
For example, on a 4 way box, I used
rt-migrate-test -c 5
What this test does is to create the number of threads specified (in this
case 5). Each thread is set as an RT FIFO task starting at a specified
prio (default 2) and each thread being one priority higher. So with this
example the 5 threads created are at priorities 2, 3, 4, 5, and 6.
The parent thread sets its priority to one higher than the highest of
the children (7 in this example). It uses pthread_barrier_wait to synchronize
the threads. Then it takes a time stamp and starts all the threads.
When the threads wake up, each takes a time stamp and compares it to the
parent's time stamp to see how long it took to be awoken. It then runs in
a busy loop for an interval (20ms by default). The busy loop ends when the
current time reaches the interval delta from the parent's start time stamp,
so if the thread is preempted, it may not actually run for the full interval.
This is expected behavior of the test.
The numbers recorded are: the delta from the thread's wake-up time stamp
to the parent's time stamp, the number of iterations the busy loop ran for,
and the delta from a time stamp taken at the end of the loop to the
parent's time stamp.
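For reference, here is a simplified sketch of the per-thread measurement
logic (this is not the actual rt-migrate-test source; the nsec() helper,
the globals, and the result recording are made up for brevity):

#include <pthread.h>
#include <time.h>

#define INTERVAL_NS	20000000ULL		/* 20ms default run interval */

static pthread_barrier_t start_barrier;		/* parent + all children */
static unsigned long long parent_start_ns;	/* parent's start time stamp */

static unsigned long long nsec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void *test_thread(void *arg)
{
	unsigned long long wake_ns, end_ns;
	unsigned long loops = 0;

	/* the parent stamps parent_start_ns and then releases everyone here */
	pthread_barrier_wait(&start_barrier);

	wake_ns = nsec();	/* wake_ns - parent_start_ns = wake-up latency */

	/*
	 * Busy loop until INTERVAL_NS past the parent's stamp.  A thread
	 * that gets preempted simply completes fewer loops; it does not
	 * run longer.
	 */
	do {
		loops++;
		end_ns = nsec();
	} while (end_ns - parent_start_ns < INTERVAL_NS);

	/* record wake_ns - parent_start_ns, loops, end_ns - parent_start_ns */
	return (void *)loops;
}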
Sometimes a lower priority task might wake up before a higher priority one,
but this is OK, as long as the higher priority task gets the CPU when
it is awoken.
At the end of the test, the iteration data is printed to stdout. If a
higher priority task had to wait for a lower one to finish running, then
this is considered a failure. Here's an example of the output from
a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.
1: 36 33 20041 39 33
len: 20036 20033 40041 20039 20033
loops: 167789 167693 227167 167829 167814
On iteration 1 (iterations start at 0) the third task only started 20ms after
the parent woke it up. We can see here that the first two tasks ran to
completion before the higher priority task was even able to start. That is a
20ms latency for the higher priority task!!!
So people who think that their audio would lose most of its latency just by
upping its priority may be in for a surprise, since some kernel threads (like
the migration thread itself) can cause this latency.
To solve this issue, I've changed the RT task balancing from a passive
method (migration thread) to an active method. This new method is
to actively push or pull RT tasks when they are woken up or scheduled.
On wake up of an RT task, if there is already an RT task of higher priority
running on its runqueue, we initiate the push_rt_tasks algorithm. This
algorithm looks at the highest priority RT task that is not running and
tries to find a CPU it can run on. It only migrates the RT task if it finds
a CPU (the lowest priority one within the task's affinity) where the RT task
can run and can preempt the currently running task on that CPU.
We continue pushing RT tasks until we can't push any more.
If an RT task fails to be migrated we stop pushing. This works because we
are always looking at the highest priority queued RT task on the run queue;
if it can't migrate, then most likely the lower priority RT tasks can not
either.
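In rough pseudo-C, the push side boils down to the loop below. This is only
a condensed sketch of the logic just described, with all the locking and
retry handling omitted; the real implementation is push_rt_task() and
find_lock_lowest_rq() in patch 03:

	for (;;) {
		/* highest priority RT task that is queued but not running */
		next = pick_next_highest_task_rt(rq);
		if (!next)
			break;

		/*
		 * Among the CPUs in next->cpus_allowed, find the runqueue
		 * with the lowest priority (a CPU with no RT task at all
		 * is always good enough).  It must be of lower priority
		 * than next, or there is nothing we can preempt.
		 */
		lowest_rq = find_lock_lowest_rq(next, rq);
		if (!lowest_rq)
			break;		/* nothing to preempt: stop pushing */

		deactivate_task(rq, next, 0);
		set_task_cpu(next, lowest_rq->cpu);
		activate_task(lowest_rq, next, 0);
		resched_task(lowest_rq->curr);
	}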
There is one case that is not covered by this patch set: when the highest
priority non-running RT task is restricted by its CPU affinity to CPUs where
it can not preempt any of the currently running tasks, while a lower priority
queued task has a wider affinity and could run elsewhere. In that case the
lower priority task will not be pushed to those CPUs (although those CPUs
may still pull it). Currently this patch set ignores this scenario.
Another place where we push RT tasks is in finish_task_switch. This is done
because a running RT task can not be migrated while it is running. So when
an RT task is preempted by a higher priority RT task, that is the moment we
can migrate the task being preempted.
We also actively pull RT tasks. Whenever a runqueue is about to lower its
priority (i.e. schedule a lower priority task), a check is done to see if
that runqueue can pull RT tasks to run instead. All overloaded runqueues
(runqueues with more than one RT task queued) are searched for an RT task
that can run on this runqueue (its affinity matches) and is of higher
priority than the task the runqueue is about to schedule. The pull algorithm
pulls all RT tasks that match this case.
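The pull side, again as a condensed sketch (locking and the recheck of the
next task are omitted) of what pull_rt_task() in patch 05 does:

	if (!rt_overloaded())
		return;

	rto_mask = rt_overload();
	for_each_cpu_mask(cpu, *rto_mask) {
		src_rq = cpu_rq(cpu);
		if (cpu == this_rq->cpu || src_rq->rt.rt_nr_running <= 1)
			continue;

		/* highest queued RT task on src_rq that may run on this CPU */
		p = pick_next_highest_task_rt(src_rq, this_rq->cpu);

		/* pull it only if it beats what we are about to schedule */
		if (p && p->prio < this_rq->rt.highest_prio) {
			deactivate_task(src_rq, p, 0);
			set_task_cpu(p, this_rq->cpu);
			activate_task(this_rq, p, 0);
		}
	}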
With this patch set, I ran the rt-migrate-test overnight in a while
loop with the -c option (which terminates upon failure) and it passed
over 6500 tests (each doing 50 wakeups).
Here's an example of the output from the patched kernel. This is just to
explain it a bit more.
1: 20060 61 55 56 61
len: 40060 20061 20055 20056 20061
loops: 227255 146050 168104 145965 168144
2: 40 46 31 35 42
len: 20040 20046 20031 20035 20042
loops: 28 167769 167781 167668 167804
The first iteration shown (really the 2nd, since we start counting at zero)
is a typical run. The lowest prio task didn't start executing until the
other 4 tasks had finished and given up the CPU.
The second iteration shown seems at first like a failure, but it is actually
fine. The lowest priority task just happened to get scheduled onto a CPU
before the higher priority tasks were woken up. As you can see from this
example, the higher priority tasks were still able to get scheduled right
away, and in doing so preempted the lower priority task. This can be seen
from the number of loops the lower priority task was able to complete:
only 28. The busy loop terminates when the current time reaches the run
interval (20ms here) past the start time stamp. Since the lower priority
task was able to sneak in and start, its time stamp was taken early; so
after it got preempted and rescheduled, it was already past the run interval
and it simply ended the loop.
Finally, the CFS RT balancing had to be removed in order for this to work.
Testing showed that the CFS RT balancing would actually pull RT tasks away
from runqueues they had already been correctly placed on, and again cause
latencies. With this new approach, the CFS RT balancing is not needed,
and I suggest that these patches replace the current CFS RT balancing.
Also, let me stress that I made a great effort to keep the overhead for the
non-RT case as small as possible (practically none). Most of these
algorithms only kick in when more than one RT task is queued on the same
runqueue.
Special thanks go to Gregory Haskins for his advice and the back-and-forth
exchange of ideas on IRC. Although I didn't use his actual patches (his were
against -rt), they did help me clean up some of my code. Thanks also go to
Ingo Molnar; some ideas here were taken from his RT balance code in the
-rt patch.
Comments welcomed!
-- Steve
* [PATCH v4 01/20] Add rt_nr_running accounting
From: Steven Rostedt @ 2007-11-21 1:00 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: count-rt-queued.patch --]
[-- Type: text/plain, Size: 1887 bytes --]
This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 1 +
kernel/sched_rt.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:52:44.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:52:50.000000000 -0500
@@ -266,6 +266,7 @@ struct rt_rq {
struct rt_prio_array active;
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
+ unsigned long rt_nr_running;
};
/*
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:52:44.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:52:50.000000000 -0500
@@ -25,12 +25,27 @@ static void update_curr_rt(struct rq *rq
curr->se.exec_start = rq->clock;
}
+static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+ WARN_ON(!rt_task(p));
+ rq->rt.rt_nr_running++;
+}
+
+static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+ WARN_ON(!rt_task(p));
+ WARN_ON(!rq->rt.rt_nr_running);
+ rq->rt.rt_nr_running--;
+}
+
static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
{
struct rt_prio_array *array = &rq->rt.active;
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+
+ inc_rt_tasks(p, rq);
}
/*
@@ -45,6 +60,8 @@ static void dequeue_task_rt(struct rq *r
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+
+ dec_rt_tasks(p, rq);
}
/*
--
* [PATCH v4 02/20] track highest prio queued on runqueue
From: Steven Rostedt @ 2007-11-21 1:00 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: add-rq-highest-prio.patch --]
[-- Type: text/plain, Size: 2375 bytes --]
This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.
This information will be used for later patches.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 3 +++
kernel/sched_rt.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:52:50.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:52:55.000000000 -0500
@@ -267,6 +267,8 @@ struct rt_rq {
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
unsigned long rt_nr_running;
+ /* highest queued rt task prio */
+ int highest_prio;
};
/*
@@ -6776,6 +6778,7 @@ void __init sched_init(void)
rq->cpu = i;
rq->migration_thread = NULL;
INIT_LIST_HEAD(&rq->migration_queue);
+ rq->rt.highest_prio = MAX_RT_PRIO;
#endif
atomic_set(&rq->nr_iowait, 0);
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:52:50.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:52:55.000000000 -0500
@@ -29,6 +29,10 @@ static inline void inc_rt_tasks(struct t
{
WARN_ON(!rt_task(p));
rq->rt.rt_nr_running++;
+#ifdef CONFIG_SMP
+ if (p->prio < rq->rt.highest_prio)
+ rq->rt.highest_prio = p->prio;
+#endif /* CONFIG_SMP */
}
static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -36,6 +40,20 @@ static inline void dec_rt_tasks(struct t
WARN_ON(!rt_task(p));
WARN_ON(!rq->rt.rt_nr_running);
rq->rt.rt_nr_running--;
+#ifdef CONFIG_SMP
+ if (rq->rt.rt_nr_running) {
+ struct rt_prio_array *array;
+
+ WARN_ON(p->prio < rq->rt.highest_prio);
+ if (p->prio == rq->rt.highest_prio) {
+ /* recalculate */
+ array = &rq->rt.active;
+ rq->rt.highest_prio =
+ sched_find_first_bit(array->bitmap);
+ } /* otherwise leave rq->highest prio alone */
+ } else
+ rq->rt.highest_prio = MAX_RT_PRIO;
+#endif /* CONFIG_SMP */
}
static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
--
* [PATCH v4 03/20] push RT tasks
From: Steven Rostedt @ 2007-11-21 1:00 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-balance-push-tasks.patch --]
[-- Type: text/plain, Size: 9450 bytes --]
This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.
When more than one RT task is queued on a run queue, this algorithm takes
an assertive approach and pushes the RT tasks that are not running onto other
run queues of lower priority. It works by looking at the highest priority
RT task that is not running and examining the runqueues of the CPUs in that
task's affinity mask. We find the runqueue with the lowest prio among those
CPUs, and if its prio is lower than that of the picked task, we push the task
onto that CPU's runqueue.
We continue pushing RT tasks off the current runqueue until we don't push any
more. The algorithm stops when the next highest RT task can't preempt any
task on the other CPUs.
TODO: The algorithm may stop while there are still RT tasks that could be
migrated. Specifically, if the highest priority non-running RT task's CPU
affinity is restricted to CPUs that are all running higher priority tasks,
there may be a lower priority task queued whose affinity includes a CPU
that is running an even lower priority task and to which it could be
migrated. This patch set does not address this issue.
Note: checkpatch reveals two over 80 character instances. I'm not sure
that breaking them up will help visually, so I left them as is.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 8 +
kernel/sched_rt.c | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 231 insertions(+), 2 deletions(-)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:52:55.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:52:56.000000000 -0500
@@ -1877,6 +1877,8 @@ static void finish_task_switch(struct rq
prev_state = prev->state;
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
+ schedule_tail_balance_rt(rq);
+
fire_sched_in_preempt_notifiers(current);
if (mm)
mmdrop(mm);
@@ -2110,11 +2112,13 @@ static void double_rq_unlock(struct rq *
/*
* double_lock_balance - lock the busiest runqueue, this_rq is locked already.
*/
-static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
__releases(this_rq->lock)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
+ int ret = 0;
+
if (unlikely(!irqs_disabled())) {
/* printk() doesn't work good under rq->lock */
spin_unlock(&this_rq->lock);
@@ -2125,9 +2129,11 @@ static void double_lock_balance(struct r
spin_unlock(&this_rq->lock);
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
+ ret = 1;
} else
spin_lock(&busiest->lock);
}
+ return ret;
}
/*
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:52:55.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:52:56.000000000 -0500
@@ -134,6 +134,227 @@ static void put_prev_task_rt(struct rq *
}
#ifdef CONFIG_SMP
+/* Only try algorithms three times */
+#define RT_MAX_TRIES 3
+
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
+
+/* Return the second highest RT task, NULL otherwise */
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+{
+ struct rt_prio_array *array = &rq->rt.active;
+ struct task_struct *next;
+ struct list_head *queue;
+ int idx;
+
+ assert_spin_locked(&rq->lock);
+
+ if (likely(rq->rt.rt_nr_running < 2))
+ return NULL;
+
+ idx = sched_find_first_bit(array->bitmap);
+ if (unlikely(idx >= MAX_RT_PRIO)) {
+ WARN_ON(1); /* rt_nr_running is bad */
+ return NULL;
+ }
+
+ queue = array->queue + idx;
+ next = list_entry(queue->next, struct task_struct, run_list);
+ if (unlikely(next != rq->curr))
+ return next;
+
+ if (queue->next->next != queue) {
+ /* same prio task */
+ next = list_entry(queue->next->next, struct task_struct, run_list);
+ return next;
+ }
+
+ /* slower, but more flexible */
+ idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+ if (unlikely(idx >= MAX_RT_PRIO)) {
+ WARN_ON(1); /* rt_nr_running was 2 and above! */
+ return NULL;
+ }
+
+ queue = array->queue + idx;
+ next = list_entry(queue->next, struct task_struct, run_list);
+
+ return next;
+}
+
+static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+ struct rq *this_rq)
+{
+ struct rq *lowest_rq = NULL;
+ int cpu;
+ int tries;
+ cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask);
+
+ cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed);
+
+ for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+ /*
+ * Scan each rq for the lowest prio.
+ */
+ for_each_cpu_mask(cpu, *cpu_mask) {
+ struct rq *rq = &per_cpu(runqueues, cpu);
+
+ if (cpu == this_rq->cpu)
+ continue;
+
+ /* We look for lowest RT prio or non-rt CPU */
+ if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+ lowest_rq = rq;
+ break;
+ }
+
+ /* no locking for now */
+ if (rq->rt.highest_prio > task->prio &&
+ (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+ lowest_rq = rq;
+ }
+ }
+
+ if (!lowest_rq)
+ break;
+
+ /* if the prio of this runqueue changed, try again */
+ if (double_lock_balance(this_rq, lowest_rq)) {
+ /*
+ * We had to unlock the run queue. In
+ * the mean time, task could have
+ * migrated already or had its affinity changed.
+ * Also make sure that it wasn't scheduled on its rq.
+ */
+ if (unlikely(task_rq(task) != this_rq ||
+ !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
+ task_running(this_rq, task) ||
+ !task->se.on_rq)) {
+ spin_unlock(&lowest_rq->lock);
+ lowest_rq = NULL;
+ break;
+ }
+ }
+
+ /* If this rq is still suitable use it. */
+ if (lowest_rq->rt.highest_prio > task->prio)
+ break;
+
+ /* try again */
+ spin_unlock(&lowest_rq->lock);
+ lowest_rq = NULL;
+ }
+
+ return lowest_rq;
+}
+
+/*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+ struct task_struct *next_task;
+ struct rq *lowest_rq;
+ int ret = 0;
+ int paranoid = RT_MAX_TRIES;
+
+ assert_spin_locked(&this_rq->lock);
+
+ next_task = pick_next_highest_task_rt(this_rq);
+ if (!next_task)
+ return 0;
+
+ retry:
+ if (unlikely(next_task == this_rq->curr))
+ return 0;
+
+ /*
+ * It's possible that the next_task slipped in of
+ * higher priority than current. If that's the case
+ * just reschedule current.
+ */
+ if (unlikely(next_task->prio < this_rq->curr->prio)) {
+ resched_task(this_rq->curr);
+ return 0;
+ }
+
+ /* We might release this_rq lock */
+ get_task_struct(next_task);
+
+ /* find_lock_lowest_rq locks the rq if found */
+ lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+ if (!lowest_rq) {
+ struct task_struct *task;
+ /*
+ * find lock_lowest_rq releases this_rq->lock
+ * so it is possible that next_task has changed.
+ * If it has, then try again.
+ */
+ task = pick_next_highest_task_rt(this_rq);
+ if (unlikely(task != next_task) && task && paranoid--) {
+ put_task_struct(next_task);
+ next_task = task;
+ goto retry;
+ }
+ goto out;
+ }
+
+ assert_spin_locked(&lowest_rq->lock);
+
+ deactivate_task(this_rq, next_task, 0);
+ set_task_cpu(next_task, lowest_rq->cpu);
+ activate_task(lowest_rq, next_task, 0);
+
+ resched_task(lowest_rq->curr);
+
+ spin_unlock(&lowest_rq->lock);
+
+ ret = 1;
+out:
+ put_task_struct(next_task);
+
+ return ret;
+}
+
+/*
+ * TODO: Currently we just use the second highest prio task on
+ * the queue, and stop when it can't migrate (or there's
+ * no more RT tasks). There may be a case where a lower
+ * priority RT task has a different affinity than the
+ * higher RT task. In this case the lower RT task could
+ * possibly be able to migrate where as the higher priority
+ * RT task could not. We currently ignore this issue.
+ * Enhancements are welcome!
+ */
+static void push_rt_tasks(struct rq *rq)
+{
+ /* push_rt_task will return true if it moved an RT */
+ while (push_rt_task(rq))
+ ;
+}
+
+static void schedule_tail_balance_rt(struct rq *rq)
+{
+ /*
+ * If we have more than one rt_task queued, then
+ * see if we can push the other rt_tasks off to other CPUS.
+ * Note we may release the rq lock, and since
+ * the lock was owned by prev, we need to release it
+ * first via finish_lock_switch and then reaquire it here.
+ */
+ if (unlikely(rq->rt.rt_nr_running > 1)) {
+ spin_lock_irq(&rq->lock);
+ push_rt_tasks(rq);
+ spin_unlock_irq(&rq->lock);
+ }
+}
+
/*
* Load-balancing iterator. Note: while the runqueue stays locked
* during the whole iteration, the current task might be
@@ -238,7 +459,9 @@ move_one_task_rt(struct rq *this_rq, int
return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
&rt_rq_iterator);
}
-#endif
+#else /* CONFIG_SMP */
+# define schedule_tail_balance_rt(rq) do { } while (0)
+#endif /* CONFIG_SMP */
static void task_tick_rt(struct rq *rq, struct task_struct *p)
{
--
* [PATCH v4 04/20] RT overloaded runqueues accounting
From: Steven Rostedt @ 2007-11-21 1:00 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-overload.patch --]
[-- Type: text/plain, Size: 2060 bytes --]
This patch adds an RT overload accounting system. When a runqueue has more
than one RT task queued, it is marked as overloaded; that is, it becomes a
candidate to have RT tasks pulled from it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:52:56.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:52:57.000000000 -0500
@@ -3,6 +3,38 @@
* policies)
*/
+#ifdef CONFIG_SMP
+static cpumask_t rt_overload_mask;
+static atomic_t rto_count;
+static inline int rt_overloaded(void)
+{
+ return atomic_read(&rto_count);
+}
+static inline cpumask_t *rt_overload(void)
+{
+ return &rt_overload_mask;
+}
+static inline void rt_set_overload(struct rq *rq)
+{
+ cpu_set(rq->cpu, rt_overload_mask);
+ /*
+ * Make sure the mask is visible before we set
+ * the overload count. That is checked to determine
+ * if we should look at the mask. It would be a shame
+ * if we looked at the mask, but the mask was not
+ * updated yet.
+ */
+ wmb();
+ atomic_inc(&rto_count);
+}
+static inline void rt_clear_overload(struct rq *rq)
+{
+ /* the order here really doesn't matter */
+ atomic_dec(&rto_count);
+ cpu_clear(rq->cpu, rt_overload_mask);
+}
+#endif /* CONFIG_SMP */
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -32,6 +64,8 @@ static inline void inc_rt_tasks(struct t
#ifdef CONFIG_SMP
if (p->prio < rq->rt.highest_prio)
rq->rt.highest_prio = p->prio;
+ if (rq->rt.rt_nr_running > 1)
+ rt_set_overload(rq);
#endif /* CONFIG_SMP */
}
@@ -53,6 +87,8 @@ static inline void dec_rt_tasks(struct t
} /* otherwise leave rq->highest prio alone */
} else
rq->rt.highest_prio = MAX_RT_PRIO;
+ if (rq->rt.rt_nr_running < 2)
+ rt_clear_overload(rq);
#endif /* CONFIG_SMP */
}
--
* [PATCH v4 05/20] pull RT tasks
From: Steven Rostedt @ 2007-11-21 1:00 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-balance-pull-tasks.patch --]
[-- Type: text/plain, Size: 7868 bytes --]
This patch adds the algorithm to pull tasks from RT overloaded runqueues.
When an RT pull is initiated, all overloaded runqueues are examined for an
RT task that is higher in prio than the highest prio task queued on the
target runqueue. If such a task is found, it is pulled to the target
runqueue.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 2
kernel/sched_rt.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 178 insertions(+), 11 deletions(-)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:52:56.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:52:59.000000000 -0500
@@ -3646,6 +3646,8 @@ need_resched_nonpreemptible:
switch_count = &prev->nvcsw;
}
+ schedule_balance_rt(rq, prev);
+
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:52:57.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:52:59.000000000 -0500
@@ -176,8 +176,17 @@ static void put_prev_task_rt(struct rq *
static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_running(rq, p) &&
+ (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+ return 1;
+ return 0;
+}
+
/* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
+ int cpu)
{
struct rt_prio_array *array = &rq->rt.active;
struct task_struct *next;
@@ -196,26 +205,36 @@ static struct task_struct *pick_next_hig
}
queue = array->queue + idx;
+ BUG_ON(list_empty(queue));
+
next = list_entry(queue->next, struct task_struct, run_list);
- if (unlikely(next != rq->curr))
- return next;
+ if (unlikely(pick_rt_task(rq, next, cpu)))
+ goto out;
if (queue->next->next != queue) {
/* same prio task */
next = list_entry(queue->next->next, struct task_struct, run_list);
- return next;
+ if (pick_rt_task(rq, next, cpu))
+ goto out;
}
+ retry:
/* slower, but more flexible */
idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
- if (unlikely(idx >= MAX_RT_PRIO)) {
- WARN_ON(1); /* rt_nr_running was 2 and above! */
+ if (unlikely(idx >= MAX_RT_PRIO))
return NULL;
- }
queue = array->queue + idx;
- next = list_entry(queue->next, struct task_struct, run_list);
+ BUG_ON(list_empty(queue));
+
+ list_for_each_entry(next, queue, run_list) {
+ if (pick_rt_task(rq, next, cpu))
+ goto out;
+ }
+
+ goto retry;
+ out:
return next;
}
@@ -302,13 +321,15 @@ static int push_rt_task(struct rq *this_
assert_spin_locked(&this_rq->lock);
- next_task = pick_next_highest_task_rt(this_rq);
+ next_task = pick_next_highest_task_rt(this_rq, -1);
if (!next_task)
return 0;
retry:
- if (unlikely(next_task == this_rq->curr))
+ if (unlikely(next_task == this_rq->curr)) {
+ WARN_ON(1);
return 0;
+ }
/*
* It's possible that the next_task slipped in of
@@ -332,7 +353,7 @@ static int push_rt_task(struct rq *this_
* so it is possible that next_task has changed.
* If it has, then try again.
*/
- task = pick_next_highest_task_rt(this_rq);
+ task = pick_next_highest_task_rt(this_rq, -1);
if (unlikely(task != next_task) && task && paranoid--) {
put_task_struct(next_task);
next_task = task;
@@ -375,6 +396,149 @@ static void push_rt_tasks(struct rq *rq)
;
}
+static int pull_rt_task(struct rq *this_rq)
+{
+ struct task_struct *next;
+ struct task_struct *p;
+ struct rq *src_rq;
+ cpumask_t *rto_cpumask;
+ int this_cpu = this_rq->cpu;
+ int cpu;
+ int ret = 0;
+
+ assert_spin_locked(&this_rq->lock);
+
+ /*
+ * If cpusets are used, and we have overlapping
+ * run queue cpusets, then this algorithm may not catch all.
+ * This is just the price you pay on trying to keep
+ * dirtying caches down on large SMP machines.
+ */
+ if (likely(!rt_overloaded()))
+ return 0;
+
+ next = pick_next_task_rt(this_rq);
+
+ rto_cpumask = rt_overload();
+
+ for_each_cpu_mask(cpu, *rto_cpumask) {
+ if (this_cpu == cpu)
+ continue;
+
+ src_rq = cpu_rq(cpu);
+ if (unlikely(src_rq->rt.rt_nr_running <= 1)) {
+ /*
+ * It is possible that overlapping cpusets
+ * will miss clearing a non overloaded runqueue.
+ * Clear it now.
+ */
+ if (double_lock_balance(this_rq, src_rq)) {
+ /* unlocked our runqueue lock */
+ struct task_struct *old_next = next;
+ next = pick_next_task_rt(this_rq);
+ if (next != old_next)
+ ret = 1;
+ }
+ if (likely(src_rq->rt.rt_nr_running <= 1))
+ /*
+ * Small chance that this_rq->curr changed
+ * but it's really harmless here.
+ */
+ rt_clear_overload(this_rq);
+ else
+ /*
+ * Heh, the src_rq is now overloaded, since
+ * we already have the src_rq lock, go straight
+ * to pulling tasks from it.
+ */
+ goto try_pulling;
+ spin_unlock(&src_rq->lock);
+ continue;
+ }
+
+ /*
+ * We can potentially drop this_rq's lock in
+ * double_lock_balance, and another CPU could
+ * steal our next task - hence we must cause
+ * the caller to recalculate the next task
+ * in that case:
+ */
+ if (double_lock_balance(this_rq, src_rq)) {
+ struct task_struct *old_next = next;
+ next = pick_next_task_rt(this_rq);
+ if (next != old_next)
+ ret = 1;
+ }
+
+ /*
+ * Are there still pullable RT tasks?
+ */
+ if (src_rq->rt.rt_nr_running <= 1) {
+ spin_unlock(&src_rq->lock);
+ continue;
+ }
+
+ try_pulling:
+ p = pick_next_highest_task_rt(src_rq, this_cpu);
+
+ /*
+ * Do we have an RT task that preempts
+ * the to-be-scheduled task?
+ */
+ if (p && (!next || (p->prio < next->prio))) {
+ WARN_ON(p == src_rq->curr);
+ WARN_ON(!p->se.on_rq);
+
+ /*
+ * There's a chance that p is higher in priority
+ * than what's currently running on its cpu.
+ * This is just that p is wakeing up and hasn't
+ * had a chance to schedule. We only pull
+ * p if it is lower in priority than the
+ * current task on the run queue or
+ * this_rq next task is lower in prio than
+ * the current task on that rq.
+ */
+ if (p->prio < src_rq->curr->prio ||
+ (next && next->prio < src_rq->curr->prio))
+ goto bail;
+
+ ret = 1;
+
+ deactivate_task(src_rq, p, 0);
+ set_task_cpu(p, this_cpu);
+ activate_task(this_rq, p, 0);
+ /*
+ * We continue with the search, just in
+ * case there's an even higher prio task
+ * in another runqueue. (low likelyhood
+ * but possible)
+ */
+
+ /*
+ * Update next so that we won't pick a task
+ * on another cpu with a priority lower (or equal)
+ * than the one we just picked.
+ */
+ next = p;
+
+ }
+ bail:
+ spin_unlock(&src_rq->lock);
+ }
+
+ return ret;
+}
+
+static void schedule_balance_rt(struct rq *rq,
+ struct task_struct *prev)
+{
+ /* Try to pull RT tasks here if we lower this rq's prio */
+ if (unlikely(rt_task(prev)) &&
+ rq->rt.highest_prio > prev->prio)
+ pull_rt_task(rq);
+}
+
static void schedule_tail_balance_rt(struct rq *rq)
{
/*
@@ -497,6 +661,7 @@ move_one_task_rt(struct rq *this_rq, int
}
#else /* CONFIG_SMP */
# define schedule_tail_balance_rt(rq) do { } while (0)
+# define schedule_balance_rt(rq, prev) do { } while (0)
#endif /* CONFIG_SMP */
static void task_tick_rt(struct rq *rq, struct task_struct *p)
--
* [PATCH v4 06/20] wake up balance RT
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-balance-wakeup.patch --]
[-- Type: text/plain, Size: 2101 bytes --]
This patch adds pushing of overloaded RT tasks from a runqueue that is
having tasks (most likely RT tasks) woken onto it.
TODO: We don't yet cover the case of waking newly created RT tasks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 3 +++
kernel/sched_rt.c | 10 ++++++++++
2 files changed, 13 insertions(+)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:52:59.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:53:00.000000000 -0500
@@ -22,6 +22,8 @@
* by Peter Williams
* 2007-05-06 Interactivity improvements to CFS by Mike Galbraith
* 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri
+ * 2007-10-22 RT overload balancing by Steven Rostedt
+ * (with thanks to Gregory Haskins)
*/
#include <linux/mm.h>
@@ -1635,6 +1637,7 @@ out_activate:
out_running:
p->state = TASK_RUNNING;
+ wakeup_balance_rt(rq, p);
out:
task_rq_unlock(rq, &flags);
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:52:59.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:00.000000000 -0500
@@ -555,6 +555,15 @@ static void schedule_tail_balance_rt(str
}
}
+
+static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
+{
+ if (unlikely(rt_task(p)) &&
+ !task_running(rq, p) &&
+ (p->prio >= rq->curr->prio))
+ push_rt_tasks(rq);
+}
+
/*
* Load-balancing iterator. Note: while the runqueue stays locked
* during the whole iteration, the current task might be
@@ -662,6 +671,7 @@ move_one_task_rt(struct rq *this_rq, int
#else /* CONFIG_SMP */
# define schedule_tail_balance_rt(rq) do { } while (0)
# define schedule_balance_rt(rq, prev) do { } while (0)
+# define wakeup_balance_rt(rq, p) do { } while (0)
#endif /* CONFIG_SMP */
static void task_tick_rt(struct rq *rq, struct task_struct *p)
--
* [PATCH v4 07/20] disable CFS RT load balancing.
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: disable-CFS-rt-balance.patch --]
[-- Type: text/plain, Size: 3650 bytes --]
Since we now take an active approach to load balancing, we don't need to
balance RT tasks via CFS. In fact, this code was found to pull RT tasks
away from the CPUs that the active push/pull logic had just moved them to,
resulting in large latencies.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 95 ++----------------------------------------------------
1 file changed, 4 insertions(+), 91 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:00.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:01.000000000 -0500
@@ -564,109 +564,22 @@ static void wakeup_balance_rt(struct rq
push_rt_tasks(rq);
}
-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
- struct rq *rq = arg;
- struct rt_prio_array *array = &rq->rt.active;
- struct list_head *head, *curr;
- struct task_struct *p;
- int idx;
-
- idx = sched_find_first_bit(array->bitmap);
- if (idx >= MAX_RT_PRIO)
- return NULL;
-
- head = array->queue + idx;
- curr = head->prev;
-
- p = list_entry(curr, struct task_struct, run_list);
-
- curr = curr->prev;
-
- rq->rt.rt_load_balance_idx = idx;
- rq->rt.rt_load_balance_head = head;
- rq->rt.rt_load_balance_curr = curr;
-
- return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
- struct rq *rq = arg;
- struct rt_prio_array *array = &rq->rt.active;
- struct list_head *head, *curr;
- struct task_struct *p;
- int idx;
-
- idx = rq->rt.rt_load_balance_idx;
- head = rq->rt.rt_load_balance_head;
- curr = rq->rt.rt_load_balance_curr;
-
- /*
- * If we arrived back to the head again then
- * iterate to the next queue (if any):
- */
- if (unlikely(head == curr)) {
- int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-
- if (next_idx >= MAX_RT_PRIO)
- return NULL;
-
- idx = next_idx;
- head = array->queue + idx;
- curr = head->prev;
-
- rq->rt.rt_load_balance_idx = idx;
- rq->rt.rt_load_balance_head = head;
- }
-
- p = list_entry(curr, struct task_struct, run_list);
-
- curr = curr->prev;
-
- rq->rt.rt_load_balance_curr = curr;
-
- return p;
-}
-
static unsigned long
load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
int *all_pinned, int *this_best_prio)
{
- struct rq_iterator rt_rq_iterator;
-
- rt_rq_iterator.start = load_balance_start_rt;
- rt_rq_iterator.next = load_balance_next_rt;
- /* pass 'busiest' rq argument into
- * load_balance_[start|next]_rt iterators
- */
- rt_rq_iterator.arg = busiest;
-
- return balance_tasks(this_rq, this_cpu, busiest, max_load_move, sd,
- idle, all_pinned, this_best_prio, &rt_rq_iterator);
+ /* don't touch RT tasks */
+ return 0;
}
static int
move_one_task_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
struct sched_domain *sd, enum cpu_idle_type idle)
{
- struct rq_iterator rt_rq_iterator;
-
- rt_rq_iterator.start = load_balance_start_rt;
- rt_rq_iterator.next = load_balance_next_rt;
- rt_rq_iterator.arg = busiest;
-
- return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
- &rt_rq_iterator);
+ /* don't touch RT tasks */
+ return 0;
}
#else /* CONFIG_SMP */
# define schedule_tail_balance_rt(rq) do { } while (0)
--
* [PATCH v4 08/20] Cache cpus_allowed weight for optimizing migration
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-balance-cpu-weight.patch --]
[-- Type: text/plain, Size: 7965 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time. Consider, for instance, softirq_timer and softirq_sched. A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run. So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.
This causes two problems in the current code:
1) If a high-priority bound task and a low-priority unbounded task queue
up behind the running task, we will fail to ever relocate the unbounded
task because we terminate the search on the first unmovable task.
2) We spend precious, futile cycles in the fast path trying to pull
overloaded tasks over. It is therefore optimal to strive to avoid this
overhead altogether if we can cheaply detect the condition before
overload even occurs.
This patch tries to achieve this optimization by utilizing the hamming
weight of the task->cpus_allowed mask. A weight of 1 indicates that
the task cannot be migrated. We then utilize this information to
skip non-migratable tasks and to eliminate unnecessary rebalance attempts.
We introduce a per-rq variable to count the number of migratable tasks
that are currently running. We only go into overload if we have more
than one rt task, AND at least one of them is migratable.
In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
include/linux/init_task.h | 1
include/linux/sched.h | 2 +
kernel/fork.c | 1
kernel/sched.c | 9 +++++++-
kernel/sched_rt.c | 50 +++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 57 insertions(+), 6 deletions(-)
Index: linux-compile.git/include/linux/init_task.h
===================================================================
--- linux-compile.git.orig/include/linux/init_task.h 2007-11-20 19:52:44.000000000 -0500
+++ linux-compile.git/include/linux/init_task.h 2007-11-20 19:53:02.000000000 -0500
@@ -130,6 +130,7 @@ extern struct group_info init_groups;
.normal_prio = MAX_PRIO-20, \
.policy = SCHED_NORMAL, \
.cpus_allowed = CPU_MASK_ALL, \
+ .nr_cpus_allowed = NR_CPUS, \
.mm = NULL, \
.active_mm = &init_mm, \
.run_list = LIST_HEAD_INIT(tsk.run_list), \
Index: linux-compile.git/include/linux/sched.h
===================================================================
--- linux-compile.git.orig/include/linux/sched.h 2007-11-20 19:52:44.000000000 -0500
+++ linux-compile.git/include/linux/sched.h 2007-11-20 19:53:02.000000000 -0500
@@ -843,6 +843,7 @@ struct sched_class {
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
+ void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
};
struct load_weight {
@@ -952,6 +953,7 @@ struct task_struct {
unsigned int policy;
cpumask_t cpus_allowed;
+ int nr_cpus_allowed;
unsigned int time_slice;
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
Index: linux-compile.git/kernel/fork.c
===================================================================
--- linux-compile.git.orig/kernel/fork.c 2007-11-20 19:52:44.000000000 -0500
+++ linux-compile.git/kernel/fork.c 2007-11-20 19:53:02.000000000 -0500
@@ -1237,6 +1237,7 @@ static struct task_struct *copy_process(
* parent's CPU). This avoids alot of nasty races.
*/
p->cpus_allowed = current->cpus_allowed;
+ p->nr_cpus_allowed = current->nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
!cpu_online(task_cpu(p))))
set_task_cpu(p, smp_processor_id());
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:53:00.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:53:02.000000000 -0500
@@ -269,6 +269,7 @@ struct rt_rq {
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
unsigned long rt_nr_running;
+ unsigned long rt_nr_migratory;
/* highest queued rt task prio */
int highest_prio;
};
@@ -5070,7 +5071,13 @@ int set_cpus_allowed(struct task_struct
goto out;
}
- p->cpus_allowed = new_mask;
+ if (p->sched_class->set_cpus_allowed)
+ p->sched_class->set_cpus_allowed(p, &new_mask);
+ else {
+ p->cpus_allowed = new_mask;
+ p->nr_cpus_allowed = cpus_weight(new_mask);
+ }
+
/* Can the task run on the task's current CPU? If so, we're done */
if (cpu_isset(task_cpu(p), new_mask))
goto out;
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:01.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:02.000000000 -0500
@@ -33,6 +33,14 @@ static inline void rt_clear_overload(str
atomic_dec(&rto_count);
cpu_clear(rq->cpu, rt_overload_mask);
}
+
+static void update_rt_migration(struct rq *rq)
+{
+ if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+ rt_set_overload(rq);
+ else
+ rt_clear_overload(rq);
+}
#endif /* CONFIG_SMP */
/*
@@ -64,8 +72,10 @@ static inline void inc_rt_tasks(struct t
#ifdef CONFIG_SMP
if (p->prio < rq->rt.highest_prio)
rq->rt.highest_prio = p->prio;
- if (rq->rt.rt_nr_running > 1)
- rt_set_overload(rq);
+ if (p->nr_cpus_allowed > 1)
+ rq->rt.rt_nr_migratory++;
+
+ update_rt_migration(rq);
#endif /* CONFIG_SMP */
}
@@ -87,8 +97,10 @@ static inline void dec_rt_tasks(struct t
} /* otherwise leave rq->highest prio alone */
} else
rq->rt.highest_prio = MAX_RT_PRIO;
- if (rq->rt.rt_nr_running < 2)
- rt_clear_overload(rq);
+ if (p->nr_cpus_allowed > 1)
+ rq->rt.rt_nr_migratory--;
+
+ update_rt_migration(rq);
#endif /* CONFIG_SMP */
}
@@ -179,7 +191,8 @@ static void deactivate_task(struct rq *r
static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
{
if (!task_running(rq, p) &&
- (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+ (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)) &&
+ (p->nr_cpus_allowed > 1))
return 1;
return 0;
}
@@ -581,6 +594,32 @@ move_one_task_rt(struct rq *this_rq, int
/* don't touch RT tasks */
return 0;
}
+static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t *new_mask)
+{
+ int weight = cpus_weight(*new_mask);
+
+ BUG_ON(!rt_task(p));
+
+ /*
+ * Update the migration status of the RQ if we have an RT task
+ * which is running AND changing its weight value.
+ */
+ if (p->se.on_rq && (weight != p->nr_cpus_allowed)) {
+ struct rq *rq = task_rq(p);
+
+ if ((p->nr_cpus_allowed <= 1) && (weight > 1))
+ rq->rt.rt_nr_migratory++;
+ else if((p->nr_cpus_allowed > 1) && (weight <= 1)) {
+ BUG_ON(!rq->rt.rt_nr_migratory);
+ rq->rt.rt_nr_migratory--;
+ }
+
+ update_rt_migration(rq);
+ }
+
+ p->cpus_allowed = *new_mask;
+ p->nr_cpus_allowed = weight;
+}
#else /* CONFIG_SMP */
# define schedule_tail_balance_rt(rq) do { } while (0)
# define schedule_balance_rt(rq, prev) do { } while (0)
@@ -632,6 +671,7 @@ const struct sched_class rt_sched_class
#ifdef CONFIG_SMP
.load_balance = load_balance_rt,
.move_one_task = move_one_task_rt,
+ .set_cpus_allowed = set_cpus_allowed_rt,
#endif
.set_curr_task = set_curr_task_rt,
--
* [PATCH v4 09/20] RT: Consistency cleanup for this_rq usage
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-cleanup-thisrq.patch --]
[-- Type: text/plain, Size: 2709 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
"this_rq" is normally used to denote the RQ on the current cpu
(i.e. "cpu_rq(this_cpu)"). So clean up the usage of this_rq to be
more consistent with the rest of the code.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:02.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:03.000000000 -0500
@@ -325,21 +325,21 @@ static struct rq *find_lock_lowest_rq(st
* running task can migrate over to a CPU that is running a task
* of lesser priority.
*/
-static int push_rt_task(struct rq *this_rq)
+static int push_rt_task(struct rq *rq)
{
struct task_struct *next_task;
struct rq *lowest_rq;
int ret = 0;
int paranoid = RT_MAX_TRIES;
- assert_spin_locked(&this_rq->lock);
+ assert_spin_locked(&rq->lock);
- next_task = pick_next_highest_task_rt(this_rq, -1);
+ next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
retry:
- if (unlikely(next_task == this_rq->curr)) {
+ if (unlikely(next_task == rq->curr)) {
WARN_ON(1);
return 0;
}
@@ -349,24 +349,24 @@ static int push_rt_task(struct rq *this_
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < this_rq->curr->prio)) {
- resched_task(this_rq->curr);
+ if (unlikely(next_task->prio < rq->curr->prio)) {
+ resched_task(rq->curr);
return 0;
}
- /* We might release this_rq lock */
+ /* We might release rq lock */
get_task_struct(next_task);
/* find_lock_lowest_rq locks the rq if found */
- lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+ lowest_rq = find_lock_lowest_rq(next_task, rq);
if (!lowest_rq) {
struct task_struct *task;
/*
- * find lock_lowest_rq releases this_rq->lock
+ * find lock_lowest_rq releases rq->lock
* so it is possible that next_task has changed.
* If it has, then try again.
*/
- task = pick_next_highest_task_rt(this_rq, -1);
+ task = pick_next_highest_task_rt(rq, -1);
if (unlikely(task != next_task) && task && paranoid--) {
put_task_struct(next_task);
next_task = task;
@@ -377,7 +377,7 @@ static int push_rt_task(struct rq *this_
assert_spin_locked(&lowest_rq->lock);
- deactivate_task(this_rq, next_task, 0);
+ deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
--
* [PATCH v4 10/20] RT: Remove some CFS specific code from the wakeup path of RT tasks
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-de-cfsize-rt-path.patch --]
[-- Type: text/plain, Size: 13580 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by doing load calculations. The problem is that
these calculations are only relevant to CFS tasks, where load is king. For RT
tasks, priority is king, so the load calculation is completely wasted
bandwidth.
Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions, and move the load calculation into the
CFS class where it belongs.
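For illustration, with this hook in place the wake-up path can simply ask
the task's scheduling class where the task should run, roughly like this
(a sketch of the intended use; the actual call site is in a part of the
patch not quoted in this excerpt):

	new_cpu = p->sched_class->select_task_rq(p, sync);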
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
include/linux/sched.h | 1
kernel/sched.c | 167 +++++++-----------------------------------------
kernel/sched_fair.c | 148 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched_idletask.c | 9 ++
kernel/sched_rt.c | 10 ++
5 files changed, 195 insertions(+), 140 deletions(-)
Index: linux-compile.git/include/linux/sched.h
===================================================================
--- linux-compile.git.orig/include/linux/sched.h 2007-11-20 19:53:02.000000000 -0500
+++ linux-compile.git/include/linux/sched.h 2007-11-20 19:53:04.000000000 -0500
@@ -823,6 +823,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
+ int (*select_task_rq)(struct task_struct *p, int sync);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:53:02.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:53:04.000000000 -0500
@@ -860,6 +860,13 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif
+#ifdef CONFIG_SMP
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_avg_load_per_task(int cpu);
+static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
+#endif /* CONFIG_SMP */
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -1045,7 +1052,7 @@ static inline void __set_task_cpu(struct
/*
* Is this task likely cache-hot:
*/
-static inline int
+static int
task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
{
s64 delta;
@@ -1270,7 +1277,7 @@ static unsigned long target_load(int cpu
/*
* Return the average load per task on the cpu's run queue
*/
-static inline unsigned long cpu_avg_load_per_task(int cpu)
+static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1427,58 +1434,6 @@ static int sched_balance_self(int cpu, i
#endif /* CONFIG_SMP */
-/*
- * wake_idle() will wake a task on an idle cpu if task->cpu is
- * not idle and an idle cpu is available. The span of cpus to
- * search starts with cpus closest then further out as needed,
- * so we always favor a closer, idle cpu.
- *
- * Returns the CPU we should wake onto.
- */
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-static int wake_idle(int cpu, struct task_struct *p)
-{
- cpumask_t tmp;
- struct sched_domain *sd;
- int i;
-
- /*
- * If it is idle, then it is the best cpu to run this task.
- *
- * This cpu is also the best, if it has more than one task already.
- * Siblings must be also busy(in most cases) as they didn't already
- * pickup the extra load from this cpu and hence we need not check
- * sibling runqueue info. This will avoid the checks and cache miss
- * penalities associated with that.
- */
- if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
- return cpu;
-
- for_each_domain(cpu, sd) {
- if (sd->flags & SD_WAKE_IDLE) {
- cpus_and(tmp, sd->span, p->cpus_allowed);
- for_each_cpu_mask(i, tmp) {
- if (idle_cpu(i)) {
- if (i != task_cpu(p)) {
- schedstat_inc(p,
- se.nr_wakeups_idle);
- }
- return i;
- }
- }
- } else {
- break;
- }
- }
- return cpu;
-}
-#else
-static inline int wake_idle(int cpu, struct task_struct *p)
-{
- return cpu;
-}
-#endif
-
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -1500,8 +1455,6 @@ static int try_to_wake_up(struct task_st
long old_state;
struct rq *rq;
#ifdef CONFIG_SMP
- struct sched_domain *sd, *this_sd = NULL;
- unsigned long load, this_load;
int new_cpu;
#endif
@@ -1521,90 +1474,7 @@ static int try_to_wake_up(struct task_st
if (unlikely(task_running(rq, p)))
goto out_activate;
- new_cpu = cpu;
-
- schedstat_inc(rq, ttwu_count);
- if (cpu == this_cpu) {
- schedstat_inc(rq, ttwu_local);
- goto out_set_cpu;
- }
-
- for_each_domain(this_cpu, sd) {
- if (cpu_isset(cpu, sd->span)) {
- schedstat_inc(sd, ttwu_wake_remote);
- this_sd = sd;
- break;
- }
- }
-
- if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
- goto out_set_cpu;
-
- /*
- * Check for affine wakeup and passive balancing possibilities.
- */
- if (this_sd) {
- int idx = this_sd->wake_idx;
- unsigned int imbalance;
-
- imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
-
- load = source_load(cpu, idx);
- this_load = target_load(this_cpu, idx);
-
- new_cpu = this_cpu; /* Wake to this CPU if we can */
-
- if (this_sd->flags & SD_WAKE_AFFINE) {
- unsigned long tl = this_load;
- unsigned long tl_per_task;
-
- /*
- * Attract cache-cold tasks on sync wakeups:
- */
- if (sync && !task_hot(p, rq->clock, this_sd))
- goto out_set_cpu;
-
- schedstat_inc(p, se.nr_wakeups_affine_attempts);
- tl_per_task = cpu_avg_load_per_task(this_cpu);
-
- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync)
- tl -= current->se.load.weight;
-
- if ((tl <= load &&
- tl + target_load(cpu, idx) <= tl_per_task) ||
- 100*(tl + p->se.load.weight) <= imbalance*load) {
- /*
- * This domain has SD_WAKE_AFFINE and
- * p is cache cold in this domain, and
- * there is no bad imbalance.
- */
- schedstat_inc(this_sd, ttwu_move_affine);
- schedstat_inc(p, se.nr_wakeups_affine);
- goto out_set_cpu;
- }
- }
-
- /*
- * Start passive balancing when half the imbalance_pct
- * limit is reached.
- */
- if (this_sd->flags & SD_WAKE_BALANCE) {
- if (imbalance*this_load <= 100*load) {
- schedstat_inc(this_sd, ttwu_move_balance);
- schedstat_inc(p, se.nr_wakeups_passive);
- goto out_set_cpu;
- }
- }
- }
-
- new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
-out_set_cpu:
- new_cpu = wake_idle(new_cpu, p);
+ new_cpu = p->sched_class->select_task_rq(p, sync);
if (new_cpu != cpu) {
set_task_cpu(p, new_cpu);
task_rq_unlock(rq, &flags);
@@ -1620,6 +1490,23 @@ out_set_cpu:
cpu = task_cpu(p);
}
+#ifdef CONFIG_SCHEDSTATS
+ schedstat_inc(rq, ttwu_count);
+ if (cpu == this_cpu)
+ schedstat_inc(rq, ttwu_local);
+ else {
+ struct sched_domain *sd;
+ for_each_domain(this_cpu, sd) {
+ if (cpu_isset(cpu, sd->span)) {
+ schedstat_inc(sd, ttwu_wake_remote);
+ break;
+ }
+ }
+ }
+
+#endif
+
+
out_activate:
#endif /* CONFIG_SMP */
schedstat_inc(p, se.nr_wakeups);
Index: linux-compile.git/kernel/sched_fair.c
===================================================================
--- linux-compile.git.orig/kernel/sched_fair.c 2007-11-20 19:52:43.000000000 -0500
+++ linux-compile.git/kernel/sched_fair.c 2007-11-20 19:53:04.000000000 -0500
@@ -830,6 +830,151 @@ static void yield_task_fair(struct rq *r
}
/*
+ * wake_idle() will wake a task on an idle cpu if task->cpu is
+ * not idle and an idle cpu is available. The span of cpus to
+ * search starts with cpus closest then further out as needed,
+ * so we always favor a closer, idle cpu.
+ *
+ * Returns the CPU we should wake onto.
+ */
+#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
+static int wake_idle(int cpu, struct task_struct *p)
+{
+ cpumask_t tmp;
+ struct sched_domain *sd;
+ int i;
+
+ /*
+ * If it is idle, then it is the best cpu to run this task.
+ *
+ * This cpu is also the best, if it has more than one task already.
+ * Siblings must be also busy(in most cases) as they didn't already
+ * pickup the extra load from this cpu and hence we need not check
+ * sibling runqueue info. This will avoid the checks and cache miss
+ * penalities associated with that.
+ */
+ if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
+ return cpu;
+
+ for_each_domain(cpu, sd) {
+ if (sd->flags & SD_WAKE_IDLE) {
+ cpus_and(tmp, sd->span, p->cpus_allowed);
+ for_each_cpu_mask(i, tmp) {
+ if (idle_cpu(i)) {
+ if (i != task_cpu(p)) {
+ schedstat_inc(p,
+ se.nr_wakeups_idle);
+ }
+ return i;
+ }
+ }
+ } else {
+ break;
+ }
+ }
+ return cpu;
+}
+#else
+static inline int wake_idle(int cpu, struct task_struct *p)
+{
+ return cpu;
+}
+#endif
+
+#ifdef CONFIG_SMP
+static int select_task_rq_fair(struct task_struct *p, int sync)
+{
+ int cpu, this_cpu;
+ struct rq *rq;
+ struct sched_domain *sd, *this_sd = NULL;
+ int new_cpu;
+
+ cpu = task_cpu(p);
+ rq = task_rq(p);
+ this_cpu = smp_processor_id();
+ new_cpu = cpu;
+
+ for_each_domain(this_cpu, sd) {
+ if (cpu_isset(cpu, sd->span)) {
+ this_sd = sd;
+ break;
+ }
+ }
+
+ if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
+ goto out_set_cpu;
+
+ /*
+ * Check for affine wakeup and passive balancing possibilities.
+ */
+ if (this_sd) {
+ int idx = this_sd->wake_idx;
+ unsigned int imbalance;
+ unsigned long load, this_load;
+
+ imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
+
+ load = source_load(cpu, idx);
+ this_load = target_load(this_cpu, idx);
+
+ new_cpu = this_cpu; /* Wake to this CPU if we can */
+
+ if (this_sd->flags & SD_WAKE_AFFINE) {
+ unsigned long tl = this_load;
+ unsigned long tl_per_task;
+
+ /*
+ * Attract cache-cold tasks on sync wakeups:
+ */
+ if (sync && !task_hot(p, rq->clock, this_sd))
+ goto out_set_cpu;
+
+ schedstat_inc(p, se.nr_wakeups_affine_attempts);
+ tl_per_task = cpu_avg_load_per_task(this_cpu);
+
+ /*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync)
+ tl -= current->se.load.weight;
+
+ if ((tl <= load &&
+ tl + target_load(cpu, idx) <= tl_per_task) ||
+ 100*(tl + p->se.load.weight) <= imbalance*load) {
+ /*
+ * This domain has SD_WAKE_AFFINE and
+ * p is cache cold in this domain, and
+ * there is no bad imbalance.
+ */
+ schedstat_inc(this_sd, ttwu_move_affine);
+ schedstat_inc(p, se.nr_wakeups_affine);
+ goto out_set_cpu;
+ }
+ }
+
+ /*
+ * Start passive balancing when half the imbalance_pct
+ * limit is reached.
+ */
+ if (this_sd->flags & SD_WAKE_BALANCE) {
+ if (imbalance*this_load <= 100*load) {
+ schedstat_inc(this_sd, ttwu_move_balance);
+ schedstat_inc(p, se.nr_wakeups_passive);
+ goto out_set_cpu;
+ }
+ }
+ }
+
+ new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
+out_set_cpu:
+ return wake_idle(new_cpu, p);
+}
+#endif /* CONFIG_SMP */
+
+
+/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
@@ -1102,6 +1247,9 @@ static const struct sched_class fair_sch
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_fair,
+#endif /* CONFIG_SMP */
.check_preempt_curr = check_preempt_wakeup,
Index: linux-compile.git/kernel/sched_idletask.c
===================================================================
--- linux-compile.git.orig/kernel/sched_idletask.c 2007-11-20 19:52:43.000000000 -0500
+++ linux-compile.git/kernel/sched_idletask.c 2007-11-20 19:53:04.000000000 -0500
@@ -5,6 +5,12 @@
* handled in sched_fair.c)
*/
+#ifdef CONFIG_SMP
+static int select_task_rq_idle(struct task_struct *p, int sync)
+{
+ return task_cpu(p); /* IDLE tasks are never migrated */
+}
+#endif /* CONFIG_SMP */
/*
* Idle tasks are unconditionally rescheduled:
*/
@@ -72,6 +78,9 @@ const struct sched_class idle_sched_clas
/* dequeue is not valid, we print a debug message there: */
.dequeue_task = dequeue_task_idle,
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_idle,
+#endif /* CONFIG_SMP */
.check_preempt_curr = check_preempt_curr_idle,
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:03.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:04.000000000 -0500
@@ -147,6 +147,13 @@ yield_task_rt(struct rq *rq)
requeue_task_rt(rq, rq->curr);
}
+#ifdef CONFIG_SMP
+static int select_task_rq_rt(struct task_struct *p, int sync)
+{
+ return task_cpu(p);
+}
+#endif /* CONFIG_SMP */
+
/*
* Preempt the current task with a newly woken task if needed:
*/
@@ -662,6 +669,9 @@ const struct sched_class rt_sched_class
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_rt,
+#endif /* CONFIG_SMP */
.check_preempt_curr = check_preempt_curr_rt,
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 11/20] RT: Break out the search function
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (9 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 10/20] RT: Remove some CFS specific code from the wakeup path of RT tasks Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 12/20] RT: Allow current_cpu to be included in search Steven Rostedt
` (9 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-break-out-search.patch --]
[-- Type: text/plain, Size: 3266 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
Isolate the search logic into a function so that it can be used later
in places other than find_lock_lowest_rq().
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 66 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 27 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:04.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:05.000000000 -0500
@@ -260,54 +260,66 @@ static struct task_struct *pick_next_hig
static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *this_rq)
+static int find_lowest_rq(struct task_struct *task)
{
- struct rq *lowest_rq = NULL;
int cpu;
- int tries;
cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask);
+ struct rq *lowest_rq = NULL;
cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed);
- for (tries = 0; tries < RT_MAX_TRIES; tries++) {
- /*
- * Scan each rq for the lowest prio.
- */
- for_each_cpu_mask(cpu, *cpu_mask) {
- struct rq *rq = &per_cpu(runqueues, cpu);
+ /*
+ * Scan each rq for the lowest prio.
+ */
+ for_each_cpu_mask(cpu, *cpu_mask) {
+ struct rq *rq = cpu_rq(cpu);
- if (cpu == this_rq->cpu)
- continue;
+ if (cpu == rq->cpu)
+ continue;
- /* We look for lowest RT prio or non-rt CPU */
- if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- lowest_rq = rq;
- break;
- }
+ /* We look for lowest RT prio or non-rt CPU */
+ if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+ lowest_rq = rq;
+ break;
+ }
- /* no locking for now */
- if (rq->rt.highest_prio > task->prio &&
- (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
- lowest_rq = rq;
- }
+ /* no locking for now */
+ if (rq->rt.highest_prio > task->prio &&
+ (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+ lowest_rq = rq;
}
+ }
+
+ return lowest_rq ? lowest_rq->cpu : -1;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+ struct rq *rq)
+{
+ struct rq *lowest_rq = NULL;
+ int cpu;
+ int tries;
- if (!lowest_rq)
+ for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+ cpu = find_lowest_rq(task);
+
+ if (cpu == -1)
break;
+ lowest_rq = cpu_rq(cpu);
+
/* if the prio of this runqueue changed, try again */
- if (double_lock_balance(this_rq, lowest_rq)) {
+ if (double_lock_balance(rq, lowest_rq)) {
/*
* We had to unlock the run queue. In
* the mean time, task could have
* migrated already or had its affinity changed.
* Also make sure that it wasn't scheduled on its rq.
*/
- if (unlikely(task_rq(task) != this_rq ||
+ if (unlikely(task_rq(task) != rq ||
!cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
- task_running(this_rq, task) ||
+ task_running(rq, task) ||
!task->se.on_rq)) {
spin_unlock(&lowest_rq->lock);
lowest_rq = NULL;
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 12/20] RT: Allow current_cpu to be included in search
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (10 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 11/20] RT: Break out the search function Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 13/20] RT: Pre-route RT tasks on wakeup Steven Rostedt
` (8 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-include-thiscpu.patch --]
[-- Type: text/plain, Size: 1235 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
It doesn't hurt if we allow the current CPU to be included in the
search. We will simply skip it later if the current CPU turns out
to be the lowest.
We will use this later in the series.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:05.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:07.000000000 -0500
@@ -274,9 +274,6 @@ static int find_lowest_rq(struct task_st
for_each_cpu_mask(cpu, *cpu_mask) {
struct rq *rq = cpu_rq(cpu);
- if (cpu == rq->cpu)
- continue;
-
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
lowest_rq = rq;
@@ -304,7 +301,7 @@ static struct rq *find_lock_lowest_rq(st
for (tries = 0; tries < RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);
- if (cpu == -1)
+ if ((cpu == -1) || (cpu == rq->cpu))
break;
lowest_rq = cpu_rq(cpu);
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 13/20] RT: Pre-route RT tasks on wakeup
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (11 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 12/20] RT: Allow current_cpu to be included in search Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 14/20] RT: Optimize our cpu selection based on topology Steven Rostedt
` (7 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-pre-route-on-wakeup.patch --]
[-- Type: text/plain, Size: 2333 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
In the original patch series that Steven Rostedt and I worked on together,
we each took a different approach to the low-priority wakeup path. I utilized
a "pre-routing" approach (push the task away to a less important RQ before
activating it), while Steve utilized a "post-routing" approach. The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and peripherally related burdens. The advantage of Steve's method is
that it neatly solves an issue preventing a "pull" optimization from being
deployed.
In the end, we deployed Steve's idea. But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.
The idea is simple: Use a "light-weight" lookup for pre-routing, since we
only need to approximate a good home for the task. And we also retain the
post-routing push logic to clean up any "priority mistargeting" inaccuracies
introduced by the lightweight lookup. Most of the time, the pre-routing should
work and yield lower overhead. In the cases where it doesn't, the post-router
will bat cleanup.
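A self-contained sketch of the combined flow (priorities follow the kernel
convention that a lower number is more important; the per-CPU values and the
MAX_RT_PRIO stand-in below are invented for the example):

#include <stdio.h>

#define NR_CPUS		4
#define MAX_RT_PRIO	100	/* stand-in: "no RT task queued here" */

/* Numerically lowest (most important) RT prio queued on each CPU. */
static int highest_prio[NR_CPUS] = { 10, MAX_RT_PRIO, 5, MAX_RT_PRIO };

/* Light-weight lookup: the runqueue whose best RT task is the least
 * important one that our task still beats, or -1 if there is none. */
static int find_lowest_rq(int task_prio)
{
	int cpu, best = -1, best_prio = -1;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (highest_prio[cpu] > task_prio &&
		    highest_prio[cpu] > best_prio) {
			best_prio = highest_prio[cpu];
			best = cpu;
		}
	}
	return best;
}

/* Pre-route: only look for another home when the waking task would not
 * preempt its own runqueue anyway. */
static int select_task_rq_rt(int task_prio, int task_cpu)
{
	if (task_prio >= highest_prio[task_cpu]) {
		int cpu = find_lowest_rq(task_prio);

		return cpu == -1 ? task_cpu : cpu;
	}
	/* Otherwise stay put; the post-schedule push logic bats cleanup. */
	return task_cpu;
}

int main(void)
{
	printf("prio-20 task on cpu 0 pre-routed to cpu %d\n",
	       select_task_rq_rt(20, 0));
	return 0;
}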
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:07.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:08.000000000 -0500
@@ -148,8 +148,27 @@ yield_task_rt(struct rq *rq)
}
#ifdef CONFIG_SMP
+static int find_lowest_rq(struct task_struct *task);
+
static int select_task_rq_rt(struct task_struct *p, int sync)
{
+ struct rq *rq = task_rq(p);
+
+ /*
+ * If the task will not preempt the RQ, try to find a better RQ
+ * before we even activate the task
+ */
+ if ((p->prio >= rq->rt.highest_prio)
+ && (p->nr_cpus_allowed > 1)) {
+ int cpu = find_lowest_rq(p);
+
+ return (cpu == -1) ? task_cpu(p) : cpu;
+ }
+
+ /*
+ * Otherwise, just let it ride on the affined RQ and the
+ * post-schedule router will push the preempted task away
+ */
return task_cpu(p);
}
#endif /* CONFIG_SMP */
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 14/20] RT: Optimize our cpu selection based on topology
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (12 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 13/20] RT: Pre-route RT tasks on wakeup Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 15/20] RT: Optimize rebalancing Steven Rostedt
` (6 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-optimize-affinity.patch --]
[-- Type: text/plain, Size: 5063 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally. In the real world, there are
various topologies and affinities that govern where a task is best suited to
run with the smallest amount of overhead. NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.
Fortunately, Linux is already structured to represent these topologies via
the sched_domains interface. So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.
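Roughly, the selection order looks like this (a standalone model with a plain
bitmask in place of cpumask_t; the sched_domain walk between the two fallbacks
is elided here, but the real code tries each SD_WAKE_AFFINE span first):

#include <stdio.h>
#include <stdint.h>

typedef uint32_t cpumask;	/* stand-in for the per-cpu lowest_mask */

static int cpu_isset(int cpu, cpumask mask)
{
	return (mask >> cpu) & 1;
}

static int first_cpu(cpumask mask)
{
	int cpu;

	for (cpu = 0; cpu < 32; cpu++)
		if (cpu_isset(cpu, mask))
			return cpu;
	return -1;
}

/* Preference order: the CPU the task last ran on (its cache is likely
 * still hot there), then the waking CPU, then anything else. */
static int find_lowest_rq(cpumask lowest_mask, int task_cpu, int this_cpu)
{
	if (cpu_isset(task_cpu, lowest_mask))
		return task_cpu;
	if (cpu_isset(this_cpu, lowest_mask))
		return this_cpu;
	return first_cpu(lowest_mask);
}

int main(void)
{
	/* CPUs 1 and 3 can take the task; it last ran on 2, waker is on 3. */
	printf("target cpu %d\n", find_lowest_rq(0x0a, 2, 3));
	return 0;
}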
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 1
kernel/sched_rt.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 89 insertions(+), 12 deletions(-)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:53:04.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:53:09.000000000 -0500
@@ -24,6 +24,7 @@
* 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri
* 2007-10-22 RT overload balancing by Steven Rostedt
* (with thanks to Gregory Haskins)
+ * 2007-11-05 RT migration/wakeup tuning by Gregory Haskins
*/
#include <linux/mm.h>
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:08.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:09.000000000 -0500
@@ -278,35 +278,111 @@ static struct task_struct *pick_next_hig
}
static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
+static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
- int cpu;
- cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask);
- struct rq *lowest_rq = NULL;
+ int cpu;
+ cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
+ int lowest_prio = -1;
+ int ret = 0;
- cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed);
+ cpus_clear(*lowest_mask);
+ cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
/*
* Scan each rq for the lowest prio.
*/
- for_each_cpu_mask(cpu, *cpu_mask) {
+ for_each_cpu_mask(cpu, *valid_mask) {
struct rq *rq = cpu_rq(cpu);
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- lowest_rq = rq;
- break;
+ if (ret)
+ cpus_clear(*lowest_mask);
+ cpu_set(rq->cpu, *lowest_mask);
+ return 1;
}
/* no locking for now */
- if (rq->rt.highest_prio > task->prio &&
- (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
- lowest_rq = rq;
+ if ((rq->rt.highest_prio > task->prio)
+ && (rq->rt.highest_prio >= lowest_prio)) {
+ if (rq->rt.highest_prio > lowest_prio) {
+ /* new low - clear old data */
+ lowest_prio = rq->rt.highest_prio;
+ cpus_clear(*lowest_mask);
+ }
+ cpu_set(rq->cpu, *lowest_mask);
+ ret = 1;
+ }
+ }
+
+ return ret;
+}
+
+static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
+{
+ int first;
+
+ /* "this_cpu" is cheaper to preempt than a remote processor */
+ if ((this_cpu != -1) && cpu_isset(this_cpu, *mask))
+ return this_cpu;
+
+ first = first_cpu(*mask);
+ if (first != NR_CPUS)
+ return first;
+
+ return -1;
+}
+
+static int find_lowest_rq(struct task_struct *task)
+{
+ struct sched_domain *sd;
+ cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
+ int this_cpu = smp_processor_id();
+ int cpu = task_cpu(task);
+
+ if (!find_lowest_cpus(task, lowest_mask))
+ return -1;
+
+ /*
+ * At this point we have built a mask of cpus representing the
+ * lowest priority tasks in the system. Now we want to elect
+ * the best one based on our affinity and topology.
+ *
+ * We prioritize the last cpu that the task executed on since
+ * it is most likely cache-hot in that location.
+ */
+ if (cpu_isset(cpu, *lowest_mask))
+ return cpu;
+
+ /*
+ * Otherwise, we consult the sched_domains span maps to figure
+ * out which cpu is logically closest to our hot cache data.
+ */
+ if (this_cpu == cpu)
+ this_cpu = -1; /* Skip this_cpu opt if the same */
+
+ for_each_domain(cpu, sd) {
+ if (sd->flags & SD_WAKE_AFFINE) {
+ cpumask_t domain_mask;
+ int best_cpu;
+
+ cpus_and(domain_mask, sd->span, *lowest_mask);
+
+ best_cpu = pick_optimal_cpu(this_cpu,
+ &domain_mask);
+ if (best_cpu != -1)
+ return best_cpu;
}
}
- return lowest_rq ? lowest_rq->cpu : -1;
+ /*
+ * And finally, if there were no matches within the domains
+ * just give the caller *something* to work with from the compatible
+ * locations.
+ */
+ return pick_optimal_cpu(this_cpu, lowest_mask);
}
/* Will lock the rq it finds */
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 15/20] RT: Optimize rebalancing
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (13 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 14/20] RT: Optimize our cpu selection based on topology Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 16/20] Avoid overload Steven Rostedt
` (5 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-wake-balance-fixes.patch --]
[-- Type: text/plain, Size: 2675 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away. So we add support
for considering this new information.
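A minimal model of the gate this adds (simplified structs, not the kernel's):

#include <stdio.h>

/* The per-runqueue "overloaded" flag, kept in sync by rt_set_overload()
 * and rt_clear_overload(), lets the push path bail out before doing any
 * searching at all. */
struct rt_rq { int overloaded; };
struct rq { struct rt_rq rt; };

static int push_rt_task(struct rq *rq)
{
	if (!rq->rt.overloaded)
		return 0;	/* nothing migratable here, skip the work */

	/* ... otherwise pick the next highest task and try to push it ... */
	return 1;
}

int main(void)
{
	struct rq quiet = { .rt = { .overloaded = 0 } };
	struct rq busy  = { .rt = { .overloaded = 1 } };

	printf("quiet rq pushed: %d, busy rq pushed: %d\n",
	       push_rt_task(&quiet), push_rt_task(&busy));
	return 0;
}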
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 2 ++
kernel/sched_rt.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:53:09.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:53:10.000000000 -0500
@@ -273,6 +273,7 @@ struct rt_rq {
unsigned long rt_nr_migratory;
/* highest queued rt task prio */
int highest_prio;
+ int overloaded;
};
/*
@@ -6685,6 +6686,7 @@ void __init sched_init(void)
rq->migration_thread = NULL;
INIT_LIST_HEAD(&rq->migration_queue);
rq->rt.highest_prio = MAX_RT_PRIO;
+ rq->rt.overloaded = 0;
#endif
atomic_set(&rq->nr_iowait, 0);
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:09.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:10.000000000 -0500
@@ -16,6 +16,7 @@ static inline cpumask_t *rt_overload(voi
}
static inline void rt_set_overload(struct rq *rq)
{
+ rq->rt.overloaded = 1;
cpu_set(rq->cpu, rt_overload_mask);
/*
* Make sure the mask is visible before we set
@@ -32,6 +33,7 @@ static inline void rt_clear_overload(str
/* the order here really doesn't matter */
atomic_dec(&rto_count);
cpu_clear(rq->cpu, rt_overload_mask);
+ rq->rt.overloaded = 0;
}
static void update_rt_migration(struct rq *rq)
@@ -445,6 +447,9 @@ static int push_rt_task(struct rq *rq)
assert_spin_locked(&rq->lock);
+ if (!rq->rt.overloaded)
+ return 0;
+
next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
@@ -672,7 +677,7 @@ static void schedule_tail_balance_rt(str
* the lock was owned by prev, we need to release it
* first via finish_lock_switch and then reaquire it here.
*/
- if (unlikely(rq->rt.rt_nr_running > 1)) {
+ if (unlikely(rq->rt.overloaded)) {
spin_lock_irq(&rq->lock);
push_rt_tasks(rq);
spin_unlock_irq(&rq->lock);
@@ -684,7 +689,8 @@ static void wakeup_balance_rt(struct rq
{
if (unlikely(rt_task(p)) &&
!task_running(rq, p) &&
- (p->prio >= rq->curr->prio))
+ (p->prio >= rq->rt.highest_prio) &&
+ rq->rt.overloaded)
push_rt_tasks(rq);
}
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 16/20] Avoid overload
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (14 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 15/20] RT: Optimize rebalancing Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 17/20] RT: restore the migratable conditional Steven Rostedt
` (4 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-balance-avoid-overloading.patch --]
[-- Type: text/plain, Size: 2804 bytes --]
This patch changes the runqueue search for a waking RT task so that it
tries to pick another runqueue if the currently running task
is an RT task.
The reason is that RT tasks behave differently than normal
tasks. Preempting a normal task to run an RT task to keep
its cache hot is fine, because the preempted non-RT task
may wait on that same runqueue to run again unless the
migration thread comes along and pulls it off.
RT tasks behave differently. If one is preempted, it makes
an active effort to continue to run. So by having a high
priority task preempt a lower priority RT task, that lower
RT task will then quickly try to run on another runqueue.
This will cause that lower RT task to replace its nice
hot cache (and TLB) with a completely cold one. This is
done in the hope that the new high priority RT task will keep
its cache hot.
Remember that this high priority RT task was just woken up.
So it may well have been sleeping for several milliseconds,
and will end up with a cold cache anyway. RT tasks run till
they voluntarily stop, or are preempted by a higher priority
task. This means that it is unlikely that the woken RT task
will have a hot cache to wake up to. So pushing off a lower
RT task is just killing its cache for no good reason.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:10.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:11.000000000 -0500
@@ -157,11 +157,23 @@ static int select_task_rq_rt(struct task
struct rq *rq = task_rq(p);
/*
- * If the task will not preempt the RQ, try to find a better RQ
- * before we even activate the task
+ * If the current task is an RT task, then
+ * try to see if we can wake this RT task up on another
+ * runqueue. Otherwise simply start this RT task
+ * on its current runqueue.
+ *
+ * We want to avoid overloading runqueues. Even if
+ * the RT task is of higher priority than the current RT task.
+ * RT tasks behave differently than other tasks. If
+ * one gets preempted, we try to push it off to another queue.
+ * So trying to keep a preempting RT task on the same
+ * cache hot CPU will force the running RT task to
+ * a cold CPU. So we waste all the cache for the lower
+ * RT task in hopes of saving some of a RT task
+ * that is just being woken and probably will have
+ * cold cache anyway.
*/
- if ((p->prio >= rq->rt.highest_prio)
- && (p->nr_cpus_allowed > 1)) {
+ if (unlikely(rt_task(rq->curr))) {
int cpu = find_lowest_rq(p);
return (cpu == -1) ? task_cpu(p) : cpu;
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 17/20] RT: restore the migratable conditional
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (15 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 16/20] Avoid overload Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 18/20] Optimize cpu search with hamming weight Steven Rostedt
` (3 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: sched-renable-migratable-factor.patch --]
[-- Type: text/plain, Size: 902 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
We don't need to bother searching if the task cannot be migrated
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:11.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:13.000000000 -0500
@@ -173,7 +173,8 @@ static int select_task_rq_rt(struct task
* that is just being woken and probably will have
* cold cache anyway.
*/
- if (unlikely(rt_task(rq->curr))) {
+ if (unlikely(rt_task(rq->curr)) &&
+ (p->nr_cpus_allowed > 1)) {
int cpu = find_lowest_rq(p);
return (cpu == -1) ? task_cpu(p) : cpu;
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 18/20] Optimize cpu search with hamming weight
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (16 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 17/20] RT: restore the migratable conditional Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 19/20] Optimize out cpu_clears Steven Rostedt
` (2 subsequent siblings)
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML; +Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter
[-- Attachment #1: rt-balance-find-cpu-count.patch --]
[-- Type: text/plain, Size: 2331 bytes --]
From: Gregory Haskins <ghaskins@novell.com>
We can cheaply track the number of bits set in the cpumask for the lowest
priority CPUs. Therefore, compute the mask's weight and use it to skip
the optimal domain search logic when there is only one CPU available.
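A sketch of how the caller uses the count (standalone C, with GCC builtins
standing in for the cpumask helpers; the mask contents are invented):

#include <stdio.h>
#include <stdint.h>

/* Model of the change: find_lowest_cpus() reports how many CPUs it put
 * into the mask (counted as the mask is built, so it costs nothing
 * extra), letting the caller skip the domain search when the answer is
 * already unique. */
static int find_lowest_cpus(uint32_t *lowest_mask)
{
	*lowest_mask = 0x04;		/* pretend only CPU 2 qualified */
	return __builtin_popcount(*lowest_mask);
}

static int find_lowest_rq(void)
{
	uint32_t lowest_mask;
	int count = find_lowest_cpus(&lowest_mask);

	if (!count)
		return -1;		/* no targets found */

	if (count == 1)			/* unique target: skip optimal search */
		return __builtin_ctz(lowest_mask);

	/* ... otherwise fall through to the topology-aware selection ... */
	return __builtin_ctz(lowest_mask);
}

int main(void)
{
	printf("chosen cpu %d\n", find_lowest_rq());
	return 0;
}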
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:13.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:15.000000000 -0500
@@ -300,7 +300,7 @@ static int find_lowest_cpus(struct task_
int cpu;
cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int lowest_prio = -1;
- int ret = 0;
+ int count = 0;
cpus_clear(*lowest_mask);
cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
@@ -313,7 +313,7 @@ static int find_lowest_cpus(struct task_
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- if (ret)
+ if (count)
cpus_clear(*lowest_mask);
cpu_set(rq->cpu, *lowest_mask);
return 1;
@@ -325,14 +325,17 @@ static int find_lowest_cpus(struct task_
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
- cpus_clear(*lowest_mask);
+ if (count) {
+ cpus_clear(*lowest_mask);
+ count = 0;
+ }
}
cpu_set(rq->cpu, *lowest_mask);
- ret = 1;
+ count++;
}
}
- return ret;
+ return count;
}
static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
@@ -356,9 +359,17 @@ static int find_lowest_rq(struct task_st
cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);
+ int count = find_lowest_cpus(task, lowest_mask);
- if (!find_lowest_cpus(task, lowest_mask))
- return -1;
+ if (!count)
+ return -1; /* No targets found */
+
+ /*
+ * There is no sense in performing an optimal search if only one
+ * target is found.
+ */
+ if (count == 1)
+ return first_cpu(*lowest_mask);
/*
* At this point we have built a mask of cpus representing the
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 19/20] Optimize out cpu_clears
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (17 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 18/20] Optimize cpu search with hamming weight Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 2:10 ` Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 20/20] balance RT tasks on new wake up Steven Rostedt
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
20 siblings, 1 reply; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-balance-optimize-cpu-search.patch --]
[-- Type: text/plain, Size: 2363 bytes --]
This patch removes several cpumask operations by keeping track
of the first of the CPUs that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
are cleared.
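A standalone model of the bookkeeping (a plain uint32_t in place of
cpumask_t, invented priorities, and the non-RT early exit left out):

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS		8

/* Numerically lowest RT prio queued on each CPU; values are invented. */
static int highest_prio[NR_CPUS] = { 3, 40, 40, 7, 40, 0, 0, 0 };

/* Instead of clearing the whole candidate mask every time a new lowest
 * priority is found, remember where the run of current-lowest CPUs
 * starts (lowest_cpu), clear non-candidates inline, and sweep away the
 * stale low bits once at the end. */
static int find_lowest_cpus(int task_prio, uint32_t *lowest_mask)
{
	int lowest_prio = -1, lowest_cpu = 0, count = 0, cpu;

	*lowest_mask = 0x1f;	/* online & allowed: CPUs 0-4 here */

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!((*lowest_mask >> cpu) & 1))
			continue;

		if (highest_prio[cpu] > task_prio &&
		    highest_prio[cpu] >= lowest_prio) {
			if (highest_prio[cpu] > lowest_prio) {
				/* new low - remember where it starts */
				lowest_prio = highest_prio[cpu];
				lowest_cpu = cpu;
				count = 0;
			}
			count++;
		} else {
			*lowest_mask &= ~(1u << cpu);
		}
	}

	/* One sweep drops candidates recorded before the final lowest_cpu. */
	for (cpu = 0; cpu < lowest_cpu; cpu++)
		*lowest_mask &= ~(1u << cpu);

	return count;
}

int main(void)
{
	uint32_t mask;
	int count = find_lowest_cpus(10, &mask);

	printf("count=%d mask=0x%x\n", count, (unsigned)mask);
	return 0;
}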
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:15.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:17.000000000 -0500
@@ -293,22 +293,20 @@ static struct task_struct *pick_next_hig
}
static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
- int cpu;
- cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int lowest_prio = -1;
+ int lowest_cpu = 0;
int count = 0;
+ int cpu;
- cpus_clear(*lowest_mask);
- cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
+ cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
/*
* Scan each rq for the lowest prio.
*/
- for_each_cpu_mask(cpu, *valid_mask) {
+ for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);
/* We look for lowest RT prio or non-rt CPU */
@@ -325,13 +323,30 @@ static int find_lowest_cpus(struct task_
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
- if (count) {
- cpus_clear(*lowest_mask);
- count = 0;
- }
+ lowest_cpu = cpu;
+ count = 0;
}
cpu_set(rq->cpu, *lowest_mask);
count++;
+ } else
+ cpu_clear(cpu, *lowest_mask);
+ }
+
+ /*
+ * Clear out all the set bits that represent
+ * runqueues that were of higher prio than
+ * the lowest_prio.
+ */
+ if (lowest_cpu) {
+ /*
+ * Perhaps we could add another cpumask op to
+ * zero out bits. Like cpu_zero_bits(cpumask, nrbits);
+ * Then that could be optimized to use memset and such.
+ */
+ for_each_cpu_mask(cpu, *lowest_mask) {
+ if (cpu >= lowest_cpu)
+ break;
+ cpu_clear(cpu, *lowest_mask);
}
}
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v4 20/20] balance RT tasks on new wake up
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (18 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 19/20] Optimize out cpu_clears Steven Rostedt
@ 2007-11-21 1:01 ` Steven Rostedt
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
20 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 1:01 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
[-- Attachment #1: rt-balance-on-new-task.patch --]
[-- Type: text/plain, Size: 617 bytes --]
Run the RT balancing code on wake-up of a new task.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched.c | 1 +
1 file changed, 1 insertion(+)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c 2007-11-20 19:53:10.000000000 -0500
+++ linux-compile.git/kernel/sched.c 2007-11-20 19:53:18.000000000 -0500
@@ -1650,6 +1650,7 @@ void fastcall wake_up_new_task(struct ta
inc_nr_running(p, rq);
}
check_preempt_curr(rq, p);
+ wakeup_balance_rt(rq, p);
task_rq_unlock(rq, &flags);
}
--
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v4 19/20] Optimize out cpu_clears
2007-11-21 1:01 ` [PATCH v4 19/20] Optimize out cpu_clears Steven Rostedt
@ 2007-11-21 2:10 ` Steven Rostedt
2007-11-21 3:10 ` [PATCH] Fix optimized search Gregory Haskins
0 siblings, 1 reply; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 2:10 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
Steven Rostedt
On Tue, Nov 20, 2007 at 08:01:13PM -0500, Steven Rostedt wrote:
>
> static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
> -static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
>
> static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
> {
> - int cpu;
> - cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
> int lowest_prio = -1;
> + int lowest_cpu = 0;
> int count = 0;
> + int cpu;
>
> - cpus_clear(*lowest_mask);
> - cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
> + cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
>
> /*
> * Scan each rq for the lowest prio.
> */
> - for_each_cpu_mask(cpu, *valid_mask) {
> + for_each_cpu_mask(cpu, *lowest_mask) {
> struct rq *rq = cpu_rq(cpu);
>
> /* We look for lowest RT prio or non-rt CPU */
> @@ -325,13 +323,30 @@ static int find_lowest_cpus(struct task_
> if (rq->rt.highest_prio > lowest_prio) {
> /* new low - clear old data */
> lowest_prio = rq->rt.highest_prio;
> - if (count) {
> - cpus_clear(*lowest_mask);
> - count = 0;
> - }
Gregory Haskins pointed out to me that this logic is slightly wrong. I
originally wrote this patch before adding his "count" patch optimization.
I did not take into account that on finding a non-RT queue, we may leave
some extra bits set because the cpus_clear is not performed if count is
zero, and count gets set to zero here, which means that we don't clean
up.
The fix is to check for lowest_cpu > 0 instead of count on finding a
non-RT runqueue. This lets us know that we need to clear the mask.
Otherwise, if lowest_cpu == 0, then we can return the mask untouched.
The proper bit would already be set, and the return of 1 will have
the rest of the algorithm use the first bit.
Below is the updated patch. The full series is at:
http://rostedt.homelinux.com/rt/rt-balance-patches-v5.tar.bz2
> + lowest_cpu = cpu;
> + count = 0;
> }
> cpu_set(rq->cpu, *lowest_mask);
> count++;
From: Steven Rostedt <srostedt@redhat.com>
This patch removes several cpumask operations by keeping track
of the first of the CPUs that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
are cleared.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 48 ++++++++++++++++++++++++++++++++++++------------
1 file changed, 36 insertions(+), 12 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 19:53:15.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 20:35:04.000000000 -0500
@@ -293,29 +293,36 @@ static struct task_struct *pick_next_hig
}
static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
- int cpu;
- cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int lowest_prio = -1;
+ int lowest_cpu = 0;
int count = 0;
+ int cpu;
- cpus_clear(*lowest_mask);
- cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
+ cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
/*
* Scan each rq for the lowest prio.
*/
- for_each_cpu_mask(cpu, *valid_mask) {
+ for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- if (count)
+ /*
+ * if we already found a low RT queue
+ * and now we found this non-rt queue
+ * clear the mask and set our bit.
+ * Otherwise just return the queue as is
+ * and the count==1 will cause the algorithm
+ * to use the first bit found.
+ */
+ if (lowest_cpu) {
cpus_clear(*lowest_mask);
- cpu_set(rq->cpu, *lowest_mask);
+ cpu_set(rq->cpu, *lowest_mask);
+ }
return 1;
}
@@ -325,13 +332,30 @@ static int find_lowest_cpus(struct task_
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
- if (count) {
- cpus_clear(*lowest_mask);
- count = 0;
- }
+ lowest_cpu = cpu;
+ count = 0;
}
cpu_set(rq->cpu, *lowest_mask);
count++;
+ } else
+ cpu_clear(cpu, *lowest_mask);
+ }
+
+ /*
+ * Clear out all the set bits that represent
+ * runqueues that were of higher prio than
+ * the lowest_prio.
+ */
+ if (lowest_cpu) {
+ /*
+ * Perhaps we could add another cpumask op to
+ * zero out bits. Like cpu_zero_bits(cpumask, nrbits);
+ * Then that could be optimized to use memset and such.
+ */
+ for_each_cpu_mask(cpu, *lowest_mask) {
+ if (cpu >= lowest_cpu)
+ break;
+ cpu_clear(cpu, *lowest_mask);
}
}
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH] Fix optimized search
2007-11-21 2:10 ` Steven Rostedt
@ 2007-11-21 3:10 ` Gregory Haskins
2007-11-21 4:15 ` Steven Rostedt
0 siblings, 1 reply; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 3:10 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.
Patch is below..
Regards,
-Greg
-----------------
Include cpu 0 in the search, and eliminate the redundant cpu_set since
the bit should already be set in the mask.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 28feeff..fbf4fb1 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -297,7 +297,7 @@ static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
int lowest_prio = -1;
- int lowest_cpu = 0;
+ int lowest_cpu = -1;
int count = 0;
int cpu;
@@ -319,7 +319,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
* and the count==1 will cause the algorithm
* to use the first bit found.
*/
- if (lowest_cpu) {
+ if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
cpu_set(rq->cpu, *lowest_mask);
}
@@ -335,7 +335,6 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
lowest_cpu = cpu;
count = 0;
}
- cpu_set(rq->cpu, *lowest_mask);
count++;
} else
cpu_clear(cpu, *lowest_mask);
@@ -346,7 +345,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
* runqueues that were of higher prio than
* the lowest_prio.
*/
- if (lowest_cpu) {
+ if (lowest_cpu != -1) {
/*
* Perhaps we could add another cpumask op to
* zero out bits. Like cpu_zero_bits(cpumask, nrbits);
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH] Fix optimized search
2007-11-21 3:10 ` [PATCH] Fix optimized search Gregory Haskins
@ 2007-11-21 4:15 ` Steven Rostedt
2007-11-21 4:26 ` Steven Rostedt
0 siblings, 1 reply; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 4:15 UTC (permalink / raw)
To: Gregory Haskins; +Cc: LKML, Christoph Lameter, Peter Zijlstra, Ingo Molnar
Gregory Haskins wrote:
> I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.
>
> Patch is below..
Thanks, but I have one update...
>
> Regards,
> -Greg
>
> -----------------
>
> Include cpu 0 in the search, and eliminate the redundant cpu_set since
> the bit should already be set in the mask.
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
>
> kernel/sched_rt.c | 7 +++----
> 1 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 28feeff..fbf4fb1 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -297,7 +297,7 @@ static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
> static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
> {
> int lowest_prio = -1;
> - int lowest_cpu = 0;
> + int lowest_cpu = -1;
> int count = 0;
> int cpu;
>
> @@ -319,7 +319,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
> * and the count==1 will cause the algorithm
> * to use the first bit found.
> */
> - if (lowest_cpu) {
> + if (lowest_cpu != -1) {
> cpus_clear(*lowest_mask);
> cpu_set(rq->cpu, *lowest_mask);
> }
> @@ -335,7 +335,6 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
> lowest_cpu = cpu;
> count = 0;
> }
> - cpu_set(rq->cpu, *lowest_mask);
> count++;
> } else
> cpu_clear(cpu, *lowest_mask);
> @@ -346,7 +345,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
> * runqueues that were of higher prio than
> * the lowest_prio.
> */
> - if (lowest_cpu) {
> + if (lowest_cpu != -1) {
We can change this to
if (lowest_cpu > 0) {
because if lowest_cpu == 0, we don't need to bother with clearing any bits.
I'll apply this next.
Thanks.
-- Steve
> /*
> * Perhaps we could add another cpumask op to
> * zero out bits. Like cpu_zero_bits(cpumask, nrbits);
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] Fix optimized search
2007-11-21 4:15 ` Steven Rostedt
@ 2007-11-21 4:26 ` Steven Rostedt
2007-11-21 5:14 ` Gregory Haskins
0 siblings, 1 reply; 36+ messages in thread
From: Steven Rostedt @ 2007-11-21 4:26 UTC (permalink / raw)
To: Gregory Haskins; +Cc: LKML, Christoph Lameter, Peter Zijlstra, Ingo Molnar
On Tue, Nov 20, 2007 at 11:15:48PM -0500, Steven Rostedt wrote:
> Gregory Haskins wrote:
>> I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.
>> Patch is below..
>
> Thanks, but I have one update...
>
Here's the updated patch.
Oh, and Gregory, please email me at my rostedt@goodmis.org account. It
has better filters ;-)
This series is at:
http://rostedt.homelinux.com/rt/rt-balance-patches-v6.tar.bz2
===
This patch removes several cpumask operations by keeping track
of the first of the CPUs that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
are cleared.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_rt.c | 49 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 36 insertions(+), 13 deletions(-)
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c 2007-11-20 23:17:43.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 23:18:21.000000000 -0500
@@ -293,29 +293,36 @@ static struct task_struct *pick_next_hig
}
static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
- int cpu;
- cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int lowest_prio = -1;
+ int lowest_cpu = -1;
int count = 0;
+ int cpu;
- cpus_clear(*lowest_mask);
- cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
+ cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
/*
* Scan each rq for the lowest prio.
*/
- for_each_cpu_mask(cpu, *valid_mask) {
+ for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- if (count)
+ /*
+ * if we already found a low RT queue
+ * and now we found this non-rt queue
+ * clear the mask and set our bit.
+ * Otherwise just return the queue as is
+ * and the count==1 will cause the algorithm
+ * to use the first bit found.
+ */
+ if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
- cpu_set(rq->cpu, *lowest_mask);
+ cpu_set(rq->cpu, *lowest_mask);
+ }
return 1;
}
@@ -325,13 +332,29 @@ static int find_lowest_cpus(struct task_
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
- if (count) {
- cpus_clear(*lowest_mask);
- count = 0;
- }
+ lowest_cpu = cpu;
+ count = 0;
}
- cpu_set(rq->cpu, *lowest_mask);
count++;
+ } else
+ cpu_clear(cpu, *lowest_mask);
+ }
+
+ /*
+ * Clear out all the set bits that represent
+ * runqueues that were of higher prio than
+ * the lowest_prio.
+ */
+ if (lowest_cpu > 0) {
+ /*
+ * Perhaps we could add another cpumask op to
+ * zero out bits. Like cpu_zero_bits(cpumask, nrbits);
+ * Then that could be optimized to use memset and such.
+ */
+ for_each_cpu_mask(cpu, *lowest_mask) {
+ if (cpu >= lowest_cpu)
+ break;
+ cpu_clear(cpu, *lowest_mask);
}
}
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 0/4] more RT balancing enhancements
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
` (19 preceding siblings ...)
2007-11-21 1:01 ` [PATCH v4 20/20] balance RT tasks on new wake up Steven Rostedt
@ 2007-11-21 4:44 ` Gregory Haskins
2007-11-21 4:44 ` [PATCH 1/4] Fix optimized search Gregory Haskins
` (4 more replies)
20 siblings, 5 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 4:44 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
These are patches that apply to the end of the v5 series announced here:
http://lkml.org/lkml/2007/11/20/558
Steven,
These are patches that I could not finish in time to get in with the v4
release.
Ingo,
If you accept the prior work submitted by Steven and myself, please also
consider this series.
Comments/feedback welcome!
Regards,
-Greg
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 1/4] Fix optimized search
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
@ 2007-11-21 4:44 ` Gregory Haskins
2007-11-21 4:44 ` [PATCH 2/4] RT: Add sched-domain roots Gregory Haskins
` (3 subsequent siblings)
4 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 4:44 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
Include cpu 0 in the search, and eliminate the redundant cpu_set since
the bit should already be set in the mask.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 28feeff..fbf4fb1 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -297,7 +297,7 @@ static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
int lowest_prio = -1;
- int lowest_cpu = 0;
+ int lowest_cpu = -1;
int count = 0;
int cpu;
@@ -319,7 +319,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
* and the count==1 will cause the algorithm
* to use the first bit found.
*/
- if (lowest_cpu) {
+ if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
cpu_set(rq->cpu, *lowest_mask);
}
@@ -335,7 +335,6 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
lowest_cpu = cpu;
count = 0;
}
- cpu_set(rq->cpu, *lowest_mask);
count++;
} else
cpu_clear(cpu, *lowest_mask);
@@ -346,7 +345,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
* runqueues that were of higher prio than
* the lowest_prio.
*/
- if (lowest_cpu) {
+ if (lowest_cpu != -1) {
/*
* Perhaps we could add another cpumask op to
* zero out bits. Like cpu_zero_bits(cpumask, nrbits);
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 2/4] RT: Add sched-domain roots
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
2007-11-21 4:44 ` [PATCH 1/4] Fix optimized search Gregory Haskins
@ 2007-11-21 4:44 ` Gregory Haskins
2007-11-21 4:44 ` [PATCH 3/4] RT: Only balance our RT tasks within our root-domain Gregory Haskins
` (2 subsequent siblings)
4 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 4:44 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables. Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset. However, we currently still maintain some
policy/state as global variables which transcend all cpusets. Consider,
for instance, rt-overload state.
Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).
We add some plumbing for storing class specific data in our root-domain.
Whenever an RQ switches root-domains (because of repartitioning), we give
each sched_class the opportunity to remove any state from its old domain
and add state to the new one. This logic doesn't have any clients yet,
but it will later in the series.
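(As an aside, a rough user-space model of the refcounted attach/detach
lifecycle described above; the names are simplified and hypothetical, and
the real kernel code follows in the diff below.)

#include <stdlib.h>

struct root_domain {
        int refcount;           /* atomic_t in the kernel        */
        unsigned long span;     /* cpumask_t span in the kernel  */
};

struct rq {
        struct root_domain *rd;
};

static struct root_domain *alloc_rootdomain(unsigned long span)
{
        struct root_domain *rd = calloc(1, sizeof(*rd));

        if (rd)
                rd->span = span;
        return rd;
}

/* Move a runqueue to a new root-domain; the last runqueue to leave an
 * old domain frees it.  The kernel version also calls each sched_class's
 * leave_domain/join_domain hook while holding rq->lock. */
static void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
        if (rq->rd && --rq->rd->refcount == 0)
                free(rq->rd);

        rd->refcount++;
        rq->rd = rd;
}

int main(void)
{
        struct rq rq = { .rd = NULL };
        struct root_domain *rd = alloc_rootdomain(0xf); /* cpus 0-3 */

        if (!rd)
                return 1;

        rq_attach_root(&rq, rd);
        return 0;
}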
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
CC: Paul Jackson <pj@sgi.com>
CC: Simon Derr <simon.derr@bull.net>
---
include/linux/sched.h | 3 ++
kernel/sched.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 89 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 19d9f3f..0ba221e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -845,6 +845,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+ void (*join_domain)(struct rq *rq);
+ void (*leave_domain)(struct rq *rq);
};
struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index e685402..fb619fb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -276,6 +276,17 @@ struct rt_rq {
int overloaded;
};
+#ifdef CONFIG_SMP
+
+struct root_domain {
+ atomic_t refcount;
+ cpumask_t span;
+};
+
+struct root_domain def_root_domain;
+
+#endif
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -333,6 +344,7 @@ struct rq {
atomic_t nr_iowait;
#ifdef CONFIG_SMP
+ struct root_domain *rd;
struct sched_domain *sd;
/* For active balancing */
@@ -5718,11 +5730,68 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
return 1;
}
+static void rq_attach_root(struct rq *rq, struct root_domain *rd)
+{
+ unsigned long flags;
+ const struct sched_class *class;
+
+ spin_lock_irqsave(&rq->lock, flags);
+
+ if (rq->rd) {
+ struct root_domain *old_rd = rq->rd;
+
+ for (class = sched_class_highest; class; class = class->next)
+ if (class->leave_domain)
+ class->leave_domain(rq);
+
+ if (atomic_dec_and_test(&old_rd->refcount))
+ kfree(old_rd);
+ }
+
+ atomic_inc(&rd->refcount);
+ rq->rd = rd;
+
+ for (class = sched_class_highest; class; class = class->next)
+ if (class->join_domain)
+ class->join_domain(rq);
+
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+static void init_rootdomain(struct root_domain *rd, cpumask_t *map)
+{
+ memset(rd, 0, sizeof(*rd));
+
+ rd->span = *map;
+}
+
+static void init_defrootdomain(void)
+{
+ cpumask_t cpus = CPU_MASK_ALL;
+
+ init_rootdomain(&def_root_domain, &cpus);
+ atomic_set(&def_root_domain.refcount, 1);
+}
+
+static struct root_domain *alloc_rootdomain(cpumask_t *map)
+{
+ struct root_domain *rd;
+
+ rd = kmalloc(sizeof(*rd), GFP_KERNEL);
+ if (!rd)
+ return NULL;
+
+ init_rootdomain(rd, map);
+
+ return rd;
+}
+
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
*/
-static void cpu_attach_domain(struct sched_domain *sd, int cpu)
+static void cpu_attach_domain(struct sched_domain *sd,
+ struct root_domain *rd, int cpu)
{
struct rq *rq = cpu_rq(cpu);
struct sched_domain *tmp;
@@ -5747,6 +5816,7 @@ static void cpu_attach_domain(struct sched_domain *sd, int cpu)
sched_domain_debug(sd, cpu);
+ rq_attach_root(rq, rd);
rcu_assign_pointer(rq->sd, sd);
}
@@ -6115,6 +6185,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
static int build_sched_domains(const cpumask_t *cpu_map)
{
int i;
+ struct root_domain *rd;
#ifdef CONFIG_NUMA
struct sched_group **sched_group_nodes = NULL;
int sd_allnodes = 0;
@@ -6131,6 +6202,12 @@ static int build_sched_domains(const cpumask_t *cpu_map)
sched_group_nodes_bycpu[first_cpu(*cpu_map)] = sched_group_nodes;
#endif
+ rd = alloc_rootdomain(cpu_map);
+ if (!rd) {
+ printk(KERN_WARNING "Cannot alloc root domain\n");
+ return -ENOMEM;
+ }
+
/*
* Set up domains for cpus specified by the cpu_map.
*/
@@ -6347,7 +6424,7 @@ static int build_sched_domains(const cpumask_t *cpu_map)
#else
sd = &per_cpu(phys_domains, i);
#endif
- cpu_attach_domain(sd, i);
+ cpu_attach_domain(sd, rd, i);
}
return 0;
@@ -6405,7 +6482,7 @@ static void detach_destroy_domains(const cpumask_t *cpu_map)
unregister_sched_domain_sysctl();
for_each_cpu_mask(i, *cpu_map)
- cpu_attach_domain(NULL, i);
+ cpu_attach_domain(NULL, &def_root_domain, i);
synchronize_sched();
arch_destroy_sched_domains(cpu_map);
}
@@ -6641,6 +6718,10 @@ void __init sched_init(void)
int highest_cpu = 0;
int i, j;
+#ifdef CONFIG_SMP
+ init_defrootdomain();
+#endif
+
for_each_possible_cpu(i) {
struct rt_prio_array *array;
struct rq *rq;
@@ -6680,6 +6761,8 @@ void __init sched_init(void)
rq->cpu_load[j] = 0;
#ifdef CONFIG_SMP
rq->sd = NULL;
+ rq->rd = NULL;
+ rq_attach_root(rq, &def_root_domain);
rq->active_balance = 0;
rq->next_balance = jiffies;
rq->push_cpu = 0;
* [PATCH 3/4] RT: Only balance our RT tasks within our root-domain
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
2007-11-21 4:44 ` [PATCH 1/4] Fix optimized search Gregory Haskins
2007-11-21 4:44 ` [PATCH 2/4] RT: Add sched-domain roots Gregory Haskins
@ 2007-11-21 4:44 ` Gregory Haskins
2007-11-21 4:44 ` [PATCH 4/4] RT: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
2007-11-21 19:51 ` [PATCH 0/4] more RT balancing enhancements v6a Gregory Haskins
4 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 4:44 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
We move the rt-overload data, the first of the globals, to per-domain
scope. This limits overload-related cache-line bouncing to a specified
partition instead of affecting all cpus in the system.
Finally, we limit the scope of find_lowest_cpus() searches to the domain
instead of the entire system. Note that we would always respect domain
boundaries even without this patch, but we would first scan potentially
all cpus before whittling the list down. Now we can avoid looking at
RQs that are out of scope, again reducing cache-line bouncing.
Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain. However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution. If it can be optimized later, so be it.
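(A hedged, user-space sketch of the rescoping: the same set/clear pair as
before, but the mask and count now hang off the runqueue's root-domain, so
only cpus sharing that domain ever touch them.  Names are made up; the
actual change is in the diff below.)

#include <stdio.h>

struct root_domain {
        unsigned long rto_mask; /* cpumask_t rto_mask in the kernel */
        int rto_count;          /* atomic_t rto_count in the kernel */
};

struct rq {
        int cpu;
        struct root_domain *rd;
};

/* Before this patch these touched file-scope globals shared by every cpu
 * in the system; now the cache-line traffic stays inside rq->rd. */
static void rt_set_overload(struct rq *rq)
{
        rq->rd->rto_mask |= 1UL << rq->cpu;
        rq->rd->rto_count++;    /* the kernel orders these with wmb() */
}

static void rt_clear_overload(struct rq *rq)
{
        rq->rd->rto_count--;
        rq->rd->rto_mask &= ~(1UL << rq->cpu);
}

int main(void)
{
        struct root_domain rd = { 0 };
        struct rq rq = { .cpu = 2, .rd = &rd };

        rt_set_overload(&rq);
        printf("rto_mask=0x%lx rto_count=%d\n", rd.rto_mask, rd.rto_count);
        rt_clear_overload(&rq);
        return 0;
}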
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---
kernel/sched.c | 2 ++
kernel/sched_rt.c | 65 +++++++++++++++++++++++++++++++++++------------------
2 files changed, 45 insertions(+), 22 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index fb619fb..578c186 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -281,6 +281,8 @@ struct rt_rq {
struct root_domain {
atomic_t refcount;
cpumask_t span;
+ cpumask_t rto_mask;
+ atomic_t rto_count;
};
struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index fbf4fb1..3495762 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -4,20 +4,18 @@
*/
#ifdef CONFIG_SMP
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-static inline int rt_overloaded(void)
+
+static inline int rt_overloaded(struct rq *rq)
{
- return atomic_read(&rto_count);
+ return atomic_read(&rq->rd->rto_count);
}
-static inline cpumask_t *rt_overload(void)
+static inline cpumask_t *rt_overload(struct rq *rq)
{
- return &rt_overload_mask;
+ return &rq->rd->rto_mask;
}
static inline void rt_set_overload(struct rq *rq)
{
- rq->rt.overloaded = 1;
- cpu_set(rq->cpu, rt_overload_mask);
+ cpu_set(rq->cpu, rq->rd->rto_mask);
/*
* Make sure the mask is visible before we set
* the overload count. That is checked to determine
@@ -26,22 +24,25 @@ static inline void rt_set_overload(struct rq *rq)
* updated yet.
*/
wmb();
- atomic_inc(&rto_count);
+ atomic_inc(&rq->rd->rto_count);
}
static inline void rt_clear_overload(struct rq *rq)
{
/* the order here really doesn't matter */
- atomic_dec(&rto_count);
- cpu_clear(rq->cpu, rt_overload_mask);
- rq->rt.overloaded = 0;
+ atomic_dec(&rq->rd->rto_count);
+ cpu_clear(rq->cpu, rq->rd->rto_mask);
}
static void update_rt_migration(struct rq *rq)
{
- if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+ if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
rt_set_overload(rq);
- else
+ rq->rt.overloaded = 1;
+ } else {
rt_clear_overload(rq);
+ rq->rt.overloaded = 0;
+ }
+
}
#endif /* CONFIG_SMP */
@@ -304,6 +305,15 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
/*
+ * Yes, I know doing two cpus_and() kind of sucks, especially on
+ * those very large SMP systems. We are going for correctness now,
+ * optimization later. A later patch will add rd->online to cache
+ * the subset of rd->span that are online and then we can collapse
+ * these two mask operations into one
+ */
+ cpus_and(*lowest_mask, *lowest_mask, task_rq(task)->rd->span);
+
+ /*
* Scan each rq for the lowest prio.
*/
for_each_cpu_mask(cpu, *lowest_mask) {
@@ -584,18 +594,12 @@ static int pull_rt_task(struct rq *this_rq)
assert_spin_locked(&this_rq->lock);
- /*
- * If cpusets are used, and we have overlapping
- * run queue cpusets, then this algorithm may not catch all.
- * This is just the price you pay on trying to keep
- * dirtying caches down on large SMP machines.
- */
- if (likely(!rt_overloaded()))
+ if (likely(!rt_overloaded(this_rq)))
return 0;
next = pick_next_task_rt(this_rq);
- rto_cpumask = rt_overload();
+ rto_cpumask = rt_overload(this_rq);
for_each_cpu_mask(cpu, *rto_cpumask) {
if (this_cpu == cpu)
@@ -814,6 +818,20 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
}
}
+/* Assumes rq->lock is held */
+static void join_domain_rt(struct rq *rq)
+{
+ if (rq->rt.overloaded)
+ rt_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void leave_domain_rt(struct rq *rq)
+{
+ if (rq->rt.overloaded)
+ rt_clear_overload(rq);
+}
+
static void set_curr_task_rt(struct rq *rq)
{
struct task_struct *p = rq->curr;
@@ -843,4 +861,7 @@ const struct sched_class rt_sched_class = {
.set_curr_task = set_curr_task_rt,
.task_tick = task_tick_rt,
+
+ .join_domain = join_domain_rt,
+ .leave_domain = leave_domain_rt,
};
* [PATCH 4/4] RT: Use a 2-d bitmap for searching lowest-pri CPU
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
` (2 preceding siblings ...)
2007-11-21 4:44 ` [PATCH 3/4] RT: Only balance our RT tasks within our root-domain Gregory Haskins
@ 2007-11-21 4:44 ` Gregory Haskins
2007-11-21 19:51 ` [PATCH 0/4] more RT balancing enhancements v6a Gregory Haskins
4 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 4:44 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
The current code uses a linear algorithm, which causes scaling issues
on larger SMP machines. This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.
In the past, this patch maintained the priorities as a global
property. It now has per-root-domain scope to shield unrelated
domains from unnecessary cache collisions.
You may ask yourself why we add logic to find_lowest_cpus() earlier
in the series only to rip it out in this patch. The answer is that
this patch has been controversial, and is not likely to be
accepted (at least in the short term). Therefore, it is in our best
interest to optimize as much of the code prior to this patch as possible,
even if we ultimately rip it out here. That way, the system is still tuned
even if this patch goes to /dev/null.
You may now be asking yourself why we bother including this patch at all.
The answer is that our own independent testing
(http://article.gmane.org/gmane.linux.rt.user/1889) shows this patch makes a
significant performance improvement (at least for rt latencies) and
doesn't appear to cause any regressions in other dimensions. However,
I also understand the reservations raised by Steve Rostedt et al.
Therefore, I include this patch in the hope that it is useful to
someone, but with the understanding that it is not likely to be accepted
without further demonstration of its benefits.
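(For readers who want the shape of the data structure without wading
through the diff: a condensed, hypothetical user-space model of the 2-d
lookup.  The outer bitmap records which priority levels currently have
cpus, each level carries a mask of those cpus, and finding a target is
"first populated level below mine, ANDed with my affinity".  The model
caps the level count at one word; the real kernel/sched_cpupri.c below
uses 102 levels spread over several words.)

#include <stdio.h>

#define NR_LEVELS       64      /* model only; the kernel uses 102 */

struct cpupri {
        unsigned long pri_active;               /* bit p set: some cpu sits at level p */
        unsigned long pri_to_cpu[NR_LEVELS];    /* cpus currently at level p           */
};

/* Return a mask of cpus running at a lower level than task_pri that also
 * intersects the task's affinity, or 0 if none exists. */
static unsigned long cpupri_find(const struct cpupri *cp, int task_pri,
                                 unsigned long affinity)
{
        int p;

        for (p = 0; p < task_pri; p++) {
                unsigned long cand;

                if (!(cp->pri_active & (1UL << p)))
                        continue;               /* level is empty */

                cand = cp->pri_to_cpu[p] & affinity;
                if (cand)
                        return cand;            /* lowest populated level wins */
        }
        return 0;
}

int main(void)
{
        struct cpupri cp = { 0 };
        int task_pri = 60;                      /* some mid rt level in the model   */

        cp.pri_active = 1UL << 1;               /* level 1 == "NORMAL" in the model */
        cp.pri_to_cpu[1] = 0x5;                 /* cpus 0 and 2 run non-rt work     */

        printf("candidate cpus: 0x%lx\n", cpupri_find(&cp, task_pri, 0xff));
        return 0;
}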
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---
kernel/Makefile | 1
kernel/sched.c | 4 +
kernel/sched_cpupri.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_cpupri.h | 35 ++++++++++
kernel/sched_rt.c | 104 +++++--------------------------
5 files changed, 223 insertions(+), 87 deletions(-)
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..c013a6c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_SMP) += sched_cpupri.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 578c186..d6be7e6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>
+#include "sched_cpupri.h"
+
/*
* Scheduler clock - returns current time in nanosec units.
* This is default implementation.
@@ -283,6 +285,7 @@ struct root_domain {
cpumask_t span;
cpumask_t rto_mask;
atomic_t rto_count;
+ struct cpupri cpupri;
};
struct root_domain def_root_domain;
@@ -5765,6 +5768,7 @@ static void init_rootdomain(struct root_domain *rd, cpumask_t *map)
memset(rd, 0, sizeof(*rd));
rd->span = *map;
+ cpupri_init(&rd->cpupri);
}
static void init_defrootdomain(void)
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 0000000..e30d33f
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,166 @@
+/*
+ * kernel/sched_cpupri.c
+ *
+ * CPU priority management
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Author: Gregory Haskins <ghaskins@novell.com>
+ *
+ * This code tracks the priority of each CPU so that global migration
+ * decisions are easy to calculate. Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ * going from the lowest priority to the highest. CPUs in the INVALID state
+ * are not eligible for routing. The system maintains this state with
+ * a 2 dimensional bitmap (the first for priority class, the second for cpus
+ * in that class). Therefore a typical application without affinity
+ * restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ * searches). For tasks with affinity restrictions, the algorithm has a
+ * worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ * yields the worst case search is fairly contrived.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include "sched_cpupri.h"
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+ int cpupri;
+
+ if (prio == MAX_PRIO)
+ cpupri = CPUPRI_IDLE;
+ else if (prio >= MAX_RT_PRIO)
+ cpupri = CPUPRI_NORMAL;
+ else
+ cpupri = MAX_RT_PRIO - prio + 1;
+
+ return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx) \
+ for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES); \
+ idx < CPUPRI_NR_PRIORITIES; \
+ idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cp: The cpupri context
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invokation. By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times. While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)bool - CPUs were found
+ */
+int cpupri_find(struct cpupri *cp, struct task_struct *p,
+ cpumask_t *lowest_mask)
+{
+ int idx = 0;
+ int task_pri = convert_prio(p->prio);
+
+ for_each_cpupri_active(cp->pri_active, idx) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[idx];
+ cpumask_t mask;
+
+ if (idx >= task_pri)
+ break;
+
+ cpus_and(mask, p->cpus_allowed, vec->mask);
+
+ if (cpus_empty(mask))
+ continue;
+
+ *lowest_mask = mask;
+ return 1;
+ }
+
+ return 0;
+}
+
+/**
+ * cpupri_set - update the cpu priority setting
+ * @cp: The cpupri context
+ * @cpu: The target cpu
+ * @pri: The priority (INVALID-RT99) to assign to this CPU
+ *
+ * Note: Assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpupri_set(struct cpupri *cp, int cpu, int newpri)
+{
+ int *currpri = &cp->cpu_to_pri[cpu];
+ int oldpri = *currpri;
+ unsigned long flags;
+
+ newpri = convert_prio(newpri);
+
+ if (newpri == oldpri)
+ return;
+
+ /*
+ * If the cpu was currently mapped to a different value, we
+ * first need to unmap the old value
+ */
+ if (likely(oldpri != CPUPRI_INVALID)) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];
+
+ spin_lock_irqsave(&vec->lock, flags);
+
+ cpu_clear(cpu, vec->mask);
+ if (cpus_empty(vec->mask))
+ clear_bit(oldpri, cp->pri_active);
+
+ spin_unlock_irqrestore(&vec->lock, flags);
+ }
+
+ if (likely(newpri != CPUPRI_INVALID)) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
+
+ spin_lock_irqsave(&vec->lock, flags);
+
+ cpu_set(cpu, vec->mask);
+ set_bit(newpri, cp->pri_active);
+
+ spin_unlock_irqrestore(&vec->lock, flags);
+ }
+
+ *currpri = newpri;
+}
+
+/**
+ * cpupri_init - initialize the cpupri structure
+ * @cp: The cpupri context
+ *
+ * Returns: (void)
+ */
+void cpupri_init(struct cpupri *cp)
+{
+ int i;
+
+ memset(cp, 0, sizeof(*cp));
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ spin_lock_init(&vec->lock);
+ cpus_clear(vec->mask);
+ }
+
+ for_each_possible_cpu(i)
+ cp->cpu_to_pri[i] = CPUPRI_INVALID;
+}
+
+
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
new file mode 100644
index 0000000..1392c93
--- /dev/null
+++ b/kernel/sched_cpupri.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_CPUPRI_H
+#define _LINUX_CPUPRI_H
+
+#include <linux/sched.h>
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+#define CPUPRI_NR_PRI_WORDS CPUPRI_NR_PRIORITIES/BITS_PER_LONG
+
+#define CPUPRI_INVALID -1
+#define CPUPRI_IDLE 0
+#define CPUPRI_NORMAL 1
+/* values 2-101 are RT priorities 0-99 */
+
+struct cpupri_vec
+{
+ spinlock_t lock;
+ cpumask_t mask;
+};
+
+struct cpupri {
+ struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
+ long pri_active[CPUPRI_NR_PRI_WORDS];
+ int cpu_to_pri[NR_CPUS];
+};
+
+#ifdef CONFIG_SMP
+int cpupri_find(struct cpupri *cp,
+ struct task_struct *p, cpumask_t *lowest_mask);
+void cpupri_set(struct cpupri *cp, int cpu, int pri);
+void cpupri_init(struct cpupri *cp);
+#else
+#define cpupri_init() do { } while(0)
+#endif
+
+#endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 3495762..41dcbb4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -73,8 +73,10 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
WARN_ON(!rt_task(p));
rq->rt.rt_nr_running++;
#ifdef CONFIG_SMP
- if (p->prio < rq->rt.highest_prio)
+ if (p->prio < rq->rt.highest_prio) {
rq->rt.highest_prio = p->prio;
+ cpupri_set(&rq->rd->cpupri, rq->cpu, p->prio);
+ }
if (p->nr_cpus_allowed > 1)
rq->rt.rt_nr_migratory++;
@@ -84,6 +86,8 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
{
+ int highest_prio = rq->rt.highest_prio;
+
WARN_ON(!rt_task(p));
WARN_ON(!rq->rt.rt_nr_running);
rq->rt.rt_nr_running--;
@@ -103,6 +107,9 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
if (p->nr_cpus_allowed > 1)
rq->rt.rt_nr_migratory--;
+ if (rq->rt.highest_prio != highest_prio)
+ cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio);
+
update_rt_migration(rq);
#endif /* CONFIG_SMP */
}
@@ -295,82 +302,6 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
-{
- int lowest_prio = -1;
- int lowest_cpu = -1;
- int count = 0;
- int cpu;
-
- cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
-
- /*
- * Yes, I know doing two cpus_and() kind of sucks, especially on
- * those very large SMP systems. We are going for correctness now,
- * optimization later. A later patch will add rd->online to cache
- * the subset of rd->span that are online and then we can collapse
- * these two mask operations into one
- */
- cpus_and(*lowest_mask, *lowest_mask, task_rq(task)->rd->span);
-
- /*
- * Scan each rq for the lowest prio.
- */
- for_each_cpu_mask(cpu, *lowest_mask) {
- struct rq *rq = cpu_rq(cpu);
-
- /* We look for lowest RT prio or non-rt CPU */
- if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- /*
- * if we already found a low RT queue
- * and now we found this non-rt queue
- * clear the mask and set our bit.
- * Otherwise just return the queue as is
- * and the count==1 will cause the algorithm
- * to use the first bit found.
- */
- if (lowest_cpu != -1) {
- cpus_clear(*lowest_mask);
- cpu_set(rq->cpu, *lowest_mask);
- }
- return 1;
- }
-
- /* no locking for now */
- if ((rq->rt.highest_prio > task->prio)
- && (rq->rt.highest_prio >= lowest_prio)) {
- if (rq->rt.highest_prio > lowest_prio) {
- /* new low - clear old data */
- lowest_prio = rq->rt.highest_prio;
- lowest_cpu = cpu;
- count = 0;
- }
- count++;
- } else
- cpu_clear(cpu, *lowest_mask);
- }
-
- /*
- * Clear out all the set bits that represent
- * runqueues that were of higher prio than
- * the lowest_prio.
- */
- if (lowest_cpu != -1) {
- /*
- * Perhaps we could add another cpumask op to
- * zero out bits. Like cpu_zero_bits(cpumask, nrbits);
- * Then that could be optimized to use memset and such.
- */
- for_each_cpu_mask(cpu, *lowest_mask) {
- if (cpu >= lowest_cpu)
- break;
- cpu_clear(cpu, *lowest_mask);
- }
- }
-
- return count;
-}
-
static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
{
int first;
@@ -392,18 +323,13 @@ static int find_lowest_rq(struct task_struct *task)
cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);
- int count = find_lowest_cpus(task, lowest_mask);
+
+ if (task->nr_cpus_allowed == 1)
+ return -1; /* No other targets possible */
- if (!count)
+ if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
return -1; /* No targets found */
-
- /*
- * There is no sense in performing an optimal search if only one
- * target is found.
- */
- if (count == 1)
- return first_cpu(*lowest_mask);
-
+
/*
* At this point we have built a mask of cpus representing the
* lowest priority tasks in the system. Now we want to elect
@@ -823,6 +749,8 @@ static void join_domain_rt(struct rq *rq)
{
if (rq->rt.overloaded)
rt_set_overload(rq);
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio);
}
/* Assumes rq->lock is held */
@@ -830,6 +758,8 @@ static void leave_domain_rt(struct rq *rq)
{
if (rq->rt.overloaded)
rt_clear_overload(rq);
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
}
static void set_curr_task_rt(struct rq *rq)
* Re: [PATCH] Fix optimized search
2007-11-21 4:26 ` Steven Rostedt
@ 2007-11-21 5:14 ` Gregory Haskins
0 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 5:14 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, Ingo Molnar, Christoph Lameter, LKML
>>> On Tue, Nov 20, 2007 at 11:26 PM, in message
<20071121042635.GB24815@goodmis.org>, Steven Rostedt <rostedt@goodmis.org>
wrote:
> On Tue, Nov 20, 2007 at 11:15:48PM -0500, Steven Rostedt wrote:
>> Gregory Haskins wrote:
>>> I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.
>>> Patch is below..
>>
>> Thanks, but I have one update...
>>
>
> Here's the updated patch.
>
> Oh, and Gregory, please email me at my rostedt@goodmis.org account. It
> has better filters ;-)
>
> This series is at:
>
> http://rostedt.homelinux.com/rt/rt-balance-patches-v6.tar.bz2
Ah..mails crossed. ;) Ignore my patch #1 from the 0/4 series I just sent out.
Regards,
-Greg
* [PATCH 0/4] more RT balancing enhancements v6a
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
` (3 preceding siblings ...)
2007-11-21 4:44 ` [PATCH 4/4] RT: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
@ 2007-11-21 19:51 ` Gregory Haskins
2007-11-21 19:52 ` [PATCH 1/4] SCHED: Add sched-domain roots Gregory Haskins
` (3 more replies)
4 siblings, 4 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 19:51 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
These patches apply to the end of the rt-balance-patches v6 announced here:
http://lkml.org/lkml/2007/11/20/613
Changes since v1:
*) Rebased from v4 to v6
*) original patch #1 was folded into v6
*) added support for caching online cpus within the root-domain
Comments welcome!
Regards,
-Greg
* [PATCH 1/4] SCHED: Add sched-domain roots
2007-11-21 19:51 ` [PATCH 0/4] more RT balancing enhancements v6a Gregory Haskins
@ 2007-11-21 19:52 ` Gregory Haskins
2007-11-21 19:52 ` [PATCH 2/4] SCHED: Track online cpus in the root-domain Gregory Haskins
` (2 subsequent siblings)
3 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 19:52 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables. Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset. However, we currently still maintain some
policy/state as global variables which transcend all cpusets. Consider,
for instance, rt-overload state.
Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).
We add some plumbing for storing class specific data in our root-domain.
Whenever an RQ switches root-domains (because of repartitioning), we give
each sched_class the opportunity to remove any state from its old domain
and add state to the new one. This logic doesn't have any clients yet,
but it will later in the series.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
CC: Paul Jackson <pj@sgi.com>
CC: Simon Derr <simon.derr@bull.net>
---
include/linux/sched.h | 3 ++
kernel/sched.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 89 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 19d9f3f..0ba221e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -845,6 +845,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+ void (*join_domain)(struct rq *rq);
+ void (*leave_domain)(struct rq *rq);
};
struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index e685402..e5c902c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -276,6 +276,17 @@ struct rt_rq {
int overloaded;
};
+#ifdef CONFIG_SMP
+
+struct root_domain {
+ atomic_t refcount;
+ cpumask_t span;
+};
+
+static struct root_domain def_root_domain;
+
+#endif
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -333,6 +344,7 @@ struct rq {
atomic_t nr_iowait;
#ifdef CONFIG_SMP
+ struct root_domain *rd;
struct sched_domain *sd;
/* For active balancing */
@@ -5718,11 +5730,68 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
return 1;
}
+static void rq_attach_root(struct rq *rq, struct root_domain *rd)
+{
+ unsigned long flags;
+ const struct sched_class *class;
+
+ spin_lock_irqsave(&rq->lock, flags);
+
+ if (rq->rd) {
+ struct root_domain *old_rd = rq->rd;
+
+ for (class = sched_class_highest; class; class = class->next)
+ if (class->leave_domain)
+ class->leave_domain(rq);
+
+ if (atomic_dec_and_test(&old_rd->refcount))
+ kfree(old_rd);
+ }
+
+ atomic_inc(&rd->refcount);
+ rq->rd = rd;
+
+ for (class = sched_class_highest; class; class = class->next)
+ if (class->join_domain)
+ class->join_domain(rq);
+
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+static void init_rootdomain(struct root_domain *rd, cpumask_t *map)
+{
+ memset(rd, 0, sizeof(*rd));
+
+ rd->span = *map;
+}
+
+static void init_defrootdomain(void)
+{
+ cpumask_t cpus = CPU_MASK_ALL;
+
+ init_rootdomain(&def_root_domain, &cpus);
+ atomic_set(&def_root_domain.refcount, 1);
+}
+
+static struct root_domain *alloc_rootdomain(cpumask_t *map)
+{
+ struct root_domain *rd;
+
+ rd = kmalloc(sizeof(*rd), GFP_KERNEL);
+ if (!rd)
+ return NULL;
+
+ init_rootdomain(rd, map);
+
+ return rd;
+}
+
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
*/
-static void cpu_attach_domain(struct sched_domain *sd, int cpu)
+static void cpu_attach_domain(struct sched_domain *sd,
+ struct root_domain *rd, int cpu)
{
struct rq *rq = cpu_rq(cpu);
struct sched_domain *tmp;
@@ -5747,6 +5816,7 @@ static void cpu_attach_domain(struct sched_domain *sd, int cpu)
sched_domain_debug(sd, cpu);
+ rq_attach_root(rq, rd);
rcu_assign_pointer(rq->sd, sd);
}
@@ -6115,6 +6185,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
static int build_sched_domains(const cpumask_t *cpu_map)
{
int i;
+ struct root_domain *rd;
#ifdef CONFIG_NUMA
struct sched_group **sched_group_nodes = NULL;
int sd_allnodes = 0;
@@ -6131,6 +6202,12 @@ static int build_sched_domains(const cpumask_t *cpu_map)
sched_group_nodes_bycpu[first_cpu(*cpu_map)] = sched_group_nodes;
#endif
+ rd = alloc_rootdomain(cpu_map);
+ if (!rd) {
+ printk(KERN_WARNING "Cannot alloc root domain\n");
+ return -ENOMEM;
+ }
+
/*
* Set up domains for cpus specified by the cpu_map.
*/
@@ -6347,7 +6424,7 @@ static int build_sched_domains(const cpumask_t *cpu_map)
#else
sd = &per_cpu(phys_domains, i);
#endif
- cpu_attach_domain(sd, i);
+ cpu_attach_domain(sd, rd, i);
}
return 0;
@@ -6405,7 +6482,7 @@ static void detach_destroy_domains(const cpumask_t *cpu_map)
unregister_sched_domain_sysctl();
for_each_cpu_mask(i, *cpu_map)
- cpu_attach_domain(NULL, i);
+ cpu_attach_domain(NULL, &def_root_domain, i);
synchronize_sched();
arch_destroy_sched_domains(cpu_map);
}
@@ -6641,6 +6718,10 @@ void __init sched_init(void)
int highest_cpu = 0;
int i, j;
+#ifdef CONFIG_SMP
+ init_defrootdomain();
+#endif
+
for_each_possible_cpu(i) {
struct rt_prio_array *array;
struct rq *rq;
@@ -6680,6 +6761,8 @@ void __init sched_init(void)
rq->cpu_load[j] = 0;
#ifdef CONFIG_SMP
rq->sd = NULL;
+ rq->rd = NULL;
+ rq_attach_root(rq, &def_root_domain);
rq->active_balance = 0;
rq->next_balance = jiffies;
rq->push_cpu = 0;
* [PATCH 2/4] SCHED: Track online cpus in the root-domain
2007-11-21 19:51 ` [PATCH 0/4] more RT balancing enhancements v6a Gregory Haskins
2007-11-21 19:52 ` [PATCH 1/4] SCHED: Add sched-domain roots Gregory Haskins
@ 2007-11-21 19:52 ` Gregory Haskins
2007-11-21 19:52 ` [PATCH 3/4] SCHED: Only balance our RT tasks within our root-domain Gregory Haskins
2007-11-21 19:52 ` [PATCH 4/4] SCHED: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
3 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 19:52 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
We cache the subset of rd->span cpus that are online in rd->online to
reduce the need for runtime computation.
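(Sketch only, with made-up names: the cached value is nothing more than the
intersection below, kept up to date from the cpu-hotplug path -- the real
patch sets and clears individual bits from the notifier rather than
recomputing the mask.)

#include <stdio.h>

struct root_domain {
        unsigned long span;     /* cpus belonging to this domain        */
        unsigned long online;   /* cached: span & currently-online cpus */
};

/* Refreshed when a cpu in the span comes up or goes down, so callers such
 * as find_lowest_cpus() can use rd->online directly instead of
 * re-deriving it on every wakeup. */
static void rd_update_online(struct root_domain *rd, unsigned long online_map)
{
        rd->online = rd->span & online_map;
}

int main(void)
{
        struct root_domain rd = { .span = 0xf, .online = 0 };

        rd_update_online(&rd, 0x7);     /* cpu 3 not yet online */
        printf("online within domain: 0x%lx\n", rd.online);
        return 0;
}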
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched.c | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index e5c902c..c58d1bf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -281,6 +281,7 @@ struct rt_rq {
struct root_domain {
atomic_t refcount;
cpumask_t span;
+ cpumask_t online;
};
static struct root_domain def_root_domain;
@@ -5491,6 +5492,15 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
case CPU_ONLINE_FROZEN:
/* Strictly unnecessary, as first user will wake it. */
wake_up_process(cpu_rq(cpu)->migration_thread);
+
+ /* Update our root-domain */
+ rq = cpu_rq(cpu);
+ spin_lock_irqsave(&rq->lock, flags);
+ if (rq->rd) {
+ BUG_ON(!cpu_isset(cpu, rq->rd->span));
+ cpu_set(cpu, rq->rd->online);
+ }
+ spin_unlock_irqrestore(&rq->lock, flags);
break;
#ifdef CONFIG_HOTPLUG_CPU
@@ -5539,6 +5549,17 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
}
spin_unlock_irq(&rq->lock);
break;
+
+ case CPU_DOWN_PREPARE:
+ /* Update our root-domain */
+ rq = cpu_rq(cpu);
+ spin_lock_irqsave(&rq->lock, flags);
+ if (rq->rd) {
+ BUG_ON(!cpu_isset(cpu, rq->rd->span));
+ cpu_clear(cpu, rq->rd->online);
+ }
+ spin_unlock_irqrestore(&rq->lock, flags);
+ break;
#endif
case CPU_LOCK_RELEASE:
mutex_unlock(&sched_hotcpu_mutex);
@@ -5763,6 +5784,7 @@ static void init_rootdomain(struct root_domain *rd, cpumask_t *map)
memset(rd, 0, sizeof(*rd));
rd->span = *map;
+ cpus_and(rd->online, rd->span, cpu_online_map);
}
static void init_defrootdomain(void)
* [PATCH 3/4] SCHED: Only balance our RT tasks within our root-domain
2007-11-21 19:51 ` [PATCH 0/4] more RT balancing enhancements v6a Gregory Haskins
2007-11-21 19:52 ` [PATCH 1/4] SCHED: Add sched-domain roots Gregory Haskins
2007-11-21 19:52 ` [PATCH 2/4] SCHED: Track online cpus in the root-domain Gregory Haskins
@ 2007-11-21 19:52 ` Gregory Haskins
2007-11-21 19:52 ` [PATCH 4/4] SCHED: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
3 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 19:52 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
We move the rt-overload data, the first of the globals, to per-domain
scope. This limits overload-related cache-line bouncing to a specified
partition instead of affecting all cpus in the system.
Finally, we limit the scope of find_lowest_cpus() searches to the domain
instead of the entire system. Note that we would always respect domain
boundaries even without this patch, but we would first scan potentially
all cpus before whittling the list down. Now we can avoid looking at
RQs that are out of scope, again reducing cache-line bouncing.
Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain. However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution. If it can be optimized later, so be it.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---
kernel/sched.c | 2 ++
kernel/sched_rt.c | 58 ++++++++++++++++++++++++++++++++---------------------
2 files changed, 37 insertions(+), 23 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index c58d1bf..55a2a59 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -282,6 +282,8 @@ struct root_domain {
atomic_t refcount;
cpumask_t span;
cpumask_t online;
+ cpumask_t rto_mask;
+ atomic_t rto_count;
};
static struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0b3a0f2..6a5ac33 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -4,20 +4,18 @@
*/
#ifdef CONFIG_SMP
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-static inline int rt_overloaded(void)
+
+static inline int rt_overloaded(struct rq *rq)
{
- return atomic_read(&rto_count);
+ return atomic_read(&rq->rd->rto_count);
}
-static inline cpumask_t *rt_overload(void)
+static inline cpumask_t *rt_overload(struct rq *rq)
{
- return &rt_overload_mask;
+ return &rq->rd->rto_mask;
}
static inline void rt_set_overload(struct rq *rq)
{
- rq->rt.overloaded = 1;
- cpu_set(rq->cpu, rt_overload_mask);
+ cpu_set(rq->cpu, rq->rd->rto_mask);
/*
* Make sure the mask is visible before we set
* the overload count. That is checked to determine
@@ -26,22 +24,25 @@ static inline void rt_set_overload(struct rq *rq)
* updated yet.
*/
wmb();
- atomic_inc(&rto_count);
+ atomic_inc(&rq->rd->rto_count);
}
static inline void rt_clear_overload(struct rq *rq)
{
/* the order here really doesn't matter */
- atomic_dec(&rto_count);
- cpu_clear(rq->cpu, rt_overload_mask);
- rq->rt.overloaded = 0;
+ atomic_dec(&rq->rd->rto_count);
+ cpu_clear(rq->cpu, rq->rd->rto_mask);
}
static void update_rt_migration(struct rq *rq)
{
- if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+ if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
rt_set_overload(rq);
- else
+ rq->rt.overloaded = 1;
+ } else {
rt_clear_overload(rq);
+ rq->rt.overloaded = 0;
+ }
+
}
#endif /* CONFIG_SMP */
@@ -301,7 +302,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
int count = 0;
int cpu;
- cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
+ cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
/*
* Scan each rq for the lowest prio.
@@ -584,18 +585,12 @@ static int pull_rt_task(struct rq *this_rq)
assert_spin_locked(&this_rq->lock);
- /*
- * If cpusets are used, and we have overlapping
- * run queue cpusets, then this algorithm may not catch all.
- * This is just the price you pay on trying to keep
- * dirtying caches down on large SMP machines.
- */
- if (likely(!rt_overloaded()))
+ if (likely(!rt_overloaded(this_rq)))
return 0;
next = pick_next_task_rt(this_rq);
- rto_cpumask = rt_overload();
+ rto_cpumask = rt_overload(this_rq);
for_each_cpu_mask(cpu, *rto_cpumask) {
if (this_cpu == cpu)
@@ -814,6 +809,20 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
}
}
+/* Assumes rq->lock is held */
+static void join_domain_rt(struct rq *rq)
+{
+ if (rq->rt.overloaded)
+ rt_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void leave_domain_rt(struct rq *rq)
+{
+ if (rq->rt.overloaded)
+ rt_clear_overload(rq);
+}
+
static void set_curr_task_rt(struct rq *rq)
{
struct task_struct *p = rq->curr;
@@ -843,4 +852,7 @@ const struct sched_class rt_sched_class = {
.set_curr_task = set_curr_task_rt,
.task_tick = task_tick_rt,
+
+ .join_domain = join_domain_rt,
+ .leave_domain = leave_domain_rt,
};
* [PATCH 4/4] SCHED: Use a 2-d bitmap for searching lowest-pri CPU
2007-11-21 19:51 ` [PATCH 0/4] more RT balancing enhancements v6a Gregory Haskins
` (2 preceding siblings ...)
2007-11-21 19:52 ` [PATCH 3/4] SCHED: Only balance our RT tasks within our root-domain Gregory Haskins
@ 2007-11-21 19:52 ` Gregory Haskins
3 siblings, 0 replies; 36+ messages in thread
From: Gregory Haskins @ 2007-11-21 19:52 UTC (permalink / raw)
To: LKML, Steven Rostedt
Cc: Christoph Lameter, Peter Zijlstra, Ingo Molnar, Gregory Haskins
The current code uses a linear algorithm, which causes scaling issues
on larger SMP machines. This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.
In the past, this patch maintained the priorities as a global
property. It now has per-root-domain scope to shield unrelated
domains from unnecessary cache collisions.
You may ask yourself why we add logic to find_lowest_cpus() earlier
in the series only to rip it out in this patch. The answer is that
this patch has been controversial, and is not likely to be
accepted (at least in the short term). Therefore, it is in our best
interest to optimize as much of the code prior to this patch as possible,
even if we ultimately rip it out here. That way, the system is still tuned
even if this patch goes to /dev/null.
You may now be asking yourself why we bother including this patch at all.
The answer is that our own independent testing
(http://article.gmane.org/gmane.linux.rt.user/1889) shows this patch makes a
significant performance improvement (at least for rt latencies) and
doesn't appear to cause any regressions in other dimensions. However,
I also understand the reservations raised by Steve Rostedt et al.
Therefore, I include this patch in the hope that it is useful to
someone, but with the understanding that it is not likely to be accepted
without further demonstration of its benefits.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---
kernel/Makefile | 1
kernel/sched.c | 4 +
kernel/sched_cpupri.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_cpupri.h | 35 ++++++++++
kernel/sched_rt.c | 95 +++++-----------------------
5 files changed, 223 insertions(+), 78 deletions(-)
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..c013a6c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_SMP) += sched_cpupri.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 55a2a59..436db58 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>
+#include "sched_cpupri.h"
+
/*
* Scheduler clock - returns current time in nanosec units.
* This is default implementation.
@@ -284,6 +286,7 @@ struct root_domain {
cpumask_t online;
cpumask_t rto_mask;
atomic_t rto_count;
+ struct cpupri cpupri;
};
static struct root_domain def_root_domain;
@@ -5787,6 +5790,7 @@ static void init_rootdomain(struct root_domain *rd, cpumask_t *map)
rd->span = *map;
cpus_and(rd->online, rd->span, cpu_online_map);
+ cpupri_init(&rd->cpupri);
}
static void init_defrootdomain(void)
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 0000000..e30d33f
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,166 @@
+/*
+ * kernel/sched_cpupri.c
+ *
+ * CPU priority management
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Author: Gregory Haskins <ghaskins@novell.com>
+ *
+ * This code tracks the priority of each CPU so that global migration
+ * decisions are easy to calculate. Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ * going from the lowest priority to the highest. CPUs in the INVALID state
+ * are not eligible for routing. The system maintains this state with
+ * a 2 dimensional bitmap (the first for priority class, the second for cpus
+ * in that class). Therefore a typical application without affinity
+ * restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ * searches). For tasks with affinity restrictions, the algorithm has a
+ * worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ * yields the worst case search is fairly contrived.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include "sched_cpupri.h"
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+ int cpupri;
+
+ if (prio == MAX_PRIO)
+ cpupri = CPUPRI_IDLE;
+ else if (prio >= MAX_RT_PRIO)
+ cpupri = CPUPRI_NORMAL;
+ else
+ cpupri = MAX_RT_PRIO - prio + 1;
+
+ return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx) \
+ for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES); \
+ idx < CPUPRI_NR_PRIORITIES; \
+ idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cp: The cpupri context
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invokation. By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times. While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)bool - CPUs were found
+ */
+int cpupri_find(struct cpupri *cp, struct task_struct *p,
+ cpumask_t *lowest_mask)
+{
+ int idx = 0;
+ int task_pri = convert_prio(p->prio);
+
+ for_each_cpupri_active(cp->pri_active, idx) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[idx];
+ cpumask_t mask;
+
+ if (idx >= task_pri)
+ break;
+
+ cpus_and(mask, p->cpus_allowed, vec->mask);
+
+ if (cpus_empty(mask))
+ continue;
+
+ *lowest_mask = mask;
+ return 1;
+ }
+
+ return 0;
+}
+
+/**
+ * cpupri_set - update the cpu priority setting
+ * @cp: The cpupri context
+ * @cpu: The target cpu
+ * @pri: The priority (INVALID-RT99) to assign to this CPU
+ *
+ * Note: Assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpupri_set(struct cpupri *cp, int cpu, int newpri)
+{
+ int *currpri = &cp->cpu_to_pri[cpu];
+ int oldpri = *currpri;
+ unsigned long flags;
+
+ newpri = convert_prio(newpri);
+
+ if (newpri == oldpri)
+ return;
+
+ /*
+ * If the cpu was currently mapped to a different value, we
+ * first need to unmap the old value
+ */
+ if (likely(oldpri != CPUPRI_INVALID)) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];
+
+ spin_lock_irqsave(&vec->lock, flags);
+
+ cpu_clear(cpu, vec->mask);
+ if (cpus_empty(vec->mask))
+ clear_bit(oldpri, cp->pri_active);
+
+ spin_unlock_irqrestore(&vec->lock, flags);
+ }
+
+ if (likely(newpri != CPUPRI_INVALID)) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
+
+ spin_lock_irqsave(&vec->lock, flags);
+
+ cpu_set(cpu, vec->mask);
+ set_bit(newpri, cp->pri_active);
+
+ spin_unlock_irqrestore(&vec->lock, flags);
+ }
+
+ *currpri = newpri;
+}
+
+/**
+ * cpupri_init - initialize the cpupri structure
+ * @cp: The cpupri context
+ *
+ * Returns: (void)
+ */
+void cpupri_init(struct cpupri *cp)
+{
+ int i;
+
+ memset(cp, 0, sizeof(*cp));
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ spin_lock_init(&vec->lock);
+ cpus_clear(vec->mask);
+ }
+
+ for_each_possible_cpu(i)
+ cp->cpu_to_pri[i] = CPUPRI_INVALID;
+}
+
+
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
new file mode 100644
index 0000000..1392c93
--- /dev/null
+++ b/kernel/sched_cpupri.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_CPUPRI_H
+#define _LINUX_CPUPRI_H
+
+#include <linux/sched.h>
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+#define CPUPRI_NR_PRI_WORDS CPUPRI_NR_PRIORITIES/BITS_PER_LONG
+
+#define CPUPRI_INVALID -1
+#define CPUPRI_IDLE 0
+#define CPUPRI_NORMAL 1
+/* values 2-101 are RT priorities 0-99 */
+
+struct cpupri_vec
+{
+ spinlock_t lock;
+ cpumask_t mask;
+};
+
+struct cpupri {
+ struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
+ long pri_active[CPUPRI_NR_PRI_WORDS];
+ int cpu_to_pri[NR_CPUS];
+};
+
+#ifdef CONFIG_SMP
+int cpupri_find(struct cpupri *cp,
+ struct task_struct *p, cpumask_t *lowest_mask);
+void cpupri_set(struct cpupri *cp, int cpu, int pri);
+void cpupri_init(struct cpupri *cp);
+#else
+#define cpupri_init() do { } while(0)
+#endif
+
+#endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 6a5ac33..41dcbb4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -73,8 +73,10 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
WARN_ON(!rt_task(p));
rq->rt.rt_nr_running++;
#ifdef CONFIG_SMP
- if (p->prio < rq->rt.highest_prio)
+ if (p->prio < rq->rt.highest_prio) {
rq->rt.highest_prio = p->prio;
+ cpupri_set(&rq->rd->cpupri, rq->cpu, p->prio);
+ }
if (p->nr_cpus_allowed > 1)
rq->rt.rt_nr_migratory++;
@@ -84,6 +86,8 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
{
+ int highest_prio = rq->rt.highest_prio;
+
WARN_ON(!rt_task(p));
WARN_ON(!rq->rt.rt_nr_running);
rq->rt.rt_nr_running--;
@@ -103,6 +107,9 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
if (p->nr_cpus_allowed > 1)
rq->rt.rt_nr_migratory--;
+ if (rq->rt.highest_prio != highest_prio)
+ cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio);
+
update_rt_migration(rq);
#endif /* CONFIG_SMP */
}
@@ -295,73 +302,6 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
-{
- int lowest_prio = -1;
- int lowest_cpu = -1;
- int count = 0;
- int cpu;
-
- cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
-
- /*
- * Scan each rq for the lowest prio.
- */
- for_each_cpu_mask(cpu, *lowest_mask) {
- struct rq *rq = cpu_rq(cpu);
-
- /* We look for lowest RT prio or non-rt CPU */
- if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- /*
- * if we already found a low RT queue
- * and now we found this non-rt queue
- * clear the mask and set our bit.
- * Otherwise just return the queue as is
- * and the count==1 will cause the algorithm
- * to use the first bit found.
- */
- if (lowest_cpu != -1) {
- cpus_clear(*lowest_mask);
- cpu_set(rq->cpu, *lowest_mask);
- }
- return 1;
- }
-
- /* no locking for now */
- if ((rq->rt.highest_prio > task->prio)
- && (rq->rt.highest_prio >= lowest_prio)) {
- if (rq->rt.highest_prio > lowest_prio) {
- /* new low - clear old data */
- lowest_prio = rq->rt.highest_prio;
- lowest_cpu = cpu;
- count = 0;
- }
- count++;
- } else
- cpu_clear(cpu, *lowest_mask);
- }
-
- /*
- * Clear out all the set bits that represent
- * runqueues that were of higher prio than
- * the lowest_prio.
- */
- if (lowest_cpu > 0) {
- /*
- * Perhaps we could add another cpumask op to
- * zero out bits. Like cpu_zero_bits(cpumask, nrbits);
- * Then that could be optimized to use memset and such.
- */
- for_each_cpu_mask(cpu, *lowest_mask) {
- if (cpu >= lowest_cpu)
- break;
- cpu_clear(cpu, *lowest_mask);
- }
- }
-
- return count;
-}
-
static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
{
int first;
@@ -383,18 +323,13 @@ static int find_lowest_rq(struct task_struct *task)
cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);
- int count = find_lowest_cpus(task, lowest_mask);
+
+ if (task->nr_cpus_allowed == 1)
+ return -1; /* No other targets possible */
- if (!count)
+ if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
return -1; /* No targets found */
-
- /*
- * There is no sense in performing an optimal search if only one
- * target is found.
- */
- if (count == 1)
- return first_cpu(*lowest_mask);
-
+
/*
* At this point we have built a mask of cpus representing the
* lowest priority tasks in the system. Now we want to elect
@@ -814,6 +749,8 @@ static void join_domain_rt(struct rq *rq)
{
if (rq->rt.overloaded)
rt_set_overload(rq);
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio);
}
/* Assumes rq->lock is held */
@@ -821,6 +758,8 @@ static void leave_domain_rt(struct rq *rq)
{
if (rq->rt.overloaded)
rt_clear_overload(rq);
+
+ cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
}
static void set_curr_task_rt(struct rq *rq)
Thread overview: 36+ messages
2007-11-21 1:00 [PATCH v4 00/20] New RT Balancing version 4 Steven Rostedt
2007-11-21 1:00 ` [PATCH v4 01/20] Add rt_nr_running accounting Steven Rostedt
2007-11-21 1:00 ` [PATCH v4 02/20] track highest prio queued on runqueue Steven Rostedt
2007-11-21 1:00 ` [PATCH v4 03/20] push RT tasks Steven Rostedt
2007-11-21 1:00 ` [PATCH v4 04/20] RT overloaded runqueues accounting Steven Rostedt
2007-11-21 1:00 ` [PATCH v4 05/20] pull RT tasks Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 06/20] wake up balance RT Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 07/20] disable CFS RT load balancing Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 08/20] Cache cpus_allowed weight for optimizing migration Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 09/20] RT: Consistency cleanup for this_rq usage Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 10/20] RT: Remove some CFS specific code from the wakeup path of RT tasks Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 11/20] RT: Break out the search function Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 12/20] RT: Allow current_cpu to be included in search Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 13/20] RT: Pre-route RT tasks on wakeup Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 14/20] RT: Optimize our cpu selection based on topology Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 15/20] RT: Optimize rebalancing Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 16/20] Avoid overload Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 17/20] RT: restore the migratable conditional Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 18/20] Optimize cpu search with hamming weight Steven Rostedt
2007-11-21 1:01 ` [PATCH v4 19/20] Optimize out cpu_clears Steven Rostedt
2007-11-21 2:10 ` Steven Rostedt
2007-11-21 3:10 ` [PATCH] Fix optimized search Gregory Haskins
2007-11-21 4:15 ` Steven Rostedt
2007-11-21 4:26 ` Steven Rostedt
2007-11-21 5:14 ` Gregory Haskins
2007-11-21 1:01 ` [PATCH v4 20/20] balance RT tasks no new wake up Steven Rostedt
2007-11-21 4:44 ` [PATCH 0/4] more RT balancing enhancements Gregory Haskins
2007-11-21 4:44 ` [PATCH 1/4] Fix optimized search Gregory Haskins
2007-11-21 4:44 ` [PATCH 2/4] RT: Add sched-domain roots Gregory Haskins
2007-11-21 4:44 ` [PATCH 3/4] RT: Only balance our RT tasks within our root-domain Gregory Haskins
2007-11-21 4:44 ` [PATCH 4/4] RT: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
2007-11-21 19:51 ` [PATCH 0/4] more RT balancing enhancements v6a Gregory Haskins
2007-11-21 19:52 ` [PATCH 1/4] SCHED: Add sched-domain roots Gregory Haskins
2007-11-21 19:52 ` [PATCH 2/4] SCHED: Track online cpus in the root-domain Gregory Haskins
2007-11-21 19:52 ` [PATCH 3/4] SCHED: Only balance our RT tasks within our root-domain Gregory Haskins
2007-11-21 19:52 ` [PATCH 4/4] SCHED: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins