public inbox for linux-kernel@vger.kernel.org
* [patch 0/8] New RT Task Balancing
@ 2007-10-19 18:42 Steven Rostedt
  2007-10-19 18:42 ` [patch 1/8] Add rt_nr_running accounting Steven Rostedt
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:42 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra


Currently in mainline the balancing of multiple RT threads is quite broken.
That is to say that a high priority thread, when scheduled on a CPU that is
already running an even higher priority thread, may wait unnecessarily when
it could easily run on another CPU that's running a lower priority thread.

Balancing (or migrating) tasks in general is an art. Many considerations
must be taken into account: cache warmth, NUMA placement and more. That is
fine for general processes, which expect high throughput and whose migration
can be done in batches.  But when it comes to RT tasks, we really need to
move them to a CPU they can run on as soon as possible. Even if it means a
bit of cache line flushing.

Right now an RT task can wait several milliseconds, and perhaps even longer,
before it gets scheduled to run. The migration thread is simply not fast
enough to take care of RT tasks.

To demonstrate this, I wrote a simple test.
 
  http://rostedt.homelinux.com/rt/rt-migrate-test.c

  (gcc -o rt-migrate-test rt-migrate-test.c -lpthread)

This test takes a parameter specifying the number of threads to create.
The '-c' (check) option makes the test terminate as soon as any iteration
fails. When using it, pass in one more thread than the number of CPUs.

For example, on a 4 way box, I used

  rt-migrate-test -c 5

What this test does is create the number of threads specified (in this
case 5). Each thread is set as an RT FIFO task starting at a specified
prio (default 2), with each successive thread one priority higher. So in
this example the 5 threads created are at priorities 2, 3, 4, 5, and 6.

The parent thread sets its priority to one higher than the highest of
the children (7 in this example). It uses pthread_barrier_wait to synchronize
the threads.  Then it takes a time stamp and starts all the threads.
Each thread, when woken up, takes a time stamp and compares it to the
parent's to see how long it took to be awoken. It then runs in a busy loop
for an interval (20ms by default). The busy loop ends when the clock reaches
the interval delta from the start time stamp, so a thread that is preempted
may not actually run for the full interval. This is expected behavior
of the test.

The numbers recorded are: the delta between the thread's wakeup time stamp
and the parent's time stamp, the number of iterations the busy loop ran for,
and the delta between a time stamp taken at the end of the loop and the
parent's time stamp.

Sometimes a lower priority task might wake up before a higher priority one,
but this is OK, as long as the higher priority process gets the CPU when
it is awoken.

At the end of the test, the iteration data is printed to stdout. If a
higher priority task had to wait for a lower one to finish running, then
this is considered a failure. Here's an example of the output from
a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.

   1:       36      33   20041      39      33
 len:    20036   20033   40041   20039   20033
 loops: 167789  167693  227167  167829  167814

On iteration 1 (iterations start at 0) the third task did not start until
20ms after the parent woke it up. We can see here that the first two tasks
ran to completion before the higher priority task was even able to start.
That is a 20ms latency for the higher priority task!

So people who think that upping the priority of their audio threads will
rid them of most latencies may be in for a surprise, since even kernel
threads (like the migration thread itself) can cause this latency.

To solve this issue, I've changed the RT task balancing from a passive
method (migration thread) to an active method.  This new method is
to actively push or pull RT tasks when they are woken up or scheduled.

On wake up of a task, if it is an RT task and there is already an RT task
of higher priority running on its runqueue, we initiate the push_rt_tasks
algorithm. This algorithm looks at the highest priority non-running RT task
and tries to find a CPU it can run on. It only migrates the task if it
finds a CPU (the one running at the lowest priority) where the task can
run and can preempt the currently running task on that CPU.
We continue pushing RT tasks until we can't push any more.

If an RT task fails to be migrated, we stop pushing. This is valid because
we are always looking at the highest priority RT task on the run queue,
and if it can't migrate, then most likely the lower priority RT tasks
can't either.

There is one case not covered by this patch set: when the highest priority
non-running RT task has a CPU affinity such that it cannot preempt any
task on the CPUs it is allowed to run on, while a lower priority task has
a wider affinity and could preempt a task on one of its CPUs. In this case
the lower priority task will not be pushed to those CPUs (although those
CPUs may still pull it). Currently this patch set ignores this scenario.

Another place where we push RT tasks is in finish_task_switch. This is
done because a running RT task cannot be migrated while it is running.
So when an RT task is preempted by a higher priority RT task, that is the
moment we can migrate the preempted task.

We also actively pull RT tasks. Whenever a runqueue is about to lower its
priority (schedule a lower priority task), a check is made to see whether
that runqueue can pull RT tasks to run instead. All overloaded runqueues
(runqueues with more than one RT task queued) are searched for an RT task
that can run on this runqueue (its affinity matches) and is of higher
priority than the task the runqueue is about to schedule. The pull
algorithm pulls all RT tasks that match this case.

With this patch set, I ran rt-migrate-test overnight in a while loop with
the -c option (which terminates upon failure), and it passed over 6500
runs (each doing 50 wakeups).

Here's an example of the output from the patched kernel. This is just to
explain it a bit more.

   1:    20060      61      55      56      61
 len:    40060   20061   20055   20056   20061
 loops: 227255  146050  168104  145965  168144

   2:       40      46      31      35      42
 len:    20040   20046   20031   20035   20042
 loops:     28  167769  167781  167668  167804

The first iteration shown (really the second, since we start at zero) is a
typical run. The lowest prio task didn't start executing until the other 4
tasks finished and gave up the CPU.

The second iteration looks at first like a failure, but it is actually fine.
The lowest priority task just happened to be scheduled onto a CPU before the
higher priority tasks were woken up. As you can see from this example, the
higher priority tasks were still scheduled right away, and in doing so
preempted the lower priority task. This shows up in the number of loops the
lower priority task was able to complete: only 28. The busy loop terminates
when the time stamp reaches the interval (20ms here) past the start time
stamp. Since the lower priority task sneaked in and started early, its start
time stamp was early; by the time it was preempted and rescheduled, it was
already past the run interval, so it simply ended the loop.

Finally, the CFS RT balancing had to be removed for this to work. Testing
showed that the CFS RT balancing would actually pull RT tasks away from
runqueues they had already been properly assigned to, and again cause
latencies.  With this new approach, the CFS RT balancing is not needed,
and I suggest that these patches replace it.

Also, let me stress that I made a great effort to keep the overhead for
the non-RT cases as small as possible (practically none). Most of these
algorithms only come into play when more than one RT task is queued on the
same runqueue.

Special thanks go to Gregory Haskins for his advice and the back and forth
on IRC with ideas. Although I didn't use his actual patches (his were
against -rt), they did help me clean up some of my code.  Thanks also go to
Ingo Molnar; I took some ideas from his RT balance code in the -rt patch.


Comments welcomed!

-- Steve


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 1/8] Add rt_nr_running accounting
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
@ 2007-10-19 18:42 ` Steven Rostedt
  2007-10-20 16:45   ` Dmitry Adamushko
  2007-10-19 18:42 ` [patch 2/8] track highest prio queued on runqueue Steven Rostedt
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:42 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: count-rt-queued.patch --]
[-- Type: text/plain, Size: 1880 bytes --]

This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    2 ++
 kernel/sched_rt.c |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-19 12:32:39.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-19 12:33:09.000000000 -0400
@@ -300,6 +300,8 @@ struct rq {
 	 */
 	unsigned long nr_uninterruptible;
 
+	unsigned long rt_nr_running;
+
 	struct task_struct *curr, *idle;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:31:08.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:33:09.000000000 -0400
@@ -25,12 +25,28 @@ static void update_curr_rt(struct rq *rq
 	curr->se.exec_start = rq->clock;
 }
 
+static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+	if (rt_task(p))
+		rq->rt_nr_running++;
+}
+
+static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+	if (rt_task(p)) {
+		WARN_ON(!rq->rt_nr_running);
+		rq->rt_nr_running--;
+	}
+}
+
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
+
+	inc_rt_tasks(p, rq);
 }
 
 /*
@@ -42,6 +58,8 @@ static void dequeue_task_rt(struct rq *r
 
 	update_curr_rt(rq);
 
+	dec_rt_tasks(p, rq);
+
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
 		__clear_bit(p->prio, array->bitmap);

-- 


* [patch 2/8] track highest prio queued on runqueue
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
  2007-10-19 18:42 ` [patch 1/8] Add rt_nr_running accounting Steven Rostedt
@ 2007-10-19 18:42 ` Steven Rostedt
  2007-10-19 19:19   ` Steven Rostedt
                     ` (2 more replies)
  2007-10-19 18:42 ` [patch 3/8] push RT tasks Steven Rostedt
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:42 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: add-rq-highest-prio.patch --]
[-- Type: text/plain, Size: 2569 bytes --]

This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks,
its priority is considered MAX_RT_PRIO.

This information will be used by later patches.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    7 +++++++
 kernel/sched_rt.c |   25 +++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-19 12:33:09.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-19 12:34:32.000000000 -0400
@@ -324,6 +324,8 @@ struct rq {
 	int push_cpu;
 	/* cpu of this runqueue: */
 	int cpu;
+	/* highest queued rt task prio */
+	int highest_prio;
 
 	struct task_struct *migration_thread;
 	struct list_head migration_queue;
@@ -972,6 +974,8 @@ static void activate_task(struct rq *rq,
 
 	enqueue_task(rq, p, wakeup);
 	inc_nr_running(p, rq);
+
+	rq_prio_add_task(rq, p);
 }
 
 /*
@@ -984,6 +988,8 @@ static void deactivate_task(struct rq *r
 
 	dequeue_task(rq, p, sleep);
 	dec_nr_running(p, rq);
+
+	rq_prio_remove_task(rq, p);
 }
 
 /**
@@ -6619,6 +6625,7 @@ void __init sched_init(void)
 		rq->cpu = i;
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
+		rq->highest_prio = MAX_RT_PRIO;
 #endif
 		atomic_set(&rq->nr_iowait, 0);
 
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:33:09.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:33:23.000000000 -0400
@@ -110,6 +110,31 @@ static struct task_struct *pick_next_tas
 	return next;
 }
 
+#ifdef CONFIG_SMP
+static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
+{
+	if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)
+		rq->highest_prio = p->prio;
+}
+
+static inline void rq_prio_remove_task(struct rq *rq, struct task_struct *p)
+{
+	struct rt_prio_array *array;
+
+	if (unlikely(rt_task(p))) {
+		if (rq->rt_nr_running) {
+			if (p->prio >= rq->highest_prio) {
+				/* recalculate */
+				array = &rq->rt.active;
+				rq->highest_prio =
+					sched_find_first_bit(array->bitmap);
+			} /* otherwise leave rq->highest prio alone */
+		} else
+			rq->highest_prio = MAX_RT_PRIO;
+	}
+}
+#endif /* CONFIG_SMP */
+
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);

-- 


* [patch 3/8] push RT tasks
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
  2007-10-19 18:42 ` [patch 1/8] Add rt_nr_running accounting Steven Rostedt
  2007-10-19 18:42 ` [patch 2/8] track highest prio queued on runqueue Steven Rostedt
@ 2007-10-19 18:42 ` Steven Rostedt
  2007-10-19 18:42 ` [patch 4/8] RT overloaded runqueues accounting Steven Rostedt
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:42 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-balance-push-tasks.patch --]
[-- Type: text/plain, Size: 9361 bytes --]

This patch adds an algorithm to push extra RT tasks off a run queue to
other CPUs' runqueues.

When more than one RT task is added to a run queue, this algorithm actively
pushes the RT tasks that are not running onto other run queues running at
lower priority.  It works by looking at the highest priority RT task that
is not running and examining the runqueues of the CPUs in that task's
affinity mask. We find the runqueue with the lowest prio among those CPUs,
and if that prio is lower than the picked task's, we push the task onto
that CPU's runqueue.

We continue pushing RT tasks off the current runqueue until we don't push any
more.  The algorithm stops when the next highest RT task can't preempt any
other process on another CPU.

TODO: The algorithm may stop while there are still RT tasks that could be
 migrated. Specifically, if the highest non-running RT task's CPU affinity
 is restricted to CPUs that are running higher priority tasks, there may
 be a lower priority task queued whose affinity includes a CPU that is
 running a lower priority task it could be migrated to.  This patch set
 does not address that issue.

Note: checkpatch reveals two over-80-character lines. I'm not sure that
 breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    8 +
 kernel/sched_rt.c |  227 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 234 insertions(+), 1 deletion(-)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-19 12:34:32.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-19 12:34:46.000000000 -0400
@@ -1855,6 +1855,8 @@ static void finish_task_switch(struct rq
 	prev_state = prev->state;
 	finish_arch_switch(prev);
 	finish_lock_switch(rq, prev);
+	schedule_tail_balance_rt(rq);
+
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
@@ -2088,11 +2090,13 @@ static void double_rq_unlock(struct rq *
 /*
  * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
  */
-static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(this_rq->lock)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
+	int ret = 0;
+
 	if (unlikely(!irqs_disabled())) {
 		/* printk() doesn't work good under rq->lock */
 		spin_unlock(&this_rq->lock);
@@ -2103,9 +2107,11 @@ static void double_lock_balance(struct r
 			spin_unlock(&this_rq->lock);
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
+			ret = 1;
 		} else
 			spin_lock(&busiest->lock);
 	}
+	return ret;
 }
 
 /*
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:33:23.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:34:46.000000000 -0400
@@ -111,6 +111,12 @@ static struct task_struct *pick_next_tas
 }
 
 #ifdef CONFIG_SMP
+/* Only try algorithms three times */
+#define RT_MAX_TRIES 3
+
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
+
 static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
 {
 	if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)
@@ -133,6 +139,227 @@ static inline void rq_prio_remove_task(s
 			rq->highest_prio = MAX_RT_PRIO;
 	}
 }
+
+/* Return the second highest RT task, NULL otherwise */
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+{
+	struct rt_prio_array *array = &rq->rt.active;
+	struct task_struct *next;
+	struct list_head *queue;
+	int idx;
+
+	assert_spin_locked(&rq->lock);
+
+	if (likely(rq->rt_nr_running < 2))
+		return NULL;
+
+	idx = sched_find_first_bit(array->bitmap);
+	if (unlikely(idx >= MAX_RT_PRIO)) {
+		WARN_ON(1); /* rt_nr_running is bad */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+	if (unlikely(next != rq->curr))
+		return next;
+
+	if (queue->next->next != queue) {
+		/* same prio task */
+		next = list_entry(queue->next->next, struct task_struct, run_list);
+		return next;
+	}
+
+	/* slower, but more flexible */
+	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+	if (unlikely(idx >= MAX_RT_PRIO)) {
+		WARN_ON(1); /* rt_nr_running was 2 and above! */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+
+	return next;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+				      struct rq *this_rq)
+{
+	struct rq *lowest_rq = NULL;
+	cpumask_t cpu_mask;
+	int dst_cpu = -1;
+	int cpu;
+	int tries;
+
+	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
+
+	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+		/*
+		 * Scan each rq for the lowest prio.
+		 */
+		for_each_cpu_mask(cpu, cpu_mask) {
+			struct rq *rq = &per_cpu(runqueues, cpu);
+
+			if (cpu == this_rq->cpu)
+				continue;
+
+			/* We look for lowest RT prio or non-rt CPU */
+			if (rq->highest_prio >= MAX_RT_PRIO) {
+				lowest_rq = rq;
+				dst_cpu = cpu;
+				break;
+			}
+
+			/* no locking for now */
+			if (rq->highest_prio > task->prio &&
+			    (!lowest_rq || rq->highest_prio > lowest_rq->highest_prio)) {
+				dst_cpu = cpu;
+				lowest_rq = rq;
+			}
+		}
+
+		if (!lowest_rq)
+			break;
+
+		/* if the prio of this runqueue changed, try again */
+		if (double_lock_balance(this_rq, lowest_rq)) {
+			/*
+			 * We had to unlock the run queue. In
+			 * the mean time, task could have
+			 * migrated already or had its affinity changed.
+			 * Also make sure that it wasn't scheduled on its rq.
+			 */
+			if (unlikely(task_rq(task) != this_rq ||
+				     !cpu_isset(dst_cpu, task->cpus_allowed) ||
+				     task_running(this_rq, task) ||
+				     !task->se.on_rq)) {
+				spin_unlock(&lowest_rq->lock);
+				lowest_rq = NULL;
+				break;
+			}
+		}
+
+		/* If this rq is still suitable use it. */
+		if (lowest_rq->highest_prio > task->prio)
+			break;
+
+		/* try again */
+		spin_unlock(&lowest_rq->lock);
+		lowest_rq = NULL;
+	}
+
+	return lowest_rq;
+}
+
+/*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next_task;
+	struct rq *lowest_rq;
+	int dst_cpu;
+	int ret = 0;
+	int paranoid = RT_MAX_TRIES;
+
+	assert_spin_locked(&this_rq->lock);
+
+	next_task = pick_next_highest_task_rt(this_rq);
+	if (!next_task)
+		return 0;
+
+ retry:
+	if (unlikely(next_task == this_rq->curr))
+		return 0;
+
+	/*
+	 * It's possible that the next_task slipped in of
+	 * higher priority than current. If that's the case
+	 * just reschedule current.
+	 */
+	if (unlikely(next_task->prio < this_rq->curr->prio)) {
+		resched_task(this_rq->curr);
+		return 0;
+	}
+
+	/* We might release this_rq lock */
+	get_task_struct(next_task);
+
+	/* find_lock_lowest_rq locks the rq if found */
+	lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+	if (!lowest_rq) {
+		struct task_struct *task;
+		/*
+		 * find lock_lowest_rq releases this_rq->lock
+		 * so it is possible that next_task has changed.
+		 * If it has, then try again.
+		 */
+		task = pick_next_highest_task_rt(this_rq);
+		if (unlikely(task != next_task) && task && paranoid--) {
+			put_task_struct(next_task);
+			next_task = task;
+			goto retry;
+		}
+		goto out;
+	}
+
+	dst_cpu = lowest_rq->cpu;
+
+	assert_spin_locked(&lowest_rq->lock);
+
+	deactivate_task(this_rq, next_task, 0);
+	set_task_cpu(next_task, dst_cpu);
+	activate_task(lowest_rq, next_task, 0);
+
+	resched_task(lowest_rq->curr);
+
+	spin_unlock(&lowest_rq->lock);
+
+	ret = 1;
+out:
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+/*
+ * TODO: Currently we just use the second highest prio task on
+ *       the queue, and stop when it can't migrate (or there's
+ *       no more RT tasks).  There may be a case where a lower
+ *       priority RT task has a different affinity than the
+ *       higher RT task. In this case the lower RT task could
+ *       possibly be able to migrate where as the higher priority
+ *       RT task could not.  We currently ignore this issue.
+ *       Enhancements are welcome!
+ */
+static void push_rt_tasks(struct rq *rq)
+{
+	/* push_rt_task will return true if it moved an RT */
+	while (push_rt_task(rq))
+		;
+}
+
+static void schedule_tail_balance_rt(struct rq *rq)
+{
+	/*
+	 * If we have more than one rt_task queued, then
+	 * see if we can push the other rt_tasks off to other CPUS.
+	 * Note we may release the rq lock, and since
+	 * the lock was owned by prev, we need to release it
+	 * first via finish_lock_switch and then reaquire it here.
+	 */
+	if (unlikely(rq->rt_nr_running > 1)) {
+		spin_lock_irq(&rq->lock);
+		push_rt_tasks(rq);
+		spin_unlock_irq(&rq->lock);
+	}
+}
+#else /* CONFIG_SMP */
+# define schedule_tail_balance_rt(rq) do { } while (0)
 #endif /* CONFIG_SMP */
 
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)

-- 


* [patch 4/8] RT overloaded runqueues accounting
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
                   ` (2 preceding siblings ...)
  2007-10-19 18:42 ` [patch 3/8] push RT tasks Steven Rostedt
@ 2007-10-19 18:42 ` Steven Rostedt
  2007-10-19 18:42 ` [patch 5/8] Move prototypes together Steven Rostedt
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:42 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-overload.patch --]
[-- Type: text/plain, Size: 1673 bytes --]

This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded; that is, it
becomes a candidate to have RT tasks pulled from it.

Along with the rt_overload counter (the number of overloaded runqueues),
a bit mask of the runqueues overloaded with RT tasks is maintained.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched_rt.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:34:46.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:35:58.000000000 -0400
@@ -25,10 +25,23 @@ static void update_curr_rt(struct rq *rq
 	curr->se.exec_start = rq->clock;
 }
 
+#ifdef CONFIG_SMP
+static __cacheline_aligned_in_smp atomic_t rt_overload;
+static cpumask_t rto_cpumask;
+#endif /* CONFIG_SMP*/
+
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 {
 	if (rt_task(p))
 		rq->rt_nr_running++;
+#ifdef CONFIG_SMP
+	if (rq->rt_nr_running == 2) {
+		cpu_set(rq->cpu, rto_cpumask);
+		/* Make sure this cpu is visible on rt_overloads */
+		smp_wmb();
+		atomic_inc(&rt_overload);
+	}
+#endif
 }
 
 static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -37,6 +50,12 @@ static inline void dec_rt_tasks(struct t
 		WARN_ON(!rq->rt_nr_running);
 		rq->rt_nr_running--;
 	}
+#ifdef CONFIG_SMP
+	if (rq->rt_nr_running == 1) {
+		atomic_dec(&rt_overload);
+		cpu_clear(rq->cpu, rto_cpumask);
+	}
+#endif
 }
 
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

-- 


* [patch 5/8] Move prototypes together.
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
                   ` (3 preceding siblings ...)
  2007-10-19 18:42 ` [patch 4/8] RT overloaded runqueues accounting Steven Rostedt
@ 2007-10-19 18:42 ` Steven Rostedt
  2007-10-19 18:43 ` [patch 6/8] pull RT tasks Steven Rostedt
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:42 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: mov-protos.patch --]
[-- Type: text/plain, Size: 1427 bytes --]

A later patch uses some functions before their definitions, so this patch
moves their prototypes up with the other prototypes already declared.

This patch is mainly for prettiness.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched_rt.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:35:58.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:36:02.000000000 -0400
@@ -28,6 +28,8 @@ static void update_curr_rt(struct rq *rq
 #ifdef CONFIG_SMP
 static __cacheline_aligned_in_smp atomic_t rt_overload;
 static cpumask_t rto_cpumask;
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
 #endif /* CONFIG_SMP*/
 
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -133,9 +135,6 @@ static struct task_struct *pick_next_tas
 /* Only try algorithms three times */
 #define RT_MAX_TRIES 3
 
-static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
-static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
-
 static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
 {
 	if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)

-- 


* [patch 6/8] pull RT tasks
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
                   ` (4 preceding siblings ...)
  2007-10-19 18:42 ` [patch 5/8] Move prototypes together Steven Rostedt
@ 2007-10-19 18:43 ` Steven Rostedt
  2007-10-19 19:24   ` Peter Zijlstra
                     ` (2 more replies)
  2007-10-19 18:43 ` [patch 7/8] wake up balance RT Steven Rostedt
  2007-10-19 18:43 ` [patch 8/8] disable CFS RT load balancing Steven Rostedt
  7 siblings, 3 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:43 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-balance-pull-tasks.patch --]
[-- Type: text/plain, Size: 7320 bytes --]

This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When a pull is initiated, all overloaded runqueues are examined for an RT
task of higher prio than the highest prio task queued on the target
runqueue. If such a task is found, it is pulled to the target runqueue.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    8 ++
 kernel/sched_rt.c |  161 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 157 insertions(+), 12 deletions(-)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-19 12:34:46.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-19 12:36:05.000000000 -0400
@@ -2927,6 +2927,13 @@ static void idle_balance(int this_cpu, s
 	int pulled_task = -1;
 	unsigned long next_balance = jiffies + HZ;
 
+	/*
+	 * pull_rt_task returns true if the run queue changed.
+	 * But this does not mean we have a task to run.
+	 */
+	if (unlikely(pull_rt_task(this_rq)) && this_rq->nr_running)
+		return;
+
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 
@@ -3614,6 +3621,7 @@ need_resched_nonpreemptible:
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
+	schedule_balance_rt(rq, prev);
 	prev->sched_class->put_prev_task(rq, prev);
 	next = pick_next_task(rq, prev);
 
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:36:02.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:36:05.000000000 -0400
@@ -158,8 +158,17 @@ static inline void rq_prio_remove_task(s
 	}
 }
 
+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+		return 1;
+	return 0;
+}
+
 /* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
+						     int cpu)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 	struct task_struct *next;
@@ -178,26 +187,36 @@ static struct task_struct *pick_next_hig
 	}
 
 	queue = array->queue + idx;
+	BUG_ON(list_empty(queue));
+
 	next = list_entry(queue->next, struct task_struct, run_list);
-	if (unlikely(next != rq->curr))
-		return next;
+	if (unlikely(pick_rt_task(rq, next, cpu)))
+		goto out;
 
 	if (queue->next->next != queue) {
 		/* same prio task */
 		next = list_entry(queue->next->next, struct task_struct, run_list);
-		return next;
+		if (pick_rt_task(rq, next, cpu))
+			goto out;
 	}
 
+ retry:
 	/* slower, but more flexible */
 	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-	if (unlikely(idx >= MAX_RT_PRIO)) {
-		WARN_ON(1); /* rt_nr_running was 2 and above! */
+	if (unlikely(idx >= MAX_RT_PRIO))
 		return NULL;
-	}
 
 	queue = array->queue + idx;
-	next = list_entry(queue->next, struct task_struct, run_list);
+	BUG_ON(list_empty(queue));
+
+	list_for_each_entry(next, queue, run_list) {
+		if (pick_rt_task(rq, next, cpu))
+			goto out;
+	}
+
+	goto retry;
 
+ out:
 	return next;
 }
 
@@ -286,13 +305,15 @@ static int push_rt_task(struct rq *this_
 
 	assert_spin_locked(&this_rq->lock);
 
-	next_task = pick_next_highest_task_rt(this_rq);
+	next_task = pick_next_highest_task_rt(this_rq, -1);
 	if (!next_task)
 		return 0;
 
  retry:
-	if (unlikely(next_task == this_rq->curr))
+	if (unlikely(next_task == this_rq->curr)) {
+		WARN_ON(1);
 		return 0;
+	}
 
 	/*
 	 * It's possible that the next_task slipped in of
@@ -316,7 +337,7 @@ static int push_rt_task(struct rq *this_
 		 * so it is possible that next_task has changed.
 		 * If it has, then try again.
 		 */
-		task = pick_next_highest_task_rt(this_rq);
+		task = pick_next_highest_task_rt(this_rq, -1);
 		if (unlikely(task != next_task) && task && paranoid--) {
 			put_task_struct(next_task);
 			next_task = task;
@@ -361,6 +382,121 @@ static void push_rt_tasks(struct rq *rq)
 		;
 }
 
+static int pull_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next;
+	struct task_struct *p;
+	struct rq *src_rq;
+	int this_cpu = this_rq->cpu;
+	int cpu;
+	int ret = 0;
+
+	assert_spin_locked(&this_rq->lock);
+
+	if (likely(!atomic_read(&rt_overload)))
+		return 0;
+
+	next = pick_next_task_rt(this_rq);
+
+	for_each_cpu_mask(cpu, rto_cpumask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+		if (src_rq->rt_nr_running <= 1)
+			continue;
+
+		/*
+		 * We can potentially drop this_rq's lock in
+		 * double_lock_balance, and another CPU could
+		 * steal our next task - hence we must cause
+		 * the caller to recalculate the next task
+		 * in that case:
+		 */
+		if (double_lock_balance(this_rq, src_rq)) {
+			struct task_struct *old_next = next;
+			next = pick_next_task_rt(this_rq);
+			if (next != old_next)
+				ret = 1;
+		}
+
+		/*
+		 * Are there still pullable RT tasks?
+		 */
+		if (src_rq->rt_nr_running <= 1) {
+			spin_unlock(&src_rq->lock);
+			continue;
+		}
+
+		p = pick_next_highest_task_rt(src_rq, this_cpu);
+
+		/*
+		 * Do we have an RT task that preempts
+		 * the to-be-scheduled task?
+		 */
+		if (p && (!next || (p->prio < next->prio))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * There's a chance that p is higher in priority
+			 * than what's currently running on its cpu.
+			 * This is just that p is waking up and hasn't
+			 * had a chance to schedule. We only pull
+			 * p if it is lower in priority than the
+			 * current task on the run queue or
+			 * this_rq next task is lower in prio than
+			 * the current task on that rq.
+			 */
+			if (p->prio < src_rq->curr->prio ||
+			    (next && next->prio < src_rq->curr->prio))
+				goto bail;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			/*
+			 * We continue with the search, just in
+			 * case there's an even higher prio task
+			 * in another runqueue. (low likelihood
+			 * but possible)
+			 */
+
+			/*
+			 * Update next so that we won't pick a task
+			 * on another cpu with a priority lower (or equal)
+			 * than the one we just picked.
+			 */
+			next = p;
+
+		}
+ bail:
+		spin_unlock(&src_rq->lock);
+	}
+
+	return ret;
+}
+
+static void schedule_balance_rt(struct rq *rq,
+				struct task_struct *prev)
+{
+	struct rt_prio_array *array;
+	int next_prio;
+
+	/* Try to pull RT tasks here if we lower this rq's prio */
+	if (unlikely(rt_task(prev))) {
+		next_prio = MAX_RT_PRIO;
+		if (rq->rt_nr_running) {
+			array = &rq->rt.active;
+			next_prio = sched_find_first_bit(array->bitmap);
+		}
+		if (next_prio > prev->prio)
+			pull_rt_task(rq);
+	}
+}
+
 static void schedule_tail_balance_rt(struct rq *rq)
 {
 	/*
@@ -377,7 +513,8 @@ static void schedule_tail_balance_rt(str
 	}
 }
 #else /* CONFIG_SMP */
-# define schedule_tail_balance_rt(rq) do { } while (0)
+# define schedule_balance_rt(rq)	do { } while (0)
+# define schedule_tail_balance_rt(rq)	do { } while (0)
 #endif /* CONFIG_SMP */
 
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)

-- 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 7/8] wake up balance RT
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
                   ` (5 preceding siblings ...)
  2007-10-19 18:43 ` [patch 6/8] pull RT tasks Steven Rostedt
@ 2007-10-19 18:43 ` Steven Rostedt
  2007-10-19 18:43 ` [patch 8/8] disable CFS RT load balancing Steven Rostedt
  7 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:43 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-balance-wakeup.patch --]
[-- Type: text/plain, Size: 1668 bytes --]

This patch adds pushing of RT tasks from an overloaded runqueue whenever
a task (most likely an RT task) is woken up on that runqueue.

TODO: We don't cover the case of waking up new RT tasks (yet).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched.c    |    1 +
 kernel/sched_rt.c |   12 ++++++++++++
 2 files changed, 13 insertions(+)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:36:05.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:36:07.000000000 -0400
@@ -512,9 +512,21 @@ static void schedule_tail_balance_rt(str
 		spin_unlock_irq(&rq->lock);
 	}
 }
+
+static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
+{
+	if (unlikely(rt_task(p)) &&
+	    (p->prio >= rq->curr->prio)) {
+		/* pull_rt_task needs task to be running */
+		p->state = TASK_RUNNING;
+		push_rt_tasks(rq);
+	}
+}
+
 #else /* CONFIG_SMP */
 # define schedule_balance_rt(rq)	do { } while (0)
 # define schedule_tail_balance_rt(rq)	do { } while (0)
+# define wakeup_balance_rt(rq, p)	do { } while (0)
 #endif /* CONFIG_SMP */
 
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-19 12:36:05.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-19 12:36:07.000000000 -0400
@@ -1609,6 +1609,7 @@ out_activate:
 	update_rq_clock(rq);
 	activate_task(rq, p, 1);
 	check_preempt_curr(rq, p);
+	wakeup_balance_rt(rq, p);
 	success = 1;
 
 out_running:

-- 


* [patch 8/8] disable CFS RT load balancing.
  2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
                   ` (6 preceding siblings ...)
  2007-10-19 18:43 ` [patch 7/8] wake up balance RT Steven Rostedt
@ 2007-10-19 18:43 ` Steven Rostedt
  7 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 18:43 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

[-- Attachment #1: disable-CFS-rt-balance.patch --]
[-- Type: text/plain, Size: 3309 bytes --]

Since we now take an active approach to load balancing, we don't need to
balance RT tasks via CFS. In fact, this code was found to pull RT tasks
away from the CPUs that the active balancing had just moved them to,
resulting in large latencies.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched_rt.c |   90 +-----------------------------------------------------
 1 file changed, 2 insertions(+), 88 deletions(-)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:36:07.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-19 13:01:31.000000000 -0400
@@ -535,100 +535,14 @@ static void put_prev_task_rt(struct rq *
 	p->se.exec_start = 0;
 }
 
-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
-	struct rq *rq = arg;
-	struct rt_prio_array *array = &rq->rt.active;
-	struct list_head *head, *curr;
-	struct task_struct *p;
-	int idx;
-
-	idx = sched_find_first_bit(array->bitmap);
-	if (idx >= MAX_RT_PRIO)
-		return NULL;
-
-	head = array->queue + idx;
-	curr = head->prev;
-
-	p = list_entry(curr, struct task_struct, run_list);
-
-	curr = curr->prev;
-
-	rq->rt.rt_load_balance_idx = idx;
-	rq->rt.rt_load_balance_head = head;
-	rq->rt.rt_load_balance_curr = curr;
-
-	return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
-	struct rq *rq = arg;
-	struct rt_prio_array *array = &rq->rt.active;
-	struct list_head *head, *curr;
-	struct task_struct *p;
-	int idx;
-
-	idx = rq->rt.rt_load_balance_idx;
-	head = rq->rt.rt_load_balance_head;
-	curr = rq->rt.rt_load_balance_curr;
-
-	/*
-	 * If we arrived back to the head again then
-	 * iterate to the next queue (if any):
-	 */
-	if (unlikely(head == curr)) {
-		int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-
-		if (next_idx >= MAX_RT_PRIO)
-			return NULL;
-
-		idx = next_idx;
-		head = array->queue + idx;
-		curr = head->prev;
-
-		rq->rt.rt_load_balance_idx = idx;
-		rq->rt.rt_load_balance_head = head;
-	}
-
-	p = list_entry(curr, struct task_struct, run_list);
-
-	curr = curr->prev;
-
-	rq->rt.rt_load_balance_curr = curr;
-
-	return p;
-}
-
 static unsigned long
 load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 			unsigned long max_nr_move, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *all_pinned, int *this_best_prio)
 {
-	int nr_moved;
-	struct rq_iterator rt_rq_iterator;
-	unsigned long load_moved;
-
-	rt_rq_iterator.start = load_balance_start_rt;
-	rt_rq_iterator.next = load_balance_next_rt;
-	/* pass 'busiest' rq argument into
-	 * load_balance_[start|next]_rt iterators
-	 */
-	rt_rq_iterator.arg = busiest;
-
-	nr_moved = balance_tasks(this_rq, this_cpu, busiest, max_nr_move,
-			max_load_move, sd, idle, all_pinned, &load_moved,
-			this_best_prio, &rt_rq_iterator);
-
-	return load_moved;
+	/* don't touch RT tasks */
+	return 0;
 }
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)

-- 


* Re: [patch 2/8] track highest prio queued on runqueue
  2007-10-19 18:42 ` [patch 2/8] track highest prio queued on runqueue Steven Rostedt
@ 2007-10-19 19:19   ` Steven Rostedt
  2007-10-19 19:45   ` Gregory Haskins
  2007-10-20 18:14   ` Dmitry Adamushko
  2 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 19:19 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner,
	Gregory Haskins, Peter Zijlstra

On Fri, Oct 19, 2007 at 02:42:56PM -0400, Steven Rostedt wrote:
> This patch adds accounting to each runqueue to keep track of the
> highest prio task queued on the run queue. We only care about
> RT tasks, so if the run queue does not contain any active RT tasks
> its priority will be considered MAX_RT_PRIO.
> 
> This information will be used for later patches.
> 

[...]

> @@ -972,6 +974,8 @@ static void activate_task(struct rq *rq,
>  
>  	enqueue_task(rq, p, wakeup);
>  	inc_nr_running(p, rq);
> +
> +	rq_prio_add_task(rq, p);
>  }
>  
>  /*
> @@ -984,6 +988,8 @@ static void deactivate_task(struct rq *r
>  
>  	dequeue_task(rq, p, sleep);
>  	dec_nr_running(p, rq);
> +
> +	rq_prio_remove_task(rq, p);
>  }
>  
>  /**
> @@ -6619,6 +6625,7 @@ void __init sched_init(void)
>  		rq->cpu = i;
>  		rq->migration_thread = NULL;
>  		INIT_LIST_HEAD(&rq->migration_queue);
> +		rq->highest_prio = MAX_RT_PRIO;
>  #endif
>  		atomic_set(&rq->nr_iowait, 0);
>  
> Index: linux-test.git/kernel/sched_rt.c
> ===================================================================
> --- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:33:09.000000000 -0400
> +++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:33:23.000000000 -0400
> @@ -110,6 +110,31 @@ static struct task_struct *pick_next_tas
>  	return next;
>  }
>  
> +#ifdef CONFIG_SMP
> +static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
> +{
> +	if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)
> +		rq->highest_prio = p->prio;
> +}
> +
> +static inline void rq_prio_remove_task(struct rq *rq, struct task_struct *p)
> +{
> +	struct rt_prio_array *array;
> +
> +	if (unlikely(rt_task(p))) {
> +		if (rq->rt_nr_running) {
> +			if (p->prio >= rq->highest_prio) {
> +				/* recalculate */
> +				array = &rq->rt.active;
> +				rq->highest_prio =
> +					sched_find_first_bit(array->bitmap);
> +			} /* otherwise leave rq->highest prio alone */
> +		} else
> +			rq->highest_prio = MAX_RT_PRIO;
> +	}
> +}
> +#endif /* CONFIG_SMP */
> +

Sorry, I forgot to test this on UP. Seems to be missing a 

#define rq_prio_remove_task(rq, p) do { } while(0)
#define rq_prio_add(rq, p) do { } while(0)

for the !CONFIG_SMP case.

Will fix.

-- Steve




* Re: [patch 6/8] pull RT tasks
  2007-10-19 18:43 ` [patch 6/8] pull RT tasks Steven Rostedt
@ 2007-10-19 19:24   ` Peter Zijlstra
  2007-10-19 19:35     ` Peter Zijlstra
  2007-10-21  9:35   ` Dmitry Adamushko
  2007-10-21 11:59   ` Dmitry Adamushko
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2007-10-19 19:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins


On Fri, 2007-10-19 at 14:43 -0400, Steven Rostedt wrote:
> plain text document attachment (rt-balance-pull-tasks.patch)

> +static int pull_rt_task(struct rq *this_rq)
> +{
> +	struct task_struct *next;
> +	struct task_struct *p;
> +	struct rq *src_rq;
> +	int this_cpu = this_rq->cpu;
> +	int cpu;
> +	int ret = 0;
> +
> +	assert_spin_locked(&this_rq->lock);
> +
> +	if (likely(!atomic_read(&rt_overload)))
> +		return 0;

This seems to be the only usage of rt_overload. I'm not sure its worth
keeping it around for this.

> +	next = pick_next_task_rt(this_rq);
> +
> +	for_each_cpu_mask(cpu, rto_cpumask) {
> +		if (this_cpu == cpu)
> +			continue;

...

> +	}
> +
> +	return ret;
> +}




* Re: [patch 6/8] pull RT tasks
  2007-10-19 19:24   ` Peter Zijlstra
@ 2007-10-19 19:35     ` Peter Zijlstra
  2007-10-19 19:43       ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2007-10-19 19:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Paul Jackson


On Fri, 2007-10-19 at 21:24 +0200, Peter Zijlstra wrote:
> On Fri, 2007-10-19 at 14:43 -0400, Steven Rostedt wrote:
> > plain text document attachment (rt-balance-pull-tasks.patch)
> 
> > +static int pull_rt_task(struct rq *this_rq)
> > +{
> > +	struct task_struct *next;
> > +	struct task_struct *p;
> > +	struct rq *src_rq;
> > +	int this_cpu = this_rq->cpu;
> > +	int cpu;
> > +	int ret = 0;
> > +
> > +	assert_spin_locked(&this_rq->lock);
> > +
> > +	if (likely(!atomic_read(&rt_overload)))
> > +		return 0;
> 
> This seems to be the only usage of rt_overload. I'm not sure its worth
> keeping it around for this.

Ingo just brought up a good point. With large smp (where large is >64)
this will all suck chunks.

rt_overload will bounce around the system, and the rto_cpumask updates
might already hurt.

The idea would be to do this per cpuset, these naturally limit the
migration possibilities of tasks and would thus be the natural locality
to break this data structure.

> > +	next = pick_next_task_rt(this_rq);
> > +
> > +	for_each_cpu_mask(cpu, rto_cpumask) {
> > +		if (this_cpu == cpu)
> > +			continue;
> 
> ...
> 
> > +	}
> > +
> > +	return ret;
> > +}
> 



* Re: [patch 6/8] pull RT tasks
  2007-10-19 19:35     ` Peter Zijlstra
@ 2007-10-19 19:43       ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 19:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Paul Jackson


--
On Fri, 19 Oct 2007, Peter Zijlstra wrote:

> > > +
> > > +	if (likely(!atomic_read(&rt_overload)))
> > > +		return 0;
> >
> > This seems to be the only usage of rt_overload. I'm not sure its worth
> > keeping it around for this.
>
> Ingo just brought up a good point. With large smp (where large is >64)
> this will all suck chunks.
>
> rt_overload will bounce around the system, and the rto_cpumask updates
> might already hurt.
>
> The idea would be to do this per cpuset, these naturally limit the
> migration possibilities of tasks and would thus be the natural locality
> to break this data structure.
>

That sounds like a good idea. RT balancing on >64 CPUs should be limited.
Having a bounding cpuset would help.

I'll try to come up with something.

Thanks!

-- Steve



* Re: [patch 2/8] track highest prio queued on runqueue
  2007-10-19 18:42 ` [patch 2/8] track highest prio queued on runqueue Steven Rostedt
  2007-10-19 19:19   ` Steven Rostedt
@ 2007-10-19 19:45   ` Gregory Haskins
  2007-10-19 19:57     ` Steven Rostedt
  2007-10-20 18:14   ` Dmitry Adamushko
  2 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2007-10-19 19:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 3601 bytes --]

On Fri, 2007-10-19 at 14:42 -0400, Steven Rostedt wrote:
> plain text document attachment (add-rq-highest-prio.patch)
> This patch adds accounting to each runqueue to keep track of the
> highest prio task queued on the run queue. We only care about
> RT tasks, so if the run queue does not contain any active RT tasks
> its priority will be considered MAX_RT_PRIO.
> 
> This information will be used for later patches.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> ---
>  kernel/sched.c    |    7 +++++++
>  kernel/sched_rt.c |   25 +++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
> 
> Index: linux-test.git/kernel/sched.c
> ===================================================================
> --- linux-test.git.orig/kernel/sched.c	2007-10-19 12:33:09.000000000 -0400
> +++ linux-test.git/kernel/sched.c	2007-10-19 12:34:32.000000000 -0400
> @@ -324,6 +324,8 @@ struct rq {
>  	int push_cpu;
>  	/* cpu of this runqueue: */
>  	int cpu;
> +	/* highest queued rt task prio */
> +	int highest_prio;
>  
>  	struct task_struct *migration_thread;
>  	struct list_head migration_queue;
> @@ -972,6 +974,8 @@ static void activate_task(struct rq *rq,
>  
>  	enqueue_task(rq, p, wakeup);
>  	inc_nr_running(p, rq);
> +
> +	rq_prio_add_task(rq, p);
>  }

We are better off doing this in enqueue_task() or PI boosting will fail
to alter the prio of the RQ.

>  
>  /*
> @@ -984,6 +988,8 @@ static void deactivate_task(struct rq *r
>  
>  	dequeue_task(rq, p, sleep);
>  	dec_nr_running(p, rq);
> +
> +	rq_prio_remove_task(rq, p);

Ditto for dequeue_task()

>  }
>  
>  /**
> @@ -6619,6 +6625,7 @@ void __init sched_init(void)
>  		rq->cpu = i;
>  		rq->migration_thread = NULL;
>  		INIT_LIST_HEAD(&rq->migration_queue);
> +		rq->highest_prio = MAX_RT_PRIO;
>  #endif
>  		atomic_set(&rq->nr_iowait, 0);
>  
> Index: linux-test.git/kernel/sched_rt.c
> ===================================================================
> --- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:33:09.000000000 -0400
> +++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:33:23.000000000 -0400
> @@ -110,6 +110,31 @@ static struct task_struct *pick_next_tas
>  	return next;
>  }
>  
> +#ifdef CONFIG_SMP
> +static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
> +{
> +	if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)
> +		rq->highest_prio = p->prio;
> +}
> +
> +static inline void rq_prio_remove_task(struct rq *rq, struct task_struct *p)
> +{
> +	struct rt_prio_array *array;
> +
> +	if (unlikely(rt_task(p))) {
> +		if (rq->rt_nr_running) {
> +			if (p->prio >= rq->highest_prio) {

<= ?

We only need to recalc if the task being removed was the highest prio
(or higher, if that were possible but it shouldn't be).

I think this logic will technically work as-is because every task is
always >= highest in a properly accounted system, so you will just
simply recalc for every remove.  But I do like the optimization that you
were trying to add.

So for the paranoid:

				BUG_ON(p->prio < rq->highest_prio);
				if (p->prio == rq->highest_prio) {
					/* recalc */
				}


> +				/* recalculate */
> +				array = &rq->rt.active;
> +				rq->highest_prio =
> +					sched_find_first_bit(array->bitmap);
> +			} /* otherwise leave rq->highest prio alone */
> +		} else
> +			rq->highest_prio = MAX_RT_PRIO;
> +	}
> +}
> +#endif /* CONFIG_SMP */
> +
>  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>  {
>  	update_curr_rt(rq);
> 



* Re: [patch 2/8] track highest prio queued on runqueue
  2007-10-19 19:45   ` Gregory Haskins
@ 2007-10-19 19:57     ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-19 19:57 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra


--
On Fri, 19 Oct 2007, Gregory Haskins wrote:

> >
> >  	struct task_struct *migration_thread;
> >  	struct list_head migration_queue;
> > @@ -972,6 +974,8 @@ static void activate_task(struct rq *rq,
> >
> >  	enqueue_task(rq, p, wakeup);
> >  	inc_nr_running(p, rq);
> > +
> > +	rq_prio_add_task(rq, p);
> >  }
>
> We are better off doing this in enqueue_task() or PI boosting will fail
> to alter the prio of the RQ.

I thought about that too, but thought it was also too -rt-based. But I
think I'll move it regardless.

>
> >
> >  /*
> > @@ -984,6 +988,8 @@ static void deactivate_task(struct rq *r
> >
> >  	dequeue_task(rq, p, sleep);
> >  	dec_nr_running(p, rq);
> > +
> > +	rq_prio_remove_task(rq, p);
>
> Ditto for dequeue_task()

Will move.

>
> >  }
> >
> >  /**
> > @@ -6619,6 +6625,7 @@ void __init sched_init(void)
> >  		rq->cpu = i;
> >  		rq->migration_thread = NULL;
> >  		INIT_LIST_HEAD(&rq->migration_queue);
> > +		rq->highest_prio = MAX_RT_PRIO;
> >  #endif
> >  		atomic_set(&rq->nr_iowait, 0);
> >
> > Index: linux-test.git/kernel/sched_rt.c
> > ===================================================================
> > --- linux-test.git.orig/kernel/sched_rt.c	2007-10-19 12:33:09.000000000 -0400
> > +++ linux-test.git/kernel/sched_rt.c	2007-10-19 12:33:23.000000000 -0400
> > @@ -110,6 +110,31 @@ static struct task_struct *pick_next_tas
> >  	return next;
> >  }
> >
> > +#ifdef CONFIG_SMP
> > +static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
> > +{
> > +	if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)
> > +		rq->highest_prio = p->prio;
> > +}
> > +
> > +static inline void rq_prio_remove_task(struct rq *rq, struct task_struct *p)
> > +{
> > +	struct rt_prio_array *array;
> > +
> > +	if (unlikely(rt_task(p))) {
> > +		if (rq->rt_nr_running) {
> > +			if (p->prio >= rq->highest_prio) {
>
> <= ?
>
> We only need to recalc if the task being removed was the highest prio
> (or higher, if that were possible but it shouldnt be).

That's what I meant to do. (/me had another confusion between > and < for
the lower the prio the higher the priority!!!!).

>
> I think this logic will technically work as-is because every task is

That wasn't planned. I wanted to only recalc if the priority was >= the
rq priority, which would have been p->prio <= rq->highest_prio.
Yes, I thought to myself (that should never be higher!) but I was
paranoid. So the test is not what I meant.

> always >= highest in a properly accounted system, so you will just
> simply recalc for every remove.  But I do like the optimization that you
> were trying to add.
>
> So for the paranoid:
>
> 				BUG_ON(p->prio < rq->highest_prio);
> 				if (p->prio == rq->highest_prio) {
> 					/* recalc */
> 				}

I'll add that with a switch to WARN_ON.

-- Steve



* Re: [patch 1/8] Add rt_nr_running accounting
  2007-10-19 18:42 ` [patch 1/8] Add rt_nr_running accounting Steven Rostedt
@ 2007-10-20 16:45   ` Dmitry Adamushko
  2007-10-21  2:13     ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Dmitry Adamushko @ 2007-10-20 16:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra

On 19/10/2007, Steven Rostedt <rostedt@goodmis.org> wrote:
> [ ... ]
> Index: linux-test.git/kernel/sched.c
> ===================================================================
> --- linux-test.git.orig/kernel/sched.c  2007-10-19 12:32:39.000000000 -0400
> +++ linux-test.git/kernel/sched.c       2007-10-19 12:33:09.000000000 -0400
> @@ -300,6 +300,8 @@ struct rq {
>          */
>         unsigned long nr_uninterruptible;
>
> +       unsigned long rt_nr_running;

could it be a part of the 'struct rt_rq' instead?

>
> +static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
> +{
> +       if (rt_task(p))
> +               rq->rt_nr_running++;
> +}
> +
> +static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
> +{
> +       if (rt_task(p)) {
> +               WARN_ON(!rq->rt_nr_running);
> +               rq->rt_nr_running--;
> +       }
> +}
> +
>  static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
>  {
>         struct rt_prio_array *array = &rq->rt.active;
>
>         list_add_tail(&p->run_list, array->queue + p->prio);
>         __set_bit(p->prio, array->bitmap);
> +
> +       inc_rt_tasks(p, rq);

why do you need the rt_task(p) check in {inc,dec}_rt_tasks() ?

{enqueue,dequeue}_task_rt() seem to be the only callers and they will
crash (or corrupt memory) anyway in the case of ! rt_task(p) (sure,
this case would mean something is broken somewhere wrt sched_class
handling).


-- 
Best regards,
Dmitry Adamushko


* Re: [patch 2/8] track highest prio queued on runqueue
  2007-10-19 18:42 ` [patch 2/8] track highest prio queued on runqueue Steven Rostedt
  2007-10-19 19:19   ` Steven Rostedt
  2007-10-19 19:45   ` Gregory Haskins
@ 2007-10-20 18:14   ` Dmitry Adamushko
  2007-10-21  2:19     ` Steven Rostedt
  2 siblings, 1 reply; 25+ messages in thread
From: Dmitry Adamushko @ 2007-10-20 18:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra

On 19/10/2007, Steven Rostedt <rostedt@goodmis.org> wrote:
> [ ... ]
>
===================================================================
> --- linux-test.git.orig/kernel/sched.c  2007-10-19 12:33:09.000000000 -0400
> +++ linux-test.git/kernel/sched.c       2007-10-19 12:34:32.000000000 -0400
> @@ -324,6 +324,8 @@ struct rq {
>         int push_cpu;
>         /* cpu of this runqueue: */
>         int cpu;
> +       /* highest queued rt task prio */
> +       int highest_prio;

again, could it be moved to 'struct rt_rq' ?
(so we want to cache it, as we don't want to thrash more per-cpu bytes
calling something like
if (!rt_nr_running) sched_find_first_bit() from other CPUs)


> @@ -972,6 +974,8 @@ static void activate_task(struct rq *rq,
>
>         enqueue_task(rq, p, wakeup);
>         inc_nr_running(p, rq);
> +
> +       rq_prio_add_task(rq, p);
>  }
>
>  /*
> @@ -984,6 +988,8 @@ static void deactivate_task(struct rq *r
>
>         dequeue_task(rq, p, sleep);
>         dec_nr_running(p, rq);
> +
> +       rq_prio_remove_task(rq, p);
>  }

{enqueue,dequeue}_task_rt() would be more appropriate places.


> +static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
> +{
> +       if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)
> +               rq->highest_prio = p->prio;
> +}
> +
> +static inline void rq_prio_remove_task(struct rq *rq, struct task_struct *p)
> +{
> +       struct rt_prio_array *array;
> +
> +       if (unlikely(rt_task(p))) {
> +               if (rq->rt_nr_running) {
> +                       if (p->prio >= rq->highest_prio) {
> +                               /* recalculate */
> +                               array = &rq->rt.active;
> +                               rq->highest_prio =
> +                                       sched_find_first_bit(array->bitmap);
> +                       } /* otherwise leave rq->highest prio alone */
> +               } else
> +                       rq->highest_prio = MAX_RT_PRIO;
> +       }
> +}
> +#endif /* CONFIG_SMP */
> +

(just a few thoughts)

we call sched_find_first_bit() in pick_next_task_rt() in the case when
rt_nr_running != 0.

So if we can tolerate the 'latency' of updating the 'highest_prio' ==
the interval of time between deactivate_task() and pick_next_task() in
schedule() then rq_prio_remove_task() would just need to do a single
thing:

/* No more RT tasks: */
if (!rt_nr_running)
        highest_prio = MAX_RT_PRIO;

and then,

static struct task_struct *pick_next_task_rt(struct rq *rq)
{
        struct rt_prio_array *array = &rq->rt.active;
        struct task_struct *next;
        struct list_head *queue;
        int idx;

-        idx = sched_find_first_bit(array->bitmap);
+       rq->highest_prio = idx = sched_find_first_bit(array->bitmap);

[ ... ]

additionally, if we can tolerate the 'latency' (of updating
highest_prio) == the worst case scheduling latency, then
rq_prio_add_task() is not necessary at all.


-- 
Best regards,
Dmitry Adamushko


* Re: [patch 1/8] Add rt_nr_running accounting
  2007-10-20 16:45   ` Dmitry Adamushko
@ 2007-10-21  2:13     ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-21  2:13 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra


Hi Dmitry,

--
On Sat, 20 Oct 2007, Dmitry Adamushko wrote:

> On 19/10/2007, Steven Rostedt <rostedt@goodmis.org> wrote:
> > [ ... ]
> > Index: linux-test.git/kernel/sched.c
> > ===================================================================
> > --- linux-test.git.orig/kernel/sched.c  2007-10-19 12:32:39.000000000 -0400
> > +++ linux-test.git/kernel/sched.c       2007-10-19 12:33:09.000000000 -0400
> > @@ -300,6 +300,8 @@ struct rq {
> >          */
> >         unsigned long nr_uninterruptible;
> >
> > +       unsigned long rt_nr_running;
>
> could it be a part of the 'struct rt_rq' instead?

Maybe. I didn't really look too hard at where to put it. Currently, in the
-rt patch, it is located in struct rq, so I just did the same. I may be
able to move it.

>
> >
> > +static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
> > +{
> > +       if (rt_task(p))
> > +               rq->rt_nr_running++;
> > +}
> > +
> > +static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
> > +{
> > +       if (rt_task(p)) {
> > +               WARN_ON(!rq->rt_nr_running);
> > +               rq->rt_nr_running--;
> > +       }
> > +}
> > +
> >  static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
> >  {
> >         struct rt_prio_array *array = &rq->rt.active;
> >
> >         list_add_tail(&p->run_list, array->queue + p->prio);
> >         __set_bit(p->prio, array->bitmap);
> > +
> > +       inc_rt_tasks(p, rq);
>
> why do you need the rt_task(p) check in {inc,dec}_rt_tasks() ?

Me being paranoid ;-)

>
> {enqueue,dequeue}_task_rt() seem to be the only callers and they will
> crash (or corrupt memory) anyway in the case of ! rt_task(p) (sure,
> this case would mean something is broken somewhere wrt sched_class
> handling).

Exactly, I was just being safe. We could add a WARN_ON(!rt_task) there.

-- Steve


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/8] track highest prio queued on runqueue
  2007-10-20 18:14   ` Dmitry Adamushko
@ 2007-10-21  2:19     ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-21  2:19 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra


--
On Sat, 20 Oct 2007, Dmitry Adamushko wrote:

> On 19/10/2007, Steven Rostedt <rostedt@goodmis.org> wrote:
> > [ ... ]
> >
> ===================================================================
> > --- linux-test.git.orig/kernel/sched.c  2007-10-19 12:33:09.000000000 -0400
> > +++ linux-test.git/kernel/sched.c       2007-10-19 12:34:32.000000000 -0400
> > @@ -324,6 +324,8 @@ struct rq {
> >         int push_cpu;
> >         /* cpu of this runqueue: */
> >         int cpu;
> > +       /* highest queued rt task prio */
> > +       int highest_prio;
>
> again, could it be moved to 'struct rt_rq' ?
> (so we want to cache it as we don't want to trash more per-cpu bytes
> calling smth like
> if (!rt_nr_running) sched_find_first_bit() from other CPUs)

Thanks Dmitry, I'll look into moving these around.

>
>
> > @@ -972,6 +974,8 @@ static void activate_task(struct rq *rq,
> >
> >         enqueue_task(rq, p, wakeup);
> >         inc_nr_running(p, rq);
> > +
> > +       rq_prio_add_task(rq, p);
> >  }
> >
> >  /*
> > @@ -984,6 +988,8 @@ static void deactivate_task(struct rq *r
> >
> >         dequeue_task(rq, p, sleep);
> >         dec_nr_running(p, rq);
> > +
> > +       rq_prio_remove_task(rq, p);
> >  }
>
> {enqueue,dequeue}_task_rt() would be more appropriate places.

Yep, I already have that done in my second queue (not yet posted).

>
>
> > +static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p)
> > +{
> > +       if (unlikely(rt_task(p)) && p->prio < rq->highest_prio)
> > +               rq->highest_prio = p->prio;
> > +}
> > +
> > +static inline void rq_prio_remove_task(struct rq *rq, struct task_struct *p)
> > +{
> > +       struct rt_prio_array *array;
> > +
> > +       if (unlikely(rt_task(p))) {
> > +               if (rq->rt_nr_running) {
> > +                       if (p->prio >= rq->highest_prio) {
> > +                               /* recalculate */
> > +                               array = &rq->rt.active;
> > +                               rq->highest_prio =
> > +                                       sched_find_first_bit(array->bitmap);
> > +                       } /* otherwise leave rq->highest prio alone */
> > +               } else
> > +                       rq->highest_prio = MAX_RT_PRIO;
> > +       }
> > +}
> > +#endif /* CONFIG_SMP */
> > +
>
> (just a few thoughts)
>
> we call sched_find_first_bit() in pick_next_task_rt() in the case when
> rt_nr_running != 0.
>
> So if we can tolerate the 'latency' of updating the 'highest_prio' ==
> the interval of time between deactivate_task() and pick_next_task() in
> schedule() then rq_prio_remove_task() would just need to do a single
> thing:
>
> /* No more RT tasks: */
> if (!rt_nr_running)
>         highest_prio = MAX_RT_PRIO;
>
> and then,
>
> static struct task_struct *pick_next_task_rt(struct rq *rq)
> {
>         struct rt_prio_array *array = &rq->rt.active;
>         struct task_struct *next;
>         struct list_head *queue;
>         int idx;
>
> -        idx = sched_find_first_bit(array->bitmap);
> +       rq->highest_prio = idx = sched_find_first_bit(array->bitmap);
>
> [ ... ]
>
> additionally, if we can tolerate the 'latency' (of updating
> highest_prio) == the worst case scheduling latency, then
> rq_prio_add_task() is not necessary at all.

In my logging of test runs, having this 'latency' of highest_prio caused
missed migrations. I tried various things to do like what you said, but
they failed the rt-migrate-test program.

Seemed like the only places to modify highest_prio are from the enqueue and
dequeue. Otherwise, my tests failed.

Thanks!

-- Steve


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 6/8] pull RT tasks
  2007-10-19 18:43 ` [patch 6/8] pull RT tasks Steven Rostedt
  2007-10-19 19:24   ` Peter Zijlstra
@ 2007-10-21  9:35   ` Dmitry Adamushko
  2007-10-22 13:55     ` Steven Rostedt
  2007-10-21 11:59   ` Dmitry Adamushko
  2 siblings, 1 reply; 25+ messages in thread
From: Dmitry Adamushko @ 2007-10-21  9:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra

Hi Steven,

> When a pull RT is initiated, all overloaded runqueues are examined for
> an RT task that is higher in prio than the highest prio task queued on the
> target runqueue. If another runqueue is found holding an RT task of higher
> prio than the highest prio task on the target runqueue, that task is pulled
> to the target runqueue.

I think, 2 things should be accomplished here :

(1) be sure to pull the _highest_ prio task ;

i.e. the _highest_ prio task amongst all runnable (but not running) RT
tasks across all the run-queues which is capable of running on
this_cpu ;

(2) don't pull more than 1 task at once.

that said, just pull the highest prio task and run it.

---

why (2)? Just to avoid situations when tasks are being pulled/pushed
back and forth between run-queues.

Let's say we have 4 cpu system:

0:  task(10) , task(92)
1:  task(10), task(91)
2:  task(10), task(90)
3:  task(10)

when task(10) on cpu#3 is inactive, we pull task(92), task(91),
task(90) and then run task(90)... in the meantime, some of cpu[0..2]
become inactive and pull task(91) and task(92) back and run them...
that may repeat again and again depending on when/how long the task(10)s
run on their corresponding cpus...

so it seems to me that the more optimal behavior would be "don't pull
more than you can run at the moment -- that's 1".

to this goal, something like find_lock_highest_rq() would be necessary.
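A rough user-space sketch of what such a find_lock_highest_rq() could look like (the name is only the suggestion above; the toy types and the int standing in for the rq spinlock are illustrative, not kernel code). The lockless scan is only a hint, so the prio is re-checked once both locks are held, retrying on a race:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS     4
#define MAX_RT_PRIO 100

/* toy run-queue: an int stands in for the rq spinlock */
struct toy_rq {
	int locked;
	int highest_prio;       /* lower value == higher priority */
};

static struct toy_rq rqs[NR_CPUS];

static void rq_lock(struct toy_rq *rq)   { assert(!rq->locked); rq->locked = 1; }
static void rq_unlock(struct toy_rq *rq) { assert(rq->locked);  rq->locked = 0; }

/* take both locks in address order, like double_lock_balance() */
static void double_lock(struct toy_rq *a, struct toy_rq *b)
{
	if (a < b) { rq_lock(a); rq_lock(b); }
	else       { rq_lock(b); rq_lock(a); }
}

/* returns the best source rq with both locks held, or NULL */
static struct toy_rq *find_lock_highest_rq(struct toy_rq *this_rq)
{
	for (;;) {
		struct toy_rq *best = NULL;
		int best_prio = this_rq->highest_prio;

		/* lockless scan: the result is only a hint */
		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			struct toy_rq *rq = &rqs[cpu];
			if (rq != this_rq && rq->highest_prio < best_prio) {
				best = rq;
				best_prio = rq->highest_prio;
			}
		}
		if (!best)
			return NULL;    /* nothing better than what we have */

		double_lock(this_rq, best);
		if (best->highest_prio <= best_prio)
			return best;    /* still at least as good: keep both locks */

		/* raced: queue changed while unlocked; drop locks and rescan */
		rq_unlock(best);
		rq_unlock(this_rq);
	}
}
```

The retry loop is the cost of scanning without locks held; whether to bound it (or give up after a few tries) would be a policy decision.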

and I guess, {get,put}_task_struct() should be used in pull_rt_task()
for the 'next' in a similar way as it's done in push_rt_task() .


>
> [ ... ]
>

-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 6/8] pull RT tasks
  2007-10-19 18:43 ` [patch 6/8] pull RT tasks Steven Rostedt
  2007-10-19 19:24   ` Peter Zijlstra
  2007-10-21  9:35   ` Dmitry Adamushko
@ 2007-10-21 11:59   ` Dmitry Adamushko
  2007-10-22 14:05     ` Steven Rostedt
  2 siblings, 1 reply; 25+ messages in thread
From: Dmitry Adamushko @ 2007-10-21 11:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra

On 19/10/2007, Steven Rostedt <rostedt@goodmis.org> wrote:

> [ ... ]
>
> @@ -2927,6 +2927,13 @@ static void idle_balance(int this_cpu, s
>         int pulled_task = -1;
>         unsigned long next_balance = jiffies + HZ;
>
> +       /*
> +        * pull_rt_task returns true if the run queue changed.
> +        * But this does not mean we have a task to run.
> +        */
> +       if (unlikely(pull_rt_task(this_rq)) && this_rq->nr_running)
> +               return;
> +
>         for_each_domain(this_cpu, sd) {
>                 unsigned long interval;
>
> @@ -3614,6 +3621,7 @@ need_resched_nonpreemptible:
>         if (unlikely(!rq->nr_running))
>                 idle_balance(cpu, rq);
>
> +       schedule_balance_rt(rq, prev);

we do pull_rt_task() in idle_balance() so, I think, there is no need
to do it twice.
i.e.
         if (unlikely(!rq->nr_running))
                 idle_balance(cpu, rq);
+       else
+               schedule_balance_rt(rq, prev);

hum?

moreover (continuing my previous idea on "don't pull more than 1 task
at once"), I wonder whether you really see cases where more than 1 task
has been successfully _pushed_ over to other run-queues at once...

I'd expect the push/pull algorithm to naturally avoid such a
possibility. Let's say we have a few RT tasks on our run-queue that
are currently runnable (but not running)... the question is 'why are
they still here?'

(1) because the previous attempt to _push_ them failed;
(2) because they were not _pulled_ from other run-queues.

both cases should mean that other run-queues have tasks with higher
prios running at the moment.

yes, there is a tiny window in schedule() between deactivate_task() [
which can make this run-queue to look like we can push over to it ]
and idle_balance() -> pull_rt_task() _or_ schedule_balance_rt() ->
pull_rt_task() [ where this run-queue will try to pull tasks on its
own ]

_but_ the run-queue is locked in this case so we wait in
double_lock_balance() (from push_rt_task()) and run into the
competition with 'src_rq' (which is currently in the 'tiny window' as
described above trying to run pull_rt_task()) for getting both self_rq
and src_rq locks...

this way, push_rt_task() always knows the task to be pushed (so it can
be a bit optimized) --- as it's either a newly woken up RT task with
(p->prio > rq->curr->prio) _or_ a preempted RT task (so we know 'task'
for both cases).

To sum it up, I think that the pull/push algorithm should be able to
naturally accomplish the proper job pushing/pulling 1 task at once (as
described above)... any additional actions are just overhead or there
is some problem with the algorithm (ah well, or with my understanding
:-/ )


-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 6/8] pull RT tasks
  2007-10-21  9:35   ` Dmitry Adamushko
@ 2007-10-22 13:55     ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-22 13:55 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra


Hi Dmitry,

On Sun, 2007-10-21 at 11:35 +0200, Dmitry Adamushko wrote:
> Hi Steven,
> 
> > When a pull RT is initiated, all overloaded runqueues are examined for
> > an RT task that is higher in prio than the highest prio task queued on the
> > target runqueue. If another runqueue is found holding an RT task of higher
> > prio than the highest prio task on the target runqueue, that task is pulled
> > to the target runqueue.
> 
> I think, 2 things should be accomplished here :
> 
> (1) be sure to pull the _highest_ prio task ;
> 
> i.e. the _highest_ prio task amongst all runnable (but not running) RT
> tasks across all the run-queues which is capable of running on
> this_cpu ;
> 
> (2) don't pull more than 1 task at once.

I've thought about this, and played a little. The problem we have is
that we don't take locks when searching each run queue. So by the time
you get the "highest rq" to pull from, it may no longer be the highest.
So what do we do in that case? Search again?

If we are lucky and pull the highest first, then we will not pull
another one, since we are not comparing the rq prio to current, but to
the highest task on the current rq.  So once we pull a high prio task,
we only pull another one if we find an even higher prio task.

Yes, it's a little inefficient, and can cause shuffling of tasks around.
But these tasks haven't actually run yet, so there isn't too much harm. But
the benefit is that when we finish, we have the highest task that can
run on the runqueue.

> 
> that said, just pull the highest prio task and run it.
> 
> ---
> 
> why (2)? Just to avoid situations when tasks are being pulled/pushed
> back and forth between run-queues.
> 
> Let's say we have 4 cpu system:
> 
> 0:  task(10) , task(92)
> 1:  task(10), task(91)
> 2:  task(10), task(90)
> 3:  task(10)
> 
> when task(10) on cpu#3 is inactive, we pull task(92), task(91),
> task(90) and then run task(90)... in the mean time, some of cpu[0..2]
> becomes inactive and pull task(91) and task(92) back and run them...
> that may repeat again and again depending on when/how long task(10)
> run on their corresponding cpus...

I'm not sure that's too much of an issue, since we are just switching
tasks between lists. As long as they haven't run yet, it shouldn't do too
much harm. We do take a few cache hits in the queues
themselves, but alternatives to fix this would probably cause the same
effects.

> 
> so it seems to me that the more optimal behavior would be "don't pull
> more than you can run at the moment -- that's 1".
> 
> to this goal, something like find_lock_highest_rq() would be necessary.

The problem is that we have too many races to find the true highest rq
at any given moment. But by always pulling the highest we can see at the
time, we should end up with what we want.

> 
> and I guess, {get,put}_task_struct() should be used in pull_rt_task()
> for the 'next' in a similar way as it's done in push_rt_task() .

No, we don't need to do the get task on pull. With push, we start with a
task and we want to push it somewhere. On the double_lock_balance, we
might lose our rq lock, which means that next could have been put on
another run queue, run to completion, and then exited, destroying the
task; the get_task_struct is what keeps our reference valid. We still
reference that task after the double_lock_balance, and if it is not
active anymore, we finish the push.

With the pull, we are focusing on the run queue. If we had to release
the rq lock on double_lock_balance, we just pick the highest "next" on
the current run queue (which could have changed) and take the current
task on the src rq, which also could have changed. If the current task
on the src rq is higher in priority we move it over to the dst rq. So no
get/put_task_struct is needed.
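For what it's worth, the asymmetry can be sketched in user space (toy refcount; the names mirror, but are not, the real get/put_task_struct() API): the push side keeps pointing at a task across a lock drop, so it must pin it, while the pull side re-reads everything after relocking and never dereferences a possibly-freed pointer:

```c
#include <assert.h>

struct toy_task {
	int refcount;
	int exited;     /* set once the task has run to completion */
};

static void get_task(struct toy_task *t) { t->refcount++; }

static void put_task(struct toy_task *t)
{
	/* in the kernel, the task struct is freed on the final put */
	assert(t->refcount > 0);
	t->refcount--;
}

/*
 * Push-side pattern: 'next' was chosen before the lock drop and is
 * still dereferenced afterwards, so the reference is what makes the
 * post-relock check safe.  Returns 1 if the push can proceed, 0 if
 * the task is gone.
 */
static int push_pattern(struct toy_task *next)
{
	int pushable;

	get_task(next);
	/* ... double_lock_balance() may drop the rq lock here; 'next'
	 * can run elsewhere and even exit in the meantime ... */
	pushable = !next->exited;
	put_task(next);
	return pushable;
}
```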

-- Steve



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 6/8] pull RT tasks
  2007-10-21 11:59   ` Dmitry Adamushko
@ 2007-10-22 14:05     ` Steven Rostedt
  2007-10-22 22:34       ` Dmitry Adamushko
  0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2007-10-22 14:05 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra

On Sun, 2007-10-21 at 13:59 +0200, Dmitry Adamushko wrote:
> On 19/10/2007, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > [ ... ]
> >
> > @@ -2927,6 +2927,13 @@ static void idle_balance(int this_cpu, s
> >         int pulled_task = -1;
> >         unsigned long next_balance = jiffies + HZ;
> >
> > +       /*
> > +        * pull_rt_task returns true if the run queue changed.
> > +        * But this does not mean we have a task to run.
> > +        */
> > +       if (unlikely(pull_rt_task(this_rq)) && this_rq->nr_running)
> > +               return;
> > +
> >         for_each_domain(this_cpu, sd) {
> >                 unsigned long interval;
> >
> > @@ -3614,6 +3621,7 @@ need_resched_nonpreemptible:
> >         if (unlikely(!rq->nr_running))
> >                 idle_balance(cpu, rq);
> >
> > +       schedule_balance_rt(rq, prev);
> 
> we do pull_rt_task() in idle_balance() so, I think, there is no need
> to do it twice.
> i.e.
>          if (unlikely(!rq->nr_running))
>                  idle_balance(cpu, rq);
> +       else
> +               schedule_balance_rt(rq, prev);
> 
> hum?

Ah, yes. That is probably better. We don't need to do the second pull if
the idle_balance is run. Thanks.

> 
> moreover (continuing my previous idea on "don't pull more than 1 task
> at once"), I wonder whether you really see cases when more than 1 task
> have been successfully _pushed_ over to other run-queues at once...
> 
> I'd expect the push/pull algorithm to naturally avoid such a
> possibility. Let's say we have a few RT tasks on our run-queue that
> are currently runnable (but not running)... the question is 'why do
> they still here?'
> 
> (1) because the previous attempt to _push_ them failed;
> (2) because they were not _pulled_ from other run-queues.
> 
> both cases should mean that other run-queues have tasks with higher
> prios running at the moment.
> 
> yes, there is a tiny window in schedule() between deactivate_task() [
> which can make this run-queue to look like we can push over to it ]
> and idle_balance() -> pull_rt_task() _or_ schedule_balance_rt() ->
> pull_rt_task() [ where this run-queue will try to pull tasks on its
> own ]
> 
> _but_ the run-queue is locked in this case so we wait in
> double_lock_balance() (from push_rt_task()) and run into the
> competition with 'src_rq' (which is currently in the 'tiny window' as
> described above trying to run pull_rt_task()) for getting both self_rq
> and src_rq locks...
> 
> this way, push_rt_task() always knows the task to be pushed (so it can
> be a bit optimized) --- as it's either a newly woken up RT task with
> (p->prio > rq->curr->prio) _or_ a preempted RT task (so we know 'task'
> for both cases).
> 
> To sum it up, I think that the pull/push algorithm should be able to
> naturally accomplish the proper job pushing/pulling 1 task at once (as
> described above)... any additional actions are just overhead or there
> is some problem with the algorithm (ah well, or with my understanding
> :-/ )

On wakeup, we can wake up several RT tasks (as my test case does) and if
we only push one task, then the other tasks may not migrate over to the
other run queues. I logged this happening in my tests.

The pull happens when we lower our priority in the scheduler. We only
pull at that point because a push of RT tasks would never push to a rq
of higher priority, and a waiting RT task may be able to run as soon as
we lower ours. The only reason we pull more than one is to cover the
race between finding the highest prio runqueue and that task still being
the highest on the run queue when we go to pull it. Although, I admit
there are still races here. But hopefully the pushes cover them.

We then push again in finish_task_switch, simply because we may need to
push the previously running task. On wake up and schedule, we can't push
a running task: an RT task may wake up a higher priority RT task on the
same CPU it is running on, and when the higher priority RT task preempts
the current RT task, we want to push the preempted task off to another
CPU if possible. The first time this is possible is from
finish_task_switch, when that task is no longer running.

I'm currently working on getting RT overload logic to use cpusets, for
better NUMA and large CPU handling. So some of this logic will change in
the next series.

Thanks,

-- Steve

> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 6/8] pull RT tasks
  2007-10-22 14:05     ` Steven Rostedt
@ 2007-10-22 22:34       ` Dmitry Adamushko
  2007-10-23  1:16         ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Dmitry Adamushko @ 2007-10-22 22:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra

Hi Steven,

agreed with your comments in the previous message. Indeed, I missed some points.

> On wakeup, we can wake up several RT tasks (as my test case does) and if
> we only push one task, then the other tasks may not migrate over to the
> other run queues. I logged this happening in my tests.

I guess, what may happen is that while we are running push_rt_tasks()
on CPU-k (say, as a result of try_to_wake_up(task_1)),
and as this_rq->lock may get released (in double_lock_balance()), we
may get into a 'race'
with try_to_wake_up(task_2) from (another) CPU-m.
It places the woken-up task on the same run-queue (for which
push_rt_task() is already running on CPU-k) and, actually, runs
push_rt_task() for the same rq again!

So it may happen that both task_1 and task_2 will be pushed from the same CPU...

Do you see an error in my description? (it's a late hour so I can miss
something again ... sure, otherwise I'm almost perfect :-/ ) Can it
correlate with what you have observed in your tests?

Otherwise, there is a 1:1 relation: push_rt_task() is called for every
new (single) task activated by try_to_wake_up() and for a preempted
task... so it's not like a few tasks are placed on the run-queue and
then push_rt_tasks() is called once.

btw, if this scenario can take place... maybe it would make sense to
have something like RTLB_INPROGRESS/PENDING and to avoid competing
push_rt_tasks() calls for the same 'rq' from different CPUs?
(although, there can be some disadvantages here as well, e.g. we would
likely need to remove the 'max 3 tasks at once' limit and get,
theoretically, unbounded time spent in push_rt_tasks() on a single
CPU).
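A minimal sketch of what such a guard could look like (RTLB_INPROGRESS is only the suggested name, not an existing kernel symbol; here a C11 atomic_flag plays its role, letting one pusher per rq proceed while concurrent callers back off):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_rq {
	atomic_flag rtlb_inprogress;    /* hypothetical RTLB_INPROGRESS bit */
	int pushed;                     /* pushes that ran to completion */
};

static bool push_rt_tasks(struct toy_rq *rq)
{
	/* another CPU is already pushing for this rq: back off */
	if (atomic_flag_test_and_set(&rq->rtlb_inprogress))
		return false;

	rq->pushed++;                   /* stand-in for the real push loop */

	atomic_flag_clear(&rq->rtlb_inprogress);
	return true;
}
```

The flip side, as noted, is that the winning CPU now pushes on behalf of everyone, so a per-call task limit no longer bounds the work it does.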


>
> -- Steve
>

-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 6/8] pull RT tasks
  2007-10-22 22:34       ` Dmitry Adamushko
@ 2007-10-23  1:16         ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2007-10-23  1:16 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, Peter Zijlstra


--
On Tue, 23 Oct 2007, Dmitry Adamushko wrote:

> Hi Steven,
>
> agreed with your comments in the previous message. Indeed, I missed some points.
>
> > On wakeup, we can wake up several RT tasks (as my test case does) and if
> > we only push one task, then the other tasks may not migrate over to the
> > other run queues. I logged this happening in my tests.
>
> I guess, what may happen is that while we are running push_rt_tasks()
> on CPU-k (say, as a result on try_to_wake_up(task_1))
> and as this_rq->lock may get released (in double_lock_balance()) , we
> may get in a 'race'
> with try_to_wake_up(task_2) from (another) CPU-m.
> It places a woken up task on the same run-queue (for which
> push_rt_task() is already running on CPU-k) and, actually, run
> push_rt_task() for the same rq again!
>
> So it may happen that both task_1 and task_2 will be pushed from the same CPU...
>
> Do you see an error in my description? (it's a late hour so I can miss
> something again ... sure, otherwise I'm almost perfect :-/ ) Can it
> correlate with what you have observed in your tests?
>
> Otherwise, there is 1:1 relation : push_rt_task() is called for every
> new (single) task activated by try_to_wake_up() and for a preempted
> task... so it's not like a few tasks are placed on the run-queue and
> then push_rt_tasks() is called once.
>
> btw., if this scenario may take place... maybe it would make sense to
> have something like RTLB_INPROGRESS/PENDING and to avoid competing
> push_rt_tasks() calls for the same 'rq' from different CPUs?
> (although, there can be some disadvantages here as well. e.g. we would
> likely need to remove 'max 3 tasks at once' limit and get,
> theoretically, unbounded time spent in push_rt_tasks() on a single
> CPU).

I think I see what you're saying now: that we really should do the push
on wakeup, since only the one RT task that is woken up will be able to
be pushed.

My code may seem a bit aggressive, but from logging, I see that we seldom
push more than one task. But if we don't go ahead and push more than one,
the rt-migrate-test will sometimes fail.  There are funny cases when we need
to push more than one. I can't remember exactly what they were, but
sometimes, because of races between the pushes and pulls, one side will
miss the fact that an RT process is being queued while it's searching to
pull a task, but the push will catch it. Or vice versa.

The max tries is just a paranoid case where I don't want heavy scheduling
to cause unbounded attempts to push or pull tasks. According to the
logging while running the rt-migrate-tests, the loop there repeated
approximately 1 time in 20, iterated twice 1 time in 100, and hit the
third iteration only twice in all my tests (over 1000 iterations being
logged). But logging slows it down and modifies the results itself, so
perhaps I'll add statistics soon.

I'm about to post a second batch of patches.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-10-23  1:17 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-10-19 18:42 [patch 0/8] New RT Task Balancing Steven Rostedt
2007-10-19 18:42 ` [patch 1/8] Add rt_nr_running accounting Steven Rostedt
2007-10-20 16:45   ` Dmitry Adamushko
2007-10-21  2:13     ` Steven Rostedt
2007-10-19 18:42 ` [patch 2/8] track highest prio queued on runqueue Steven Rostedt
2007-10-19 19:19   ` Steven Rostedt
2007-10-19 19:45   ` Gregory Haskins
2007-10-19 19:57     ` Steven Rostedt
2007-10-20 18:14   ` Dmitry Adamushko
2007-10-21  2:19     ` Steven Rostedt
2007-10-19 18:42 ` [patch 3/8] push RT tasks Steven Rostedt
2007-10-19 18:42 ` [patch 4/8] RT overloaded runqueues accounting Steven Rostedt
2007-10-19 18:42 ` [patch 5/8] Move prototypes together Steven Rostedt
2007-10-19 18:43 ` [patch 6/8] pull RT tasks Steven Rostedt
2007-10-19 19:24   ` Peter Zijlstra
2007-10-19 19:35     ` Peter Zijlstra
2007-10-19 19:43       ` Steven Rostedt
2007-10-21  9:35   ` Dmitry Adamushko
2007-10-22 13:55     ` Steven Rostedt
2007-10-21 11:59   ` Dmitry Adamushko
2007-10-22 14:05     ` Steven Rostedt
2007-10-22 22:34       ` Dmitry Adamushko
2007-10-23  1:16         ` Steven Rostedt
2007-10-19 18:43 ` [patch 7/8] wake up balance RT Steven Rostedt
2007-10-19 18:43 ` [patch 8/8] disable CFS RT load balancing Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox