public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements
@ 2007-10-11 21:59 Gregory Haskins
  2007-10-11 21:59 ` [PATCH 1/7] RT: Push waiting rt tasks Gregory Haskins
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 21:59 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

The current series applies to 23-rt1-pre1.

This is a snapshot of the current work-in-progress for the rt-overload
enhancements.  The primary motivation for the series is to improve the
algorithm for distributing RT tasks so that the highest-priority tasks stay
active.  The current system tends to blast IPIs "shotgun" style, and we aim
to reduce that overhead where possible.  We mitigate this behavior by trying
to place tasks on the ideal runqueue before an overload even occurs.

Note that this series is *not* currently stable.  There is at least one
known bug that results in a hard lock, and that hard lock could be masking
other yet-to-be-discovered issues.

My primary motivation for sending it out right now is to share the latest
series with Peter Zijlstra and Steven Rostedt.  However, in the interest of
keeping the development open, we are sending it to a wider distribution.
Comments/suggestions from anyone are, of course, welcome.  But please note
this is not quite ready for prime-time in any capacity.

The series includes patches from both Steven and myself, with serious
input/guidance/discussion from Peter.

Regards,
-Greg



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/7] RT: Push waiting rt tasks
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
@ 2007-10-11 21:59 ` Gregory Haskins
  2007-10-11 21:59 ` [PATCH 2/7] RT: Peter Zijlstra's suggested improvements to rt-push patch Gregory Haskins
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 21:59 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

From: Steven Rostedt <rostedt@goodmis.org>

This has been compile tested (and no more ;-)


The idea here is this: when we have just scheduled in an RT task, we may
have either pushed a lesser RT task away, or more than one RT task may have
been scheduled on this CPU before scheduling occurred.  Either way, another
RT task may be left waiting on this runqueue.

What this patch does is an O(n) search of CPUs for the CPU with the
lowest-priority task running.  When that CPU is found, the next-highest RT
task is pushed to that CPU.

Some notes:

1) no lock is taken while looking for the lowest priority CPU. When one
is found, only that CPU's lock is taken and after that a check is made
to see if it is still a candidate to push the RT task over. If not, we
try the search again, for a max of 3 tries.

2) I only do this for the second highest RT task on the CPU queue. This
can be easily changed to do it for all RT tasks until no more can be
pushed off to other CPUs.

This is a simple approach right now, and is only being posted for
comments.  I'm sure more can be done to make this more efficient or just
simply better.

-- Steve

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---

 kernel/sched.c    |   85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_rt.c |   42 ++++++++++++++++++++++++++
 2 files changed, 126 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 93fd6de..0f0af6d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -304,6 +304,7 @@ struct rq {
 #ifdef CONFIG_PREEMPT_RT
 	unsigned long rt_nr_running;
 	unsigned long rt_nr_uninterruptible;
+	int curr_prio;
 #endif
 
 	unsigned long switch_timestamp;
@@ -1485,6 +1486,87 @@ next_in_queue:
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 
 /*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next_task;
+	struct rq *lowest_rq = NULL;
+	int tries;
+	int cpu;
+	int dst_cpu = -1;
+	int ret = 0;
+
+	BUG_ON(!spin_is_locked(&this_rq->lock));
+
+	next_task = rt_next_highest_task(this_rq);
+	if (!next_task)
+		return 0;
+
+	/* We might release this_rq lock */
+	get_task_struct(next_task);
+
+	/* Only try this algorithm three times */
+	for (tries = 0; tries < 3; tries++) {
+		/*
+		 * Scan each rq for the lowest prio.
+		 */
+		for_each_cpu_mask(cpu, next_task->cpus_allowed) {
+			struct rq *rq = &per_cpu(runqueues, cpu);
+
+			if (cpu == smp_processor_id())
+				continue;
+
+			/* no locking for now */
+			if (rq->curr_prio > next_task->prio &&
+			    (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
+				dst_cpu = cpu;
+				lowest_rq = rq;
+			}
+		}
+
+		if (!lowest_rq)
+			break;
+
+		if (double_lock_balance(this_rq, lowest_rq)) {
+			/*
+			 * We had to unlock the run queue. In
+			 * the mean time, next_task could have
+			 * migrated already or had its affinity changed.
+			 */
+			if (unlikely(task_rq(next_task) != this_rq ||
+				     !cpu_isset(dst_cpu, next_task->cpus_allowed))) {
+				spin_unlock(&lowest_rq->lock);
+				break;
+			}
+		}
+
+		/* if the prio of this runqueue changed, try again */
+		if (lowest_rq->curr_prio <= next_task->prio) {
+			spin_unlock(&lowest_rq->lock);
+			continue;
+		}
+
+		deactivate_task(this_rq, next_task, 0);
+		set_task_cpu(next_task, dst_cpu);
+		activate_task(lowest_rq, next_task, 0);
+
+		set_tsk_need_resched(lowest_rq->curr);
+
+		spin_unlock(&lowest_rq->lock);
+		ret = 1;
+
+		break;
+	}
+
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+/*
  * Pull RT tasks from other CPUs in the RT-overload
  * case. Interrupts are disabled, local rq is locked.
  */
@@ -2207,7 +2289,8 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * If we pushed an RT task off the runqueue,
 	 * then kick other CPUs, they might run it:
 	 */
-	if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
+	rq->curr_prio = current->prio;
+	if (unlikely(rt_task(current) && push_rt_task(rq))) {
 		schedstat_inc(rq, rto_schedule);
 		smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
 	}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 369827b..6eca644 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -96,6 +96,48 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 	return next;
 }
 
+#ifdef CONFIG_PREEMPT_RT
+static struct task_struct *rt_next_highest_task(struct rq *rq)
+{
+	struct rt_prio_array *array = &rq->rt.active;
+	struct task_struct *next;
+	struct list_head *queue;
+	int idx;
+
+	if (likely (rq->rt_nr_running < 2))
+		return NULL;
+
+	idx = sched_find_first_bit(array->bitmap);
+	if (idx >= MAX_RT_PRIO) {
+		WARN_ON(1); /* rt_nr_running is bad */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	if (queue->next->next != queue) {
+		/* same prio task */
+		next = list_entry(queue->next->next, struct task_struct, run_list);
+		goto out;
+	}
+
+	/* slower, but more flexible */
+	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+	if (idx >= MAX_RT_PRIO) {
+		WARN_ON(1); /* rt_nr_running was 2 and above! */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+
+ out:
+	return next;
+	
+}
+#else  /* CONFIG_PREEMPT_RT */
+
+#endif /* CONFIG_PREEMPT_RT */
+
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);



* [PATCH 2/7] RT: Peter Zijlstra's suggested improvements to rt-push patch
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
  2007-10-11 21:59 ` [PATCH 1/7] RT: Push waiting rt tasks Gregory Haskins
@ 2007-10-11 21:59 ` Gregory Haskins
  2007-10-11 21:59 ` [PATCH 3/7] RT: Only consider online CPUs when pushing rt tasks Gregory Haskins
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 21:59 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

This is my own interpretation of Peter's recommended changes to Steven's
push-rt patch.  Just to be clear, Peter does not endorse this patch unless he
himself specifically says so ;).

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0f0af6d..24b38b4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1499,7 +1499,7 @@ static int push_rt_task(struct rq *this_rq)
 	int dst_cpu = -1;
 	int ret = 0;
 
-	BUG_ON(!spin_is_locked(&this_rq->lock));
+	assert_spin_locked(&this_rq->lock);
 
 	next_task = rt_next_highest_task(this_rq);
 	if (!next_task)
@@ -1553,7 +1553,8 @@ static int push_rt_task(struct rq *this_rq)
 		set_task_cpu(next_task, dst_cpu);
 		activate_task(lowest_rq, next_task, 0);
 
-		set_tsk_need_resched(lowest_rq->curr);
+		resched_task(lowest_rq->curr);
+		schedstat_inc(this_rq, rto_schedule);
 
 		spin_unlock(&lowest_rq->lock);
 		ret = 1;
@@ -2290,10 +2291,9 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * then kick other CPUs, they might run it:
 	 */
 	rq->curr_prio = current->prio;
-	if (unlikely(rt_task(current) && push_rt_task(rq))) {
-		schedstat_inc(rq, rto_schedule);
-		smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
-	}
+	if (unlikely(rq->rt_nr_running > 1))
+		push_rt_task(rq);
+
 #endif
 	prev_state = prev->state;
 	_finish_arch_switch(prev);



* [PATCH 3/7] RT: Only consider online CPUs when pushing rt tasks
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
  2007-10-11 21:59 ` [PATCH 1/7] RT: Push waiting rt tasks Gregory Haskins
  2007-10-11 21:59 ` [PATCH 2/7] RT: Peter Zijlstra's suggested improvements to rt-push patch Gregory Haskins
@ 2007-10-11 21:59 ` Gregory Haskins
  2007-10-12 12:10   ` Steven Rostedt
  2007-10-11 22:00 ` [PATCH 4/7] RT: Add a per-cpu rt_overload indication Gregory Haskins
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 21:59 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

task->cpus_allowed can have bit positions that are set for CPUs that are
not currently online.  So we optimize our search by ANDing against the
online set.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 24b38b4..0ca3905 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1510,10 +1510,14 @@ static int push_rt_task(struct rq *this_rq)
 
 	/* Only try this algorithm three times */
 	for (tries = 0; tries < 3; tries++) {
+		cpumask_t mask;
+
+		cpus_and(mask, cpu_online_map, next_task->cpus_allowed);
+
 		/*
 		 * Scan each rq for the lowest prio.
 		 */
-		for_each_cpu_mask(cpu, next_task->cpus_allowed) {
+		for_each_cpu_mask(cpu, mask) {
 			struct rq *rq = &per_cpu(runqueues, cpu);
 
 			if (cpu == smp_processor_id())



* [PATCH 4/7] RT: Add a per-cpu rt_overload indication
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
                   ` (2 preceding siblings ...)
  2007-10-11 21:59 ` [PATCH 3/7] RT: Only consider online CPUs when pushing rt tasks Gregory Haskins
@ 2007-10-11 22:00 ` Gregory Haskins
  2007-10-11 22:00 ` [PATCH 5/7] RT: CPU priority management Gregory Haskins
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 22:00 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

The system currently evaluates all online CPUs whenever one or more enters
an rt_overload condition.  This suffers from scalability limitations as
the # of online CPUs increases.  So we introduce a cpumask to track
exactly which CPUs need RT balancing.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---

 kernel/sched.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0ca3905..41b0e9c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -632,6 +632,7 @@ static inline struct rq *this_rq_lock(void)
 
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 static __cacheline_aligned_in_smp atomic_t rt_overload;
+static cpumask_t rto_cpus;
 #endif
 
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -640,8 +641,11 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 	if (rt_task(p)) {
 		rq->rt_nr_running++;
 # ifdef CONFIG_SMP
-		if (rq->rt_nr_running == 2)
+		if (rq->rt_nr_running == 2) {
+			cpu_set(rq->cpu, rto_cpus);
+			smp_wmb();
 			atomic_inc(&rt_overload);
+		}
 # endif
 	}
 #endif
@@ -654,8 +658,10 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
 		WARN_ON(!rq->rt_nr_running);
 		rq->rt_nr_running--;
 # ifdef CONFIG_SMP
-		if (rq->rt_nr_running == 1)
+		if (rq->rt_nr_running == 1) {
 			atomic_dec(&rt_overload);
+			cpu_clear(rq->cpu, rto_cpus);
+		}
 # endif
 	}
 #endif
@@ -1590,7 +1596,7 @@ static void balance_rt_tasks(struct rq *this_rq, int this_cpu)
 	 */
 	next = pick_next_task(this_rq, this_rq->curr);
 
-	for_each_online_cpu(cpu) {
+	for_each_cpu_mask(cpu, rto_cpus) {
 		if (cpu == this_cpu)
 			continue;
 		src_rq = cpu_rq(cpu);



* [PATCH 5/7] RT: CPU priority management
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
                   ` (3 preceding siblings ...)
  2007-10-11 22:00 ` [PATCH 4/7] RT: Add a per-cpu rt_overload indication Gregory Haskins
@ 2007-10-11 22:00 ` Gregory Haskins
  2007-10-11 22:00 ` [PATCH 6/7] RT: Convert Steve's rt-push infrastructure to cpupri Gregory Haskins
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 22:00 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

This code tracks the priority of each CPU so that global migration
decisions are easy to calculate.  Each CPU can be in a state as follows:

               (INVALID), IDLE, NORMAL, RT1, ... RT99

going from the lowest priority to the highest.  CPUs in the INVALID state
are not eligible for routing.  The system maintains this state with
a 2 dimensional bitmap (the first for priority class, the second for cpus
in that class).  Therefore a typical application without affinity
restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
searches).  For tasks with affinity restrictions, the algorithm has a
worst case complexity of O(min(102, NR_CPUS)), though the scenario that
yields the worst case search is fairly contrived.

Because this type of data structure is going to be cache/lock hot,
certain design considerations were made to mitigate this overhead, such
as:  seqlocks, per_cpu data to avoid cacheline contention, avoiding locks
in the update code when possible, etc.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/cpupri.h |   25 +++++
 kernel/Kconfig.preempt |   11 ++
 kernel/Makefile        |    1 
 kernel/cpupri.c        |  220 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c         |   36 ++++++++
 5 files changed, 293 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpupri.h b/include/linux/cpupri.h
new file mode 100644
index 0000000..1ca5007
--- /dev/null
+++ b/include/linux/cpupri.h
@@ -0,0 +1,25 @@
+#ifndef _LINUX_CPUPRI_H
+#define _LINUX_CPUPRI_H
+
+#include <linux/sched.h>
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+
+#define CPUPRI_INVALID -2
+#define CPUPRI_IDLE    -1
+#define CPUPRI_NORMAL   0
+/* values 1-99 are RT priorities */
+
+#ifdef CONFIG_CPU_PRIORITIES
+int cpupri_find_best(int cpu, int pri, struct task_struct *p);
+void cpupri_set(int cpu, int pri);
+int cpupri_get(int cpu);
+void cpupri_init(void);
+#else
+#define cpupri_find_best(cpu, pri, tsk) cpu
+#define cpupri_set(cpu, pri)            do { } while(0)
+#define cpupri_get(cpu)                 CPUPRI_INVALID
+#define cpupri_init()                   do { } while(0)
+#endif /* CONFIG_CPU_PRIORITIES */
+
+#endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 2316f28..5397e59 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -197,3 +197,14 @@ config RCU_TRACE
 	  Say Y here if you want to enable RCU tracing
 	  Say N if you are unsure.
 
+config CPU_PRIORITIES
+       bool "Enable CPU priority management"
+       default n
+       help
+         This option allows the scheduler to efficiently track the absolute
+	 priority of the current task on each CPU.  This helps it to make
	 global decisions for real-time tasks before an overload conflict
+	 actually occurs.
+
+	 Say Y here if you want to enable priority management
+	 Say N if you are unsure.
\ No newline at end of file
diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..63aaaf5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_CPU_PRIORITIES) += cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/cpupri.c b/kernel/cpupri.c
new file mode 100644
index 0000000..46a8758
--- /dev/null
+++ b/kernel/cpupri.c
@@ -0,0 +1,220 @@
+/*
+ *  kernel/cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins <ghaskins@novell.com>
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ *                 (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  Because this type of data structure is going to be cache/lock hot,
+ *  certain design considerations were made to mitigate this overhead, such
+ *  as:  seqlocks, per_cpu data to avoid cacheline contention, avoiding locks
+ *  in the update code when possible, etc.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include <linux/cpupri.h>
+#include <asm/idle.h>
+
+struct cpu_priority {
+	raw_seqlock_t lock;
+	cpumask_t     pri_to_cpu[CPUPRI_NR_PRIORITIES];
+	long          pri_active[BITS_TO_LONGS(CPUPRI_NR_PRIORITIES)];
+};
+
+static DEFINE_PER_CPU(int, cpu_to_pri);
+
+static __cacheline_aligned_in_smp struct cpu_priority cpu_priority;
+
+#define for_each_cpupri_active(array, idx)                   \
+  for( idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);    \
+       idx < CPUPRI_NR_PRIORITIES;                           \
+       idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find_best - find the best (lowest-pri) CPU in the system
+ * @cpu: The recommended/default CPU
+ * @task_pri: The priority of the task being scheduled (IDLE-RT99)
+ * @p: The task being scheduled
+ *
+ * Note: This function returns the recommended CPU as calculated during the
+ * current invocation.  By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times.  While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)cpu - The recommended cpu to accept the task
+ */
+int cpupri_find_best(int def_cpu, int task_pri, struct task_struct *p)
+{
+	int                  idx      = 0;
+	struct cpu_priority *cp       = &cpu_priority;
+	int                  this_cpu = smp_processor_id();
+	int                  cpu      = def_cpu;
+	unsigned long        seq;
+
+	do {
+		seq = read_seqbegin(&cp->lock);
+
+		for_each_cpupri_active(cp->pri_active, idx) {
+			cpumask_t mask;
+			int       lowest_pri = idx-1;
+			
+			if (lowest_pri > task_pri)
+				break;
+			
+			cpus_and(mask, p->cpus_allowed, cp->pri_to_cpu[idx]);
+			
+			if (!cpus_empty(mask)) {
+				/*
+				 * We select a CPU in the following priority:
+				 *
+				 *       def_cpu, this_cpu, first_cpu
+				 *
+				 * for efficiency.  def_cpu preserves cache
+				 * affinity, and this_cpu is cheaper to preempt
+				 * (note that sometimes they are the same).
+				 * Finally, we will take whatever is available
+				 * if the first two don't pan out.
+				 */
+				if (cpu_isset(def_cpu, mask))
+					break;
+				
+				if (cpu_isset(this_cpu, mask)) {
+					cpu = this_cpu;
+					break;
+				}
+
+				cpu = first_cpu(mask);
+				break;
+			}
+		}
+	} while (unlikely(read_seqretry(&cp->lock, seq)));
+
+	return cpu;
+}
+
+/**
+ * cpupri_set - update the cpu priority setting
+ * @cpu: The target cpu
+ * @pri: The priority (INVALID-RT99) to assign to this CPU
+ *
+ * Note: Assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpupri_set(int cpu, int pri)
+{
+	struct cpu_priority *cp   = &cpu_priority;
+	int                 *cpri = &per_cpu(cpu_to_pri, cpu);
+
+	/*
+	 * It's safe to check the CPU priority outside the seqlock because
+	 * it can only be modified while holding the rq->lock for this cpu
+	 */
+	if (*cpri != pri) {
+		int           oldpri = *cpri;
+		unsigned long flags;
+		
+		write_seqlock_irqsave(&cp->lock, flags);
+
+		/*
+		 * If the cpu was currently mapped to a different value, we
+		 * first need to unmap the old value
+		 */
+		if (likely(oldpri != CPUPRI_INVALID)) {
+			int        idx  = oldpri+1;
+			cpumask_t *mask = &cp->pri_to_cpu[idx];
+
+			cpu_clear(cpu, *mask);
+			if (cpus_empty(*mask))
+				__clear_bit(idx, cp->pri_active);
+		}
+
+		if (likely(pri != CPUPRI_INVALID)) {
+			int        idx  = pri+1;
+			cpumask_t *mask = &cp->pri_to_cpu[idx];
+
+			cpu_set(cpu, *mask);
+			__set_bit(idx, cp->pri_active);
+		}
+
+		write_sequnlock_irqrestore(&cp->lock, flags);
+
+		*cpri = pri;
+	}
+}
+
+/**
+ * cpupri_get - read the current cpu priority setting
+ * @cpu: The target cpu
+ *
+ * Note: Assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (int)pri - The current priority of this cpu
+ */
+int cpupri_get(int cpu)
+{
+	return per_cpu(cpu_to_pri, cpu);
+}
+
+#if 0
+static int cpupri_idle(struct notifier_block *b, unsigned long event, void *v)
+{
+	// FIXME: We now need to hold the rq->lock before calling!
+	if (event == IDLE_START)
+		cpupri_set(raw_smp_processor_id(), CPUPRI_IDLE);
+
+	return 0;
+}
+
+static struct notifier_block cpupri_idle_notifier = {
+	.notifier_call = cpupri_idle
+};
+#endif
+
+/**
+ * cpupri_init - initialize the cpupri subsystem
+ *
+ * This must be called during the scheduler initialization before the 
+ * other methods may be used.
+ *
+ * Returns: (void)
+ */
+void cpupri_init(void)
+{
+	struct cpu_priority *cp = &cpu_priority;
+	int i;
+
+	memset(cp, 0, sizeof(*cp));
+
+	raw_seqlock_init(&cp->lock);
+
+	for_each_possible_cpu(i) {
+		per_cpu(cpu_to_pri, i) = CPUPRI_INVALID;
+	}
+
+	// idle_notifier_register(&cpupri_idle_notifier);
+}
+
+
diff --git a/kernel/sched.c b/kernel/sched.c
index 41b0e9c..0065551 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -64,6 +64,7 @@
 #include <linux/delayacct.h>
 #include <linux/reciprocal_div.h>
 #include <linux/unistd.h>
+#include <linux/cpupri.h>
 
 #include <asm/tlb.h>
 
@@ -474,6 +475,37 @@ static inline void set_task_cfs_rq(struct task_struct *p)
 }
 #endif
 
+static int calc_task_cpupri(struct rq *rq, struct task_struct *p)
+{
+	int pri;
+
+	if (rt_task(p))
+		pri = p->rt_priority;
+	else if (p == rq->idle)
+		pri = CPUPRI_IDLE;
+	else
+		pri = CPUPRI_NORMAL;
+
+	return pri;
+}
+
+#ifdef CONFIG_CPU_PRIORITIES
+static void set_cpupri(struct rq *rq, struct task_struct *p)
+{
+	int pri = calc_task_cpupri(rq, p);
+	cpupri_set(rq->cpu, pri);
+}
+
+static int get_cpupri(struct rq *rq)
+{
+	return cpupri_get(rq->cpu);
+}
+
+#else
+#define set_cpupri(rq, task) do { } while (0)
+#define get_cpupri(rq)         CPUPRI_INVALID
+#endif
+
 /*
  * We really dont want to do anything complex within switch_to()
  * on PREEMPT_RT - this check enforces this.
@@ -2259,6 +2291,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	set_cpupri(rq, next);
 }
 
 /**
@@ -4603,6 +4636,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		 * this runqueue and our priority is higher than the current's
 		 */
 		if (task_running(rq, p)) {
+			set_cpupri(rq, p);
 			if (p->prio > oldprio)
 				resched_task(rq->curr);
 		} else {
@@ -7290,6 +7324,8 @@ void __init sched_init(void)
 	int highest_cpu = 0;
 	int i, j;
 
+	cpupri_init();
+
 	/*
 	 * Link up the scheduling class hierarchy:
 	 */



* [PATCH 6/7] RT: Convert Steve's rt-push infrastructure to cpupri
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
                   ` (4 preceding siblings ...)
  2007-10-11 22:00 ` [PATCH 5/7] RT: CPU priority management Gregory Haskins
@ 2007-10-11 22:00 ` Gregory Haskins
  2007-10-11 22:00 ` [PATCH 7/7] RT: push-rt enhancements Gregory Haskins
  2007-10-12 10:29 ` [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Peter Zijlstra
  7 siblings, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 22:00 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

Normalize the CPU priority system between the two search algorithms, and
modularize the search function within push_rt_task.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   91 ++++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 61 insertions(+), 30 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0065551..e8942c5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -24,6 +24,7 @@
  *              by Peter Williams
  *  2007-05-06  Interactivity improvements to CFS by Mike Galbraith
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
+ *  2007-10-11  RT overload enhancements by Steven Rostedt, Greg Haskins
  */
 
 #include <linux/mm.h>
@@ -305,8 +306,10 @@ struct rq {
 #ifdef CONFIG_PREEMPT_RT
 	unsigned long rt_nr_running;
 	unsigned long rt_nr_uninterruptible;
+#ifndef CONFIG_CPU_PRIORITIES
 	int curr_prio;
 #endif
+#endif
 
 	unsigned long switch_timestamp;
 	unsigned long slice_avg;
@@ -501,9 +504,48 @@ static int get_cpupri(struct rq *rq)
 	return cpupri_get(rq->cpu);
 }
 
+static int find_lowest_cpu(int defcpu, int pri, struct task_struct *p)
+{
+	return cpupri_find_best(defcpu, pri, p);
+}
 #else
-#define set_cpupri(rq, task) do { } while (0)
-#define get_cpupri(rq)         CPUPRI_INVALID
+static void set_cpupri(struct rq *rq, struct task_struct *p)
+{
+	rq->curr_prio = calc_task_cpupri(rq, p);
+}
+
+static int get_cpupri(struct rq *rq)
+{
+	return rq->curr_prio;
+}
+
+static int find_lowest_cpu(int defcpu, int pri, struct task_struct *p)
+{
+	cpumask_t  mask;
+	int        cpu;
+	struct rq *lowest_rq = NULL;
+
+	cpus_and(mask, cpu_online_map, p->cpus_allowed);
+
+	/*
+	 * Scan each rq for the lowest prio.
+	 */
+	for_each_cpu_mask(cpu, mask) {
+		struct rq *rq = &per_cpu(runqueues, cpu);
+		
+		if (cpu == smp_processor_id())
+			continue;
+		
+		/* no locking for now */
+		if (rq->curr_prio < pri &&
+		    (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
+			lowest_rq = rq;
+		}
+	}
+
+	return lowest_rq ? lowest_rq->cpu : defcpu;
+}
+
 #endif
 
 /*
@@ -1531,11 +1573,9 @@ static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 static int push_rt_task(struct rq *this_rq)
 {
 	struct task_struct *next_task;
-	struct rq *lowest_rq = NULL;
 	int tries;
-	int cpu;
-	int dst_cpu = -1;
 	int ret = 0;
+	int pri;
 
 	assert_spin_locked(&this_rq->lock);
 
@@ -1546,32 +1586,19 @@ static int push_rt_task(struct rq *this_rq)
 	/* We might release this_rq lock */
 	get_task_struct(next_task);
 
+	pri = calc_task_cpupri(this_rq, next_task);
+
 	/* Only try this algorithm three times */
 	for (tries = 0; tries < 3; tries++) {
-		cpumask_t mask;
-
-		cpus_and(mask, cpu_online_map, next_task->cpus_allowed);
-
-		/*
-		 * Scan each rq for the lowest prio.
-		 */
-		for_each_cpu_mask(cpu, mask) {
-			struct rq *rq = &per_cpu(runqueues, cpu);
-
-			if (cpu == smp_processor_id())
-				continue;
-
-			/* no locking for now */
-			if (rq->curr_prio > next_task->prio &&
-			    (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
-				dst_cpu = cpu;
-				lowest_rq = rq;
-			}
-		}
+		struct rq *lowest_rq;
+		int        cpu;
 
-		if (!lowest_rq)
+		cpu = find_lowest_cpu(this_rq->cpu, pri, next_task); 
+		if (cpu == this_rq->cpu)
 			break;
 
+		lowest_rq = cpu_rq(cpu);
+
 		if (double_lock_balance(this_rq, lowest_rq)) {
 			/*
 			 * We had to unlock the run queue. In
@@ -1579,20 +1606,20 @@ static int push_rt_task(struct rq *this_rq)
 			 * migrated already or had its affinity changed.
 			 */
 			if (unlikely(task_rq(next_task) != this_rq ||
-				     !cpu_isset(dst_cpu, next_task->cpus_allowed))) {
+				     !cpu_isset(cpu, next_task->cpus_allowed))) {
 				spin_unlock(&lowest_rq->lock);
 				break;
 			}
 		}
 
 		/* if the prio of this runqueue changed, try again */
-		if (lowest_rq->curr_prio <= next_task->prio) {
+		if (get_cpupri(lowest_rq) > pri) {
 			spin_unlock(&lowest_rq->lock);
 			continue;
 		}
 
 		deactivate_task(this_rq, next_task, 0);
-		set_task_cpu(next_task, dst_cpu);
+		set_task_cpu(next_task, lowest_rq->cpu);
 		activate_task(lowest_rq, next_task, 0);
 
 		resched_task(lowest_rq->curr);
@@ -2333,7 +2360,7 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * If we pushed an RT task off the runqueue,
 	 * then kick other CPUs, they might run it:
 	 */
-	rq->curr_prio = current->prio;
+
 	if (unlikely(rq->rt_nr_running > 1))
 		push_rt_task(rq);
 
@@ -7371,6 +7398,10 @@ void __init sched_init(void)
 		highest_cpu = i;
 		/* delimiter for bitsearch: */
 		__set_bit(MAX_RT_PRIO, array->bitmap);
+
+#if defined(CONFIG_PREEMPT_RT) && !defined(CONFIG_CPU_PRIORITIES)
+		rq->curr_prio = CPUPRI_INVALID;
+#endif
 	}
 
 	set_load_weight(&init_task);



* [PATCH 7/7] RT: push-rt enhancements
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
                   ` (5 preceding siblings ...)
  2007-10-11 22:00 ` [PATCH 6/7] RT: Convert Steve's rt-push infrastructure to cpupri Gregory Haskins
@ 2007-10-11 22:00 ` Gregory Haskins
  2007-10-12 10:29 ` [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Peter Zijlstra
  7 siblings, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-11 22:00 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra; +Cc: linux-rt-users, LKML, Gregory Haskins

There are three events that require consideration for redistributing RT tasks:

1) When one or more higher-priority tasks preempt a lower-priority one on a RQ
2) When a lower-priority task is woken up on a RQ
3) When a RQ downgrades its current priority

Steve Rostedt's push_rt patch addresses (1).  It hooks in right after a new
task has been switched-in.  If this was the result of an RT preemption, or if
more than one task was awoken at the same time, we can try to push some of
those other tasks away.

This patch addresses (2).  When we wake up a task, we check to see if it would
preempt the current task on the queue.  If it will not, we attempt to find a
better suited CPU (e.g. one running something lower priority than the task
being woken) and try to activate the task there.

Finally, we have (3).  In theory, we only need to balance_rt_tasks() if the
following conditions are met:
   1) One or more CPUs are in overload, AND
   2) We are about to switch to a task that lowers our priority.

(3) will be addressed in a later patch.

In addition, this patch enhances the behavior of the push_rt feature such
that it will try to drain as many tasks as it can until either there are no
more pushable tasks or equilibrium is achieved.  The original logic only
tried to push one task per event.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   69 ++++++++++++++++++++++++++++++++++----------------------
 1 files changed, 42 insertions(+), 27 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e8942c5..8a94a07 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1636,6 +1636,12 @@ static int push_rt_task(struct rq *this_rq)
 	return ret;
 }
 
+/* Push all tasks that we can to other CPUs */
+static void push_rt_tasks(struct rq *this_rq)
+{
+	while (push_rt_task(this_rq));
+}
+
 /*
  * Pull RT tasks from other CPUs in the RT-overload
  * case. Interrupts are disabled, local rq is locked.
@@ -1996,7 +2002,31 @@ out_set_cpu:
 		this_cpu = smp_processor_id();
 		cpu = task_cpu(p);
 	}
-
+	
+	/*
+	 * If a newly woken up RT task cannot preempt the
+	 * current (RT) task (on a target runqueue) then try
+	 * to find another CPU it can preempt:
+	 */
+	if (rt_task(p) && !TASK_PREEMPTS_CURR(p, rq)) {
+		int pri = calc_task_cpupri(rq, p);
+
+		new_cpu = find_lowest_cpu(rq->cpu, pri, p);
+		if (new_cpu != cpu) {
+			set_task_cpu(p, new_cpu);
+			task_rq_unlock(rq, &flags);
+			/* might preempt at this point */
+			rq = task_rq_lock(p, &flags);
+			old_state = p->state;
+			if (!(old_state & state))
+				goto out;
+			if (p->se.on_rq)
+				goto out_running;
+			
+			this_cpu = smp_processor_id();
+			cpu = task_cpu(p);
+		}
+	}
 out_activate:
 #endif /* CONFIG_SMP */
 	update_rq_clock(rq);
@@ -2010,30 +2040,13 @@ out_activate:
 	 * to find another CPU it can preempt:
 	 */
 	if (rt_task(p) && !TASK_PREEMPTS_CURR(p, rq)) {
-		struct rq *this_rq = cpu_rq(this_cpu);
 		/*
-		 * Special-case: the task on this CPU can be
-		 * preempted. In that case there's no need to
-		 * trigger reschedules on other CPUs, we can
-		 * mark the current task for reschedule.
-		 *
-		 * (Note that it's safe to access this_rq without
-		 * extra locking in this particular case, because
-		 * we are on the current CPU.)
+		 * FIXME: Do we still need to do this here anymore, or
+		 * does the preemption-check above suffice.  The path
+		 * that makes my head hurt is when we have the
+		 * task_running->out_activate path
 		 */
-		if (TASK_PREEMPTS_CURR(p, this_rq))
-			set_tsk_need_resched(this_rq->curr);
-		else
-			/*
-			 * Neither the intended target runqueue
-			 * nor the current CPU can take this task.
-			 * Trigger a reschedule on all other CPUs
-			 * nevertheless, maybe one of them can take
-			 * this task:
-			 */
-			smp_send_reschedule_allbutself_cpumask(p->cpus_allowed);
-
-		schedstat_inc(this_rq, rto_wakeup);
+		push_rt_tasks(rq);
 	} else {
 		/*
 		 * Sync wakeups (i.e. those types of wakeups where the waker
@@ -2357,12 +2370,14 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 */
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 	/*
-	 * If we pushed an RT task off the runqueue,
-	 * then kick other CPUs, they might run it:
+	 * If there are more than one RT tasks pending, try to push some off
+	 * to other lower-pri CPUs
+	 *
+	 * This covers the case where a high-pri has preempted a lower-pri
+	 * or if multiple RTs are awoken at the same time.
 	 */
-
 	if (unlikely(rq->rt_nr_running > 1))
-		push_rt_task(rq);
+		push_rt_tasks(rq);
 
 #endif
 	prev_state = prev->state;



* Re: [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements
  2007-10-11 21:59 [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Gregory Haskins
                   ` (6 preceding siblings ...)
  2007-10-11 22:00 ` [PATCH 7/7] RT: push-rt enhancements Gregory Haskins
@ 2007-10-12 10:29 ` Peter Zijlstra
  2007-10-12 11:47   ` Steven Rostedt
  2007-10-12 14:05   ` Gregory Haskins
  7 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2007-10-12 10:29 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Steven Rostedt, linux-rt-users, LKML

Hi Gregory,

On Thu, 2007-10-11 at 17:59 -0400, Gregory Haskins wrote:
> The current series applies to 23-rt1-pre1.
> 
> This is a snapshot of the current work-in-progress for the rt-overload
> enhancements.  The primary motivation for the series to to improve the
> algorithm for distributing RT tasks to keep the highest tasks active.  The
> current system tends to blast IPIs "shotgun" style, and we aim to reduce that
> overhead where possible.  We mitigate this behavior by trying to place tasks
> on the ideal runqueue before an overload even occurs.
> 
> Note that this series is *not* currently stable.  There is at
> least one bug resulting in a hard-lock.  And the hard-lock could be masking
> other yet-to-be-discovered issues.
> 
> My primary motivation for sending it out right now is to share the latest
> series with Peter Zijlstra and Steven Rostedt.  However, in the interest of
> keeping the development open we are sending to a wider distribution.
> Comments/suggestions from anyone are, of course, welcome.  But please note
> this is not quite ready for prime-time in any capacity. 
> 
> The series includes patches from both Steven and myself, with serious
> input/guidance/discussion from Peter.

I'm wondering why we need the cpu prio management stuff. I'm thinking we
might just use any cpus_allowed cpu that has a lesser priority than the
task we're trying to migrate.

And for that, steve's rq->curr_prio field seems quite suitable.

so instead of the:
  for (3 tries)
    find lowest cpu
    try push

we do:

  cpu_hotplug_lock();
  cpus_and(mask, p->cpus_allowed, online_cpus);
  for_each_cpu_mask(i, mask) {
    if (cpu_rq(i)->curr_prio > p->prio && push_task(p, i))
      break;
  }
  cpu_hotplug_unlock();





* Re: [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements
  2007-10-12 10:29 ` [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Peter Zijlstra
@ 2007-10-12 11:47   ` Steven Rostedt
  2007-10-12 13:20     ` Gregory Haskins
  2007-10-12 14:05   ` Gregory Haskins
  1 sibling, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2007-10-12 11:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Gregory Haskins, linux-rt-users, LKML


--

On Fri, 12 Oct 2007, Peter Zijlstra wrote:

>
> And for that, steve's rq->curr_prio field seems quite suitable.
>
> so instead of the:
>   for (3 tries)
>     find lowest cpu
>     try push
>
> we do:
>
>   cpu_hotplug_lock();
>   cpus_and(mask, p->cpus_allowed, online_cpus);
>   for_each_cpu_mask(i, mask) {
>     if (cpu_rq(i)->curr_prio > p->prio && push_task(p, i))
>       break;
>   }
>   cpu_hotplug_unlock();

The thing I'm worried about is that we push off an rt task that is higher
in prio than another rt task on another cpu, and we'll cause a bunch of
rt task bouncing.  That is what I'm trying to avoid.

BTW, my logging has shown that I have yet to hit a 2nd try (but I admit,
this is a very limited test set).

-- Steve


* Re: [PATCH 3/7] RT: Only consider online CPUs when pushing rt tasks
  2007-10-11 21:59 ` [PATCH 3/7] RT: Only consider online CPUs when pushing rt tasks Gregory Haskins
@ 2007-10-12 12:10   ` Steven Rostedt
  0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2007-10-12 12:10 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Peter Zijlstra, linux-rt-users, LKML


--

On Thu, 11 Oct 2007, Gregory Haskins wrote:

> task->cpus_allowed can have bit positions that are set for CPUs that are
> not currently online.  So we optimze our search by ANDing against the online
> set.
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
>
>  kernel/sched.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 24b38b4..0ca3905 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1510,10 +1510,14 @@ static int push_rt_task(struct rq *this_rq)
>
>  	/* Only try this algorithm three times */
>  	for (tries = 0; tries < 3; tries++) {
> +		cpumask_t mask;
> +
> +		cpus_and(mask, cpu_online_map, next_task->cpus_allowed);
> +

Gag! I definitely sent you an old patch. I fixed this a while ago :-(

-- Steve

>  		/*
>  		 * Scan each rq for the lowest prio.
>  		 */
> -		for_each_cpu_mask(cpu, next_task->cpus_allowed) {
> +		for_each_cpu_mask(cpu, mask) {
>  			struct rq *rq = &per_cpu(runqueues, cpu);
>
>  			if (cpu == smp_processor_id())
>


* Re: [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements
  2007-10-12 11:47   ` Steven Rostedt
@ 2007-10-12 13:20     ` Gregory Haskins
  0 siblings, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-12 13:20 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Peter Zijlstra, linux-rt-users, LKML

On Fri, 2007-10-12 at 07:47 -0400, Steven Rostedt wrote:
> --
> 
> On Fri, 12 Oct 2007, Peter Zijlstra wrote:
> 
> >
> > And for that, steve's rq->curr_prio field seems quite suitable.
> >
> > so instead of the:
> >   for (3 tries)
> >     find lowest cpu
> >     try push
> >
> > we do:
> >
> >   cpu_hotplug_lock();
> >   cpus_and(mask, p->cpus_allowed, online_cpus);
> >   for_each_cpu_mask(i, mask) {
> >     if (cpu_rq(i)->curr_prio > p->prio && push_task(p, i))
> >       break;
> >   }
> >   cpu_hotplug_unlock();
> 

IMO we should try to logically separate the "search" and "push"
functionality (for instance, see my series).  The search is useful
outside the "push" operation for cases where we are waking a
lower-priority task that doesn't preempt anything on the current RQ.  In
that case, we can try to wake it directly on the optimal queue instead of
waking locally and then pushing it away.

> The thing I'm worried about is that we pushed off a rt task that is higher
> in prio than another rt task on another cpu, and we'll cause a bunch of
> rt task bouncing.  That is what I'm trying to avoid.

I agree, though I think we will achieve that once the final 3rd part of
the algorithm is implemented.  Stay tuned for an update in that area.





* Re: [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements
  2007-10-12 10:29 ` [PATCH 0/7] RT: (RFC) RT-Overload/Sched enhancements Peter Zijlstra
  2007-10-12 11:47   ` Steven Rostedt
@ 2007-10-12 14:05   ` Gregory Haskins
  1 sibling, 0 replies; 13+ messages in thread
From: Gregory Haskins @ 2007-10-12 14:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Steven Rostedt, linux-rt-users, LKML

On Fri, 2007-10-12 at 12:29 +0200, Peter Zijlstra wrote:

> I'm wondering why we need the cpu prio management stuff.

I know we covered most of this on IRC, but let me recap so everyone can
follow the thread:

1) The cpupri alg is just one search alg vs. the other.  I think we are
all in agreement that it is not clear whether one has an advantage over
the other without some empirical data, so that aspect still remains to be
investigated.  For now, either can be interchanged, since the main
functionality merely uses the search; it does not depend on the search alg
itself.

2) The patch series makes some innovations above the current state of
the push-rt patch, which I will try to summarize here for consideration:

	A) CPU priority should be updated on PI (priority inheritance) changes
as well as at ctx-switch

	B) The search algorithm in the cpupri alg employs a priority ordering
amongst eligible CPUs: last-run (for cache-affinity), then this_cpu (for
lower-overhead preemption), and finally any other cpu in the
cpus_allowed.  It would be ideal to see the other alg provide a similar
ordering.

	C) The search and pusher functions are separated.  Search is useful in
circumstances outside the push_rt_task functions (see (D))

	D) The primary patch addresses one case where we need to redistribute
(high-pri preemption).  There are 3 in total.  The series adds support
for a second case (low-pri RT wakeup).

	E) We push until equilibrium instead of just a single task.

3) Having a distinct cpu-priority layer (regardless of search-arg) will
have (IMO) interesting potential going forward.  For instance, we could
have an optional notifier that gets kicked whenever we change
priority-class.  This would allow for some interesting RT related
enhancements by allowing system-level components to register for
priority changes.  For example: APIC TPR, or KVM hypercalls (for RT
guests, on an RT host).  This is more theoretical and half-baked at this
point, but it was something I have been kicking around.


That's all I can think of for now.

Regards,
-Greg


