* [PATCH 0/2] sched: refreshes
@ 2008-04-21 18:10 Gregory Haskins
  2008-04-21 18:10 ` [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added Gregory Haskins
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-04-21 18:10 UTC (permalink / raw)
  To: mingo
  Cc: suresh.b.siddha, rostedt, chinang.ma, arjan, willy, ghaskins,
	linux-kernel, linux-rt-users

Hi Ingo,
  Here is a refresh of the two patches you asked for:

#1 is "v3" of the "pushed" patch which fixes the compilation issue on
 !CONFIG_SMP

#2 is a resend of the 2-d bitmap, which includes a fix for that typo I
 mentioned.

Regards,
-Greg


* [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-21 18:10 [PATCH 0/2] sched: refreshes Gregory Haskins
@ 2008-04-21 18:10 ` Gregory Haskins
  2008-04-22 15:30   ` Dmitry Adamushko
  2008-04-21 18:10 ` [PATCH 2/2] sched: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
  2008-04-21 19:33 ` [PATCH 0/2] sched: refreshes Ingo Molnar
  2 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-04-21 18:10 UTC (permalink / raw)
  To: mingo
  Cc: suresh.b.siddha, rostedt, chinang.ma, arjan, willy, ghaskins,
	linux-kernel, linux-rt-users

SCHED_RR tasks can context-switch many times without the run-queue having
changed.  Trying to push on every context switch is therefore largely wasted
effort: if a push failed the first time, it will likely fail on subsequent
attempts as well.  Instead, set a flag once we have successfully pushed away
as many tasks as possible, and only clear it when the run-queue gains new
tasks (effectively a run-queue "dirty bit").  If new tasks are added we
should try again.  If any remote run-queues lower their priority in the
meantime, they will try to pull from us (as they always do).
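
In sketch form (condensed from the hunks below), the flag's life cycle is:

	enqueue_task_rt():  rq->rt.pushed = 0;   /* runqueue changed; pushing may succeed again */

	push_rt_task():     if (!rq->rt.overloaded || rq->rt.pushed)
	                            return 0;    /* nothing has changed since the last attempt */

	push_rt_tasks():    while (push_rt_task(rq))
	                            ;
	                    rq->rt.pushed = 1;   /* quiesce until the next enqueue */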

This attempts to address a performance regression reported by Suresh Siddha,
et al. in the 2.6.25 series.  It applies to 2.6.25.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: suresh.b.siddha@intel.com
CC: mingo@elte.hu
CC: rostedt@goodmis.org
CC: chinang.ma@intel.com
CC: arjan@linux.intel.com
CC: willy@linux.intel.com
---

 kernel/sched.c    |    2 ++
 kernel/sched_rt.c |   24 ++++++++++++++++++++++--
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0067fc8..d859649 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -462,6 +462,7 @@ struct rt_rq {
 #ifdef CONFIG_SMP
 	unsigned long rt_nr_migratory;
 	int overloaded;
+	int pushed;
 #endif
 	int rt_throttled;
 	u64 rt_time;
@@ -8073,6 +8074,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 #ifdef CONFIG_SMP
 	rt_rq->rt_nr_migratory = 0;
 	rt_rq->overloaded = 0;
+	rt_rq->pushed = 0;
 #endif
 
 	rt_rq->rt_time = 0;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index c2730a5..3cc98be 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -515,6 +515,15 @@ static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 		enqueue_rt_entity(rt_se);
 
 	inc_cpu_load(rq, p->se.load.weight);
+
+#ifdef CONFIG_SMP
+	/*
+	 * Clear the "pushed" state since we have changed the run-queue and
+	 * may need to migrate some of these tasks (see push_rt_tasks() for
+	 * details)
+	 */
+	rq->rt.pushed = 0;
+#endif
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
@@ -913,7 +922,7 @@ static int push_rt_task(struct rq *rq)
 	int ret = 0;
 	int paranoid = RT_MAX_TRIES;
 
-	if (!rq->rt.overloaded)
+	if (!rq->rt.overloaded || rq->rt.pushed)
 		return 0;
 
 	next_task = pick_next_highest_task_rt(rq, -1);
@@ -973,6 +982,15 @@ out:
 }
 
 /*
+ * push_rt_tasks()
+ *
+ * Push as many tasks away as possible.  We only need to try once whenever
+ * one or more new tasks are added to our runqueue.  After an initial attempt,
+ * further changes in remote runqueue state will be accounted for with pull
+ * operations.  Therefore, we mark the run-queue as "pushed" here, and clear it
+ * during enqueue.  This primarily helps SCHED_RR tasks which will tend to
+ * have a higher context-switch to enqueue ratio.
+ *
  * TODO: Currently we just use the second highest prio task on
  *       the queue, and stop when it can't migrate (or there's
  *       no more RT tasks).  There may be a case where a lower
@@ -987,6 +1005,8 @@ static void push_rt_tasks(struct rq *rq)
 	/* push_rt_task will return true if it moved an RT */
 	while (push_rt_task(rq))
 		;
+
+	rq->rt.pushed = 1;
 }
 
 static int pull_rt_task(struct rq *this_rq)
@@ -1091,7 +1111,7 @@ static void post_schedule_rt(struct rq *rq)
 	 * the lock was owned by prev, we need to release it
 	 * first via finish_lock_switch and then reaquire it here.
 	 */
-	if (unlikely(rq->rt.overloaded)) {
+	if (unlikely(rq->rt.overloaded && !rq->rt.pushed)) {
 		spin_lock_irq(&rq->lock);
 		push_rt_tasks(rq);
 		spin_unlock_irq(&rq->lock);



* [PATCH 2/2] sched: Use a 2-d bitmap for searching lowest-pri CPU
  2008-04-21 18:10 [PATCH 0/2] sched: refreshes Gregory Haskins
  2008-04-21 18:10 ` [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added Gregory Haskins
@ 2008-04-21 18:10 ` Gregory Haskins
  2008-04-21 19:33 ` [PATCH 0/2] sched: refreshes Ingo Molnar
  2 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-04-21 18:10 UTC (permalink / raw)
  To: mingo
  Cc: suresh.b.siddha, rostedt, chinang.ma, arjan, willy, ghaskins,
	linux-kernel, linux-rt-users

The current code uses a linear algorithm which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.
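
The heart of the new lookup, condensed from cpupri_find() in the patch
below, is a bit-search over priority classes followed by a cpumask AND:

	for_each_cpupri_active(cp->pri_active, idx) {	/* 1st dimension: lowest
							 * occupied prio class  */
		if (idx >= task_pri)
			break;				/* no class below p's prio */

		cpus_and(mask, p->cpus_allowed, cp->pri_to_cpu[idx].mask);
						/* 2nd dimension: cpus in that class */
		if (!cpus_empty(mask)) {
			*lowest_mask = mask;
			return 1;			/* candidate cpus found */
		}
	}
	return 0;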

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/Makefile       |    1 
 kernel/sched.c        |    7 ++
 kernel/sched_cpupri.c |  174 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_cpupri.h |   36 ++++++++++
 kernel/sched_rt.c     |   98 ++++++----------------------
 5 files changed, 239 insertions(+), 77 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index 7071d44..8c55d9e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_MARKERS) += marker.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_FTRACE) += trace/
 obj-$(CONFIG_TRACING) += trace/
+obj-$(CONFIG_SMP) += sched_cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index d859649..5899c29 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,8 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -501,6 +503,9 @@ struct root_domain {
 	 */
 	cpumask_t rto_mask;
 	atomic_t rto_count;
+#ifdef CONFIG_SMP
+	struct cpupri cpupri;
+#endif
 };
 
 /*
@@ -6923,6 +6928,8 @@ static void init_rootdomain(struct root_domain *rd)
 
 	cpus_clear(rd->span);
 	cpus_clear(rd->online);
+
+	cpupri_init(&rd->cpupri);
 }
 
 static void init_defrootdomain(void)
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 0000000..52154fe
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,174 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007-2008 Novell
+ *
+ *  Author: Gregory Haskins <ghaskins@novell.com>
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ *                 (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include "sched_cpupri.h"
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+	int cpupri;
+
+	if (prio == CPUPRI_INVALID)
+		cpupri = CPUPRI_INVALID;
+	else if (prio == MAX_PRIO)
+		cpupri = CPUPRI_IDLE;
+	else if (prio >= MAX_RT_PRIO)
+		cpupri = CPUPRI_NORMAL;
+	else
+		cpupri = MAX_RT_PRIO - prio + 1;
+
+	return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)                    \
+  for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);     \
+       idx < CPUPRI_NR_PRIORITIES;                            \
+       idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cp: The cpupri context
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invocation.  By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times.  While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)bool - CPUs were found
+ */
+int cpupri_find(struct cpupri *cp, struct task_struct *p,
+		cpumask_t *lowest_mask)
+{
+	int                  idx      = 0;
+	int                  task_pri = convert_prio(p->prio);
+
+	for_each_cpupri_active(cp->pri_active, idx) {
+		struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
+		cpumask_t mask;
+
+		if (idx >= task_pri)
+			break;
+
+		cpus_and(mask, p->cpus_allowed, vec->mask);
+
+		if (cpus_empty(mask))
+			continue;
+
+		*lowest_mask = mask;
+		return 1;
+	}
+
+	return 0;
+}
+
+/**
+ * cpupri_set - update the cpu priority setting
+ * @cp: The cpupri context
+ * @cpu: The target cpu
+ * @pri: The priority (INVALID-RT99) to assign to this CPU
+ *
+ * Note: Assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpupri_set(struct cpupri *cp, int cpu, int newpri)
+{
+	int                 *currpri = &cp->cpu_to_pri[cpu];
+	int                  oldpri  = *currpri;
+	unsigned long        flags;
+
+	newpri = convert_prio(newpri);
+
+	BUG_ON(newpri >= CPUPRI_NR_PRIORITIES);
+
+	if (newpri == oldpri)
+		return;
+
+	/*
+	 * If the cpu was currently mapped to a different value, we
+	 * first need to unmap the old value
+	 */
+	if (likely(oldpri != CPUPRI_INVALID)) {
+		struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];
+
+		spin_lock_irqsave(&vec->lock, flags);
+
+		vec->count--;
+		if (!vec->count)
+			clear_bit(oldpri, cp->pri_active);
+		cpu_clear(cpu, vec->mask);
+
+		spin_unlock_irqrestore(&vec->lock, flags);
+	}
+
+	if (likely(newpri != CPUPRI_INVALID)) {
+		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
+
+		spin_lock_irqsave(&vec->lock, flags);
+
+		cpu_set(cpu, vec->mask);
+		vec->count++;
+		if (vec->count == 1)
+			set_bit(newpri, cp->pri_active);
+
+		spin_unlock_irqrestore(&vec->lock, flags);
+	}
+
+	*currpri = newpri;
+}
+
+/**
+ * cpupri_init - initialize the cpupri structure
+ * @cp: The cpupri context
+ *
+ * Returns: (void)
+ */
+void cpupri_init(struct cpupri *cp)
+{
+	int i;
+
+	memset(cp, 0, sizeof(*cp));
+
+	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+		spin_lock_init(&vec->lock);
+		vec->count = 0;
+		cpus_clear(vec->mask);
+	}
+
+	for_each_possible_cpu(i)
+		cp->cpu_to_pri[i] = CPUPRI_INVALID;
+}
+
+
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
new file mode 100644
index 0000000..0b6a3d1
--- /dev/null
+++ b/kernel/sched_cpupri.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CPUPRI_H
+#define _LINUX_CPUPRI_H
+
+#include <linux/sched.h>
+
+#define CPUPRI_NR_PRIORITIES (MAX_RT_PRIO + 2)
+#define CPUPRI_NR_PRI_WORDS  BITS_TO_LONGS(CPUPRI_NR_PRIORITIES)
+
+#define CPUPRI_INVALID -1
+#define CPUPRI_IDLE     0
+#define CPUPRI_NORMAL   1
+/* values 2-101 are RT priorities 0-99 */
+
+struct cpupri_vec {
+	spinlock_t lock;
+	int        count;
+	cpumask_t  mask;
+};
+
+struct cpupri {
+	struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
+	long              pri_active[CPUPRI_NR_PRI_WORDS];
+	int               cpu_to_pri[NR_CPUS];
+};
+
+#ifdef CONFIG_SMP
+int  cpupri_find(struct cpupri *cp,
+		 struct task_struct *p, cpumask_t *lowest_mask);
+void cpupri_set(struct cpupri *cp, int cpu, int pri);
+void cpupri_init(struct cpupri *cp);
+#else
+#define cpupri_set(cp, cpu, pri) do { } while (0)
+#define cpupri_init(cp) do { } while (0)
+#endif
+
+#endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 3cc98be..79e5ad5 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -391,8 +391,11 @@ void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	WARN_ON(!rt_prio(rt_se_prio(rt_se)));
 	rt_rq->rt_nr_running++;
 #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
-	if (rt_se_prio(rt_se) < rt_rq->highest_prio)
+	if (rt_se_prio(rt_se) < rt_rq->highest_prio) {
+		struct rq *rq = rq_of_rt_rq(rt_rq);
 		rt_rq->highest_prio = rt_se_prio(rt_se);
+		cpupri_set(&rq->rd->cpupri, rq->cpu, rt_se_prio(rt_se));
+	}
 #endif
 #ifdef CONFIG_SMP
 	if (rt_se->nr_cpus_allowed > 1) {
@@ -416,6 +419,10 @@ void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 static inline
 void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
+#ifdef CONFIG_SMP
+	int highest_prio = rt_rq->highest_prio;
+#endif
+
 	WARN_ON(!rt_prio(rt_se_prio(rt_se)));
 	WARN_ON(!rt_rq->rt_nr_running);
 	rt_rq->rt_nr_running--;
@@ -439,6 +446,11 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 		rq->rt.rt_nr_migratory--;
 	}
 
+	if (rt_rq->highest_prio != highest_prio) {
+		struct rq *rq = rq_of_rt_rq(rt_rq);
+		cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio);
+	}
+
 	update_rt_migration(rq_of_rt_rq(rt_rq));
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -723,73 +735,6 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu)
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
 
-static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
-{
-	int       lowest_prio = -1;
-	int       lowest_cpu  = -1;
-	int       count       = 0;
-	int       cpu;
-
-	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
-
-	/*
-	 * Scan each rq for the lowest prio.
-	 */
-	for_each_cpu_mask(cpu, *lowest_mask) {
-		struct rq *rq = cpu_rq(cpu);
-
-		/* We look for lowest RT prio or non-rt CPU */
-		if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-			/*
-			 * if we already found a low RT queue
-			 * and now we found this non-rt queue
-			 * clear the mask and set our bit.
-			 * Otherwise just return the queue as is
-			 * and the count==1 will cause the algorithm
-			 * to use the first bit found.
-			 */
-			if (lowest_cpu != -1) {
-				cpus_clear(*lowest_mask);
-				cpu_set(rq->cpu, *lowest_mask);
-			}
-			return 1;
-		}
-
-		/* no locking for now */
-		if ((rq->rt.highest_prio > task->prio)
-		    && (rq->rt.highest_prio >= lowest_prio)) {
-			if (rq->rt.highest_prio > lowest_prio) {
-				/* new low - clear old data */
-				lowest_prio = rq->rt.highest_prio;
-				lowest_cpu = cpu;
-				count = 0;
-			}
-			count++;
-		} else
-			cpu_clear(cpu, *lowest_mask);
-	}
-
-	/*
-	 * Clear out all the set bits that represent
-	 * runqueues that were of higher prio than
-	 * the lowest_prio.
-	 */
-	if (lowest_cpu > 0) {
-		/*
-		 * Perhaps we could add another cpumask op to
-		 * zero out bits. Like cpu_zero_bits(cpumask, nrbits);
-		 * Then that could be optimized to use memset and such.
-		 */
-		for_each_cpu_mask(cpu, *lowest_mask) {
-			if (cpu >= lowest_cpu)
-				break;
-			cpu_clear(cpu, *lowest_mask);
-		}
-	}
-
-	return count;
-}
-
 static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
 {
 	int first;
@@ -811,17 +756,12 @@ static int find_lowest_rq(struct task_struct *task)
 	cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
 	int this_cpu = smp_processor_id();
 	int cpu      = task_cpu(task);
-	int count    = find_lowest_cpus(task, lowest_mask);
 
-	if (!count)
-		return -1; /* No targets found */
+	if (task->rt.nr_cpus_allowed == 1)
+		return -1; /* No other targets possible */
 
-	/*
-	 * There is no sense in performing an optimal search if only one
-	 * target is found.
-	 */
-	if (count == 1)
-		return first_cpu(*lowest_mask);
+	if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
+		return -1; /* No targets found */
 
 	/*
 	 * At this point we have built a mask of cpus representing the
@@ -1178,6 +1118,8 @@ static void join_domain_rt(struct rq *rq)
 {
 	if (rq->rt.overloaded)
 		rt_set_overload(rq);
+
+	cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio);
 }
 
 /* Assumes rq->lock is held */
@@ -1185,6 +1127,8 @@ static void leave_domain_rt(struct rq *rq)
 {
 	if (rq->rt.overloaded)
 		rt_clear_overload(rq);
+
+	cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
 }
 
 /*



* Re: [PATCH 0/2] sched: refreshes
  2008-04-21 18:10 [PATCH 0/2] sched: refreshes Gregory Haskins
  2008-04-21 18:10 ` [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added Gregory Haskins
  2008-04-21 18:10 ` [PATCH 2/2] sched: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
@ 2008-04-21 19:33 ` Ingo Molnar
  2 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2008-04-21 19:33 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: suresh.b.siddha, rostedt, chinang.ma, arjan, willy, linux-kernel,
	linux-rt-users


* Gregory Haskins <ghaskins@novell.com> wrote:

> Hi Ingo,
>   Here is a refresh of the two patches you asked for:
> 
> #1 is "v3" of the "pushed" patch which fixes the compilation issue on
>  !CONFIG_SMP
> 
> #2 is a resend of the 2-d bitmap, which includes a fix for that typo I
>  mentioned.

thanks Gregory, applied. We'll see whether they'll make v2.6.26. (but 
given that they existed for a longer time i'd lean towards including 
them - barring some unexpected test failure.)

	Ingo


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-21 18:10 ` [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added Gregory Haskins
@ 2008-04-22 15:30   ` Dmitry Adamushko
  2008-04-22 15:51     ` Steven Rostedt
  2008-04-22 16:38     ` Gregory Haskins
  0 siblings, 2 replies; 24+ messages in thread
From: Dmitry Adamushko @ 2008-04-22 15:30 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, suresh.b.siddha, rostedt, chinang.ma, arjan, willy,
	linux-kernel, linux-rt-users

Hi Gregory,


consider the following 2-cpu system: cpu0 and cpu1.

cpu0: is idle --> in such a state, it never pulls RT tasks on its own.

T0 and T1 are RT tasks


square#0:

cpu1:  T0 is running

T1 is of the same prio as T0 (shouldn't really matter but to get the
same result it would require altering the flow of events slightly)

T1's affinity allows it to be run only on cpu1.
T0 can run on both.

try_to_wake_up() is called for T1.
|
--> select_task_rq_rt() => gives cpu1
|
--> task_wake_up_rt()
   |
   ---> push_rt_tasks() -> rq->rt.pushed = 1

now, neither T1 (due to its affinity), nor T0 (it's running) can be
pushed away to cpu0.

[ btw., (1) I'd expect that this task_wake_up_rt() thing should be
redundant, logically-wise... I'll check once more and comment later
on.
(2) any example when (p->prio >= rq->rt.highest_prio) is not true in
task_wake_up_rt() ?
]

as a result, rq->rt.pushed == 1.

Now, post_schedule_rt() won't call push_rt_tasks().

T0 and T1 are both running for some time on cpu1 (possibly
context-switching if they are both of SCHED_RR type).

Then they both block, _first_ T1 and then T0.

After some interval of time, they wake up (let's say they are
periodic) in the following order: _first_ T0 and then T1.

rq->rt.pushed becomes 0 and here we are back to square#0. The whole
story repeats again.

cpu0 is idle so it won't pull T0. Both T0 and T1 are competing for the
same cpu. Not good.

am I missing smth?


-- 
Best regards,
Dmitry Adamushko


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-22 15:30   ` Dmitry Adamushko
@ 2008-04-22 15:51     ` Steven Rostedt
  2008-04-23  8:05       ` Dmitry Adamushko
  2008-04-22 16:38     ` Gregory Haskins
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-04-22 15:51 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Gregory Haskins, mingo, suresh.b.siddha, chinang.ma, arjan, willy,
	linux-kernel, linux-rt-users


Hi Dmitry,

Did you see this behaviour or is this an analysis of the code?

On Tue, 22 Apr 2008, Dmitry Adamushko wrote:
>
> consider the following 2-cpu system: cpu0 and cpu1.
>
> cpu0: is idle --> in such a state, it never pulls RT tasks on its own.
>
> T0 and T1 are RT tasks

This is the important part. How did cpu0 get to "is idle --> in such a
state, it never pulls RT tasks on its own"?  And how did cpu1 end up with
both T0 and T1 running?

Whenever a cpu goes idle, it should definitely try to do an RT pull
before going idle. When a

>
>
> square#0:
>
> cpu1:  T0 is running
>
> T1 is of the same prio as T0 (shouldn't really matter but to get the
> same result it would require altering the flow of events slightly)
>
> T1's affinity allows it to be run only on cpu1.
> T0 can run on both.
>
> try_to_wake_up() is called for T1.
> |
> --> select_task_rq_rt() => gives cpu1
> |
> --> task_wake_up_rt()
>    |
>    ---> push_rt_tasks() -> rq->rt.pushed = 1
>
> now, neither T1 (due to its affinity), nor T0 (it's running) can be
> pushed away to cpu0.

Ah, this may be what you are talking about. T0 was running, but because
T1 has its affinity set to cpu1 it won't cause a push. When T0 schedules
away to give T1 its cpu time, T0 won't be pushed away because of the pushed
flag.

Hmm, interesting. Of course my response is "Don't use SCHED_RR! It's
evil!" ;-)

Perhaps we need logic to reset the push flag when a SCHED_RR task is
scheduled out to give way to another SCHED_RR task. That should fix this
scenario. Unless I'm missing something else.
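
Something roughly like this, maybe (untested, just to illustrate the idea;
it assumes the "pushed" flag from your patch 1/2 is in place):

	/*
	 * In the SCHED_RR slice-expiry path, when the task is rotated to the
	 * tail of its priority queue, re-arm the push logic so that the next
	 * post_schedule pass reconsiders pushing:
	 */
	requeue_task_rt(rq, p);
	rq->rt.pushed = 0;		/* runqueue order changed */
	set_tsk_need_resched(p);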

-- Steve

> [ btw., (1) I'd expect that this task_wake_up_rt() thing should be
> redundant, logically-wise... I'll check once more and comment later
> on.
> (2) any example when (p->prio >= rq->rt.highest_prio) is not true in
> task_wake_up_rt() ?
> ]
>
> as a result, rq->rt.pushed == 1.
>
> Now, post_schedule_rt() won't call push_rt_tasks().
>
> T0 and T1 are both running for some time on cpu1 (possibly
> context-switching if they are both of SCHED_RR type).
>
> Then they both block, _first_ T1 and then T0.
>
> After some interval of time, they wake up (let's say they are
> periodic) in the following order: _first_ T0 and then T1.
>
> rq->rt.pushed becomes 0 and here we are back to square#0. The whole
> story repeats again.
>
> cpu0 is idle so it won't pull T0. Both T0 and T1 are competing for the
> same cpu. Not good.
>
> am I missing smth?
>
>
> --
> Best regards,
> Dmitry Adamushko
>


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-22 15:30   ` Dmitry Adamushko
  2008-04-22 15:51     ` Steven Rostedt
@ 2008-04-22 16:38     ` Gregory Haskins
  2008-04-23 11:13       ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Gregory Haskins
  1 sibling, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-04-22 16:38 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: mingo, rostedt, chinang.ma, suresh.b.siddha, arjan, willy,
	linux-kernel, linux-rt-users

Hi Dmitry,

(Disclaimer: I am sick with a fever today, so hopefully I'm grokking your email properly and not about to say something stupid ;)

>>> On Tue, Apr 22, 2008 at 11:30 AM, in message
<b647ffbd0804220830h6524e788n1467b027bc5bc4d2@mail.gmail.com>, "Dmitry
Adamushko" <dmitry.adamushko@gmail.com> wrote: 
> Hi Gregory,
> 
> 
> consider the following 2-cpu system: cpu0 and cpu1.
> 
> cpu0: is idle --> in such a state, it never pulls RT tasks on its own.
> 
> T0 and T1 are RT tasks
> 
> 
> square#0:
> 
> cpu1:  T0 is running
> 
> T1 is of the same prio as T0 (shouldn't really matter but to get the
> same result it would require altering the flow of events slightly)
> 
> T1's affinity allows it to be run only on cpu1.
> T0 can run on both.
> 
> try_to_wake_up() is called for T1.
> |
> --> select_task_rq_rt() => gives cpu1
> |
> --> task_wake_up_rt()
>    |
>    ---> push_rt_tasks() -> rq->rt.pushed = 1
> 
> now, neither T1 (due to its affinity), nor T0 (it's running) can be
> pushed away to cpu0.
> 
> [ btw., (1) I'd expect that this task_wake_up_rt() thing should be
> redundant, logically-wise... I'll check once more and comment later
> on.

They are both necessary, but the key is that select_task_rq() is a best-effort route attempt, whereas the task_wake_up() routine is the authoritative router.  By doing the push after activation, it allowed us to utilize a very clever and significant optimization on the pull side that Steven came up with.  The details of the optimization escape me now, but I do remember it was substantial to the design.  Later we put the select_task_rq() logic in (see git-id 318e0893) to further optimize the routing by finding a likely good home before the activation takes place (saving an activation/deactivation cycle), but it still needs the post-router to protect against race conditions since it's just best-effort.
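
In other words, reusing your diagram (roughly):

  try_to_wake_up()
  |
  --> select_task_rq_rt()    /* best-effort: guess a good home for p */
  |
  --> activate/enqueue p on the chosen rq
  |
  --> task_wake_up_rt()
      |
      ---> push_rt_tasks()   /* authoritative: now that p is actually queued,
                                fix up anything the guess got wrong */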

> (2) any example when (p->prio >= rq->rt.highest_prio) is not true in
> task_wake_up_rt() ?

Hmm... good catch.  Looks like it should be "p->prio >= rq->curr->prio", since we only need to be concerned with pushing here if the task is not going to preempt current.  Do you agree, Steven, or am I missing something?
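
i.e. roughly (sketch only, keeping the rest of the condition as it stands today):

	static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
	{
		if (!task_running(rq, p) &&
		    (p->prio >= rq->curr->prio) &&
		    rq->rt.overloaded)
			push_rt_tasks(rq);
	}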


> ]
> 
> as a result, rq->rt.pushed == 1.
> 
> Now, post_schedule_rt() won't call push_rt_tasks().
> 
> T0 and T1 are both running for some time on cpu1 (possibly
> context-switching if they are both of SCHED_RR type).
> 
> Then they both block, _first_ T1 and then T0.
> 
> After some interval of time, they wake up (let's say they are
> periodic) in the following order: _first_ T0 and then T1.
> 
> rq->rt.pushed becomes 0 and here we are back to square#0. The whole
> story repeats again.
> 
> cpu0 is idle so it won't pull T0. Both T0 and T1 are competing for the
> same cpu. Not good.
> 
> am I missing smth?

No, I think you are indeed correct.  However, I would consider the root cause of the problem to have existed prior to the "pushed" flag, so perhaps we need to address this at a different level.  The case you present would always have been problematic for FIFO, and would have "worked" for RR eventually prior to the "pushed" patch.  But I don't know if I like relying on how it worked before to fix up the system.  At the very best, T1 would have experienced a latency equal to the remainder of T0's timeslice.

Rather, I think we need to address the preemptive behavior for the case where a migratable task is on the cpu and a non-migratable task tries to wake up.  If they are equal in numerical priority, perhaps we need to treat "non-migratable" as the tie breaker.  In this case, T1 would preempt T0 on cpu1, and then we would push T0 to cpu0.  I haven't quite thought through all the details of how this would work yet.  Perhaps I should wait until my fever lifts. ;)  Thoughts?
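
Very roughly, the tie-break might look something like this in
check_preempt_curr_rt() (completely untested, and glossing over how we then
pick a destination for T0):

	if (p->prio == rq->curr->prio &&
	    p->rt.nr_cpus_allowed == 1 &&
	    rq->curr->rt.nr_cpus_allowed > 1)
		/*
		 * Equal prio, but the waker is pinned while current is not:
		 * reschedule so the push logic can try to move current away.
		 */
		resched_task(rq->curr);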

-Greg


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-22 15:51     ` Steven Rostedt
@ 2008-04-23  8:05       ` Dmitry Adamushko
  2008-04-23  9:53         ` Dmitry Adamushko
  0 siblings, 1 reply; 24+ messages in thread
From: Dmitry Adamushko @ 2008-04-23  8:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, mingo, suresh.b.siddha, chinang.ma, arjan, willy,
	linux-kernel, linux-rt-users

Hi Steven,

> [ ... ]
>  > square#0:
>  >
>  > cpu1:  T0 is running
>  >
>  > T1 is of the same prio as T0 (shouldn't really matter but to get the
>  > same result it would require altering the flow of events slightly)
>  >
>  > T1's affinity allows it to be run only on cpu1.
>  > T0 can run on both.
>  >
>  > try_to_wake_up() is called for T1.
>  > |
>  > --> select_task_rq_rt() => gives cpu1
>  > |
>  > --> task_wake_up_rt()
>  >    |
>  >    ---> push_rt_tasks() -> rq->rt.pushed = 1
>  >
>  > now, neither T1 (due to its affinity), nor T0 (it's running) can be
>  > pushed away to cpu0.
>
>  Ah, this may be what you are talking about. T0 was running, but because
>  T1 has its affinity set to cpu1 it wont cause a push. When T0 schedules
>  away to give T1 its cpu time, T0 wont push away because of the pushed
>  flag.
>
>  Hmm, interesting. Of course my response is "Don't use SCHED_RR! It's
>  evil!" ;-)

It's not just SCHED_RR ;-) They both can be of SCHED_FIFO.

T1 _preempts_ T0 and again

 --> task_wake_up_rt()
 |
 ---> push_rt_tasks() -> rq->rt.pushed = 1

and T0 won't be pushed away to cpu0 by post_schedule_rt().

As Gregory has pointed out, at the very least the test in
task_wake_up_rt() is wrong.

push_rt_tasks() should not be called when 'p' (a newly woken up task)
is the next one to run.

IOW, it should be (p->prio < rq->curr->prio) instead of (p->prio >=
rq->rt.highest_prio).


-- 
Best regards,
Dmitry Adamushko


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-23  8:05       ` Dmitry Adamushko
@ 2008-04-23  9:53         ` Dmitry Adamushko
  2008-04-23 10:03           ` Gregory Haskins
  0 siblings, 1 reply; 24+ messages in thread
From: Dmitry Adamushko @ 2008-04-23  9:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, mingo, suresh.b.siddha, chinang.ma, arjan, willy,
	linux-kernel, linux-rt-users

2008/4/23 Dmitry Adamushko <dmitry.adamushko@gmail.com>:
> Hi Steven,
>
>  > [ ... ]
>
> >  > square#0:
>  >  >
>  >  > cpu1:  T0 is running
>  >  >
>  >  > T1 is of the same prio as T0 (shouldn't really matter but to get the
>  >  > same result it would require altering the flow of events slightly)
>  >  >
>  >  > T1's affinity allows it to be run only on cpu1.
>  >  > T0 can run on both.
>  >  >
>  >  > try_to_wake_up() is called for T1.
>  >  > |
>  >  > --> select_task_rq_rt() => gives cpu1
>  >  > |
>  >  > --> task_wake_up_rt()
>  >  >    |
>  >  >    ---> push_rt_tasks() -> rq->rt.pushed = 1
>  >  >
>  >  > now, neither T1 (due to its affinity), nor T0 (it's running) can be
>  >  > pushed away to cpu0.
>  >
>  >  Ah, this may be what you are talking about. T0 was running, but because
>  >  T1 has its affinity set to cpu1 it wont cause a push. When T0 schedules
>  >  away to give T1 its cpu time, T0 wont push away because of the pushed
>  >  flag.
>  >
>  >  Hmm, interesting. Of course my response is "Don't use SCHED_RR! It's
>  >  evil!" ;-)
>
>  It's not just SCHED_RR ;-) They both can be of SCHED_FIFO.
>
>  T1 _preempts_ T0 and again
>
>
>   --> task_wake_up_rt()
>   |
>   ---> push_rt_tasks() -> rq->rt.pushed = 1
>
>  and T0 won't be pushed away to cpu0 by post_schedule_rt().
>
>  As Gregory has pointed out, at the very least it's a test in
>  task_wake_up_rt() which is wrong.
>
>  push_rt_tasks() should not be called when 'p' (a newly woken up task)
>  is the next one to run.
>
>  IOW, it should be (p->prio < rq->curr->prio) instead of (p->prio >=
>  rq->rt.highest_prio).

No, this argument is wrong indeed.

Something like this:
(white-spaces are broken)

--- sched_rt-prev.c     2008-04-23 11:26:39.000000000 +0200
+++ sched_rt.c  2008-04-23 11:36:20.000000000 +0200
@@ -1121,9 +1121,13 @@ static void post_schedule_rt(struct rq *

 static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
 {
-       if (!task_running(rq, p) &&
-           (p->prio >= rq->rt.highest_prio) &&
-           rq->rt.overloaded)
+       /*
+        * Consider pushing 'p' off to other CPUS only
+        * if it's not the next task to run on this CPU.
+        */
+       if (rq->rt.overloaded &&
+           p->prio > rq->rt.highest_prio &&
+           pick_rt_task(rq, p, -1))
                push_rt_tasks(rq);
 }


or even this (although, it's a bit heavier)

--- sched_rt-prev.c     2008-04-23 11:26:39.000000000 +0200
+++ sched_rt.c  2008-04-23 11:49:03.000000000 +0200
@@ -1118,12 +1118,22 @@ static void post_schedule_rt(struct rq *
        }
 }

 static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
 {
-       if (!task_running(rq, p) &&
-           (p->prio >= rq->rt.highest_prio) &&
-           rq->rt.overloaded)
+       if (!rq->rt.overloaded)
+               return;
+
+       /*
+        * Consider pushing 'p' off to other CPUS only
+        * if it's not the next task to run on this CPU.
+        * i.e. it's not a single task with the highest prio
+        * on the queue.
+        */
+       if (p->prio == rq->rt.highest_prio &&
+           p->rt.run_list.prev == p->rt.run_list.next)
+               return;
+
+       if (pick_rt_task(rq, p, -1))
                push_rt_tasks(rq);
 }


-- 
Best regards,
Dmitry Adamushko


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-23  9:53         ` Dmitry Adamushko
@ 2008-04-23 10:03           ` Gregory Haskins
  2008-04-23 10:23             ` Dmitry Adamushko
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-04-23 10:03 UTC (permalink / raw)
  To: Dmitry Adamushko, Steven Rostedt
  Cc: mingo, chinang.ma, suresh.b.siddha, arjan, willy, linux-kernel,
	linux-rt-users

>>> On Wed, Apr 23, 2008 at  5:53 AM, in message
<b647ffbd0804230253i32f48fcgb5dc7cf5b55607ac@mail.gmail.com>, "Dmitry
Adamushko" <dmitry.adamushko@gmail.com> wrote: 
> 2008/4/23 Dmitry Adamushko <dmitry.adamushko@gmail.com>:
>> Hi Steven,
>>
>>  > [ ... ]
>>
>> >  > square#0:
>>  >  >
>>  >  > cpu1:  T0 is running
>>  >  >
>>  >  > T1 is of the same prio as T0 (shouldn't really matter but to get the
>>  >  > same result it would require altering the flow of events slightly)
>>  >  >
>>  >  > T1's affinity allows it to be run only on cpu1.
>>  >  > T0 can run on both.
>>  >  >
>>  >  > try_to_wake_up() is called for T1.
>>  >  > |
>>  >  > --> select_task_rq_rt() => gives cpu1
>>  >  > |
>>  >  > --> task_wake_up_rt()
>>  >  >    |
>>  >  >    ---> push_rt_tasks() -> rq->rt.pushed = 1
>>  >  >
>>  >  > now, neither T1 (due to its affinity), nor T0 (it's running) can be
>>  >  > pushed away to cpu0.
>>  >
>>  >  Ah, this may be what you are talking about. T0 was running, but because
>>  >  T1 has its affinity set to cpu1 it wont cause a push. When T0 schedules
>>  >  away to give T1 its cpu time, T0 wont push away because of the pushed
>>  >  flag.
>>  >
>>  >  Hmm, interesting. Of course my response is "Don't use SCHED_RR! It's
>>  >  evil!" ;-)
>>
>>  It's not just SCHED_RR ;-) They both can be of SCHED_FIFO.
>>
>>  T1 _preempts_ T0 and again
>>
>>
>>   --> task_wake_up_rt()
>>   |
>>   ---> push_rt_tasks() -> rq->rt.pushed = 1
>>
>>  and T0 won't be pushed away to cpu0 by post_schedule_rt().
>>
>>  As Gregory has pointed out, at the very least it's a test in
>>  task_wake_up_rt() which is wrong.
>>
>>  push_rt_tasks() should not be called when 'p' (a newly woken up task)
>>  is the next one to run.
>>
>>  IOW, it should be (p->prio < rq->curr->prio) instead of (p->prio >=
>>  rq->rt.highest_prio).
> 
> No, this argument is wrong indeed.
> 
> Something like this:
> (white-spaces are broken)
> 
> --- sched_rt-prev.c     2008-04-23 11:26:39.000000000 +0200
> +++ sched_rt.c  2008-04-23 11:36:20.000000000 +0200
> @@ -1121,9 +1121,13 @@ static void post_schedule_rt(struct rq *
> 
>  static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
>  {
> -       if (!task_running(rq, p) &&
> -           (p->prio >= rq->rt.highest_prio) &&
> -           rq->rt.overloaded)
> +       /*
> +        * Consider pushing 'p' off to other CPUS only
> +        * if it's not the next task to run on this CPU.
> +        */
> +       if (rq->rt.overloaded &&
> +           p->prio > rq->rt.highest_prio &&
> +           pick_rt_task(rq, p, -1))
>                 push_rt_tasks(rq);
>  }
> 
> 
> or even this (although, it's a bit heavier)
> 
> --- sched_rt-prev.c     2008-04-23 11:26:39.000000000 +0200
> +++ sched_rt.c  2008-04-23 11:49:03.000000000 +0200
> @@ -1118,12 +1118,22 @@ static void post_schedule_rt(struct rq *
>         }
>  }
> 
>  static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
>  {
> -       if (!task_running(rq, p) &&
> -           (p->prio >= rq->rt.highest_prio) &&
> -           rq->rt.overloaded)
> +       if (!rq->rt.overloaded)
> +               return;
> +
> +       /*
> +        * Consider pushing 'p' off to other CPUS only
> +        * if it's not the next task to run on this CPU.
> +        * i.e. it's not a single task with the highest prio
> +        * on the queue.
> +        */
> +       if (p->prio == rq->rt.highest_prio &&
> +           p->rt.run_list.prev == p->rt.run_list.next)
> +               return;
> +
> +       if (pick_rt_task(rq, p, -1))
>                 push_rt_tasks(rq);
>  }
> 


I think we can simplify this further.  We really only need to push here if we are not going to reschedule anytime soon (probably white-space damaged):


--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1058,11 +1058,14 @@ static void post_schedule_rt(struct rq *rq)
        }
 }


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-23 10:03           ` Gregory Haskins
@ 2008-04-23 10:23             ` Dmitry Adamushko
  2008-04-23 10:54               ` Gregory Haskins
  0 siblings, 1 reply; 24+ messages in thread
From: Dmitry Adamushko @ 2008-04-23 10:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Steven Rostedt, mingo, chinang.ma, suresh.b.siddha, arjan, willy,
	linux-kernel, linux-rt-users

2008/4/23 Gregory Haskins <ghaskins@novell.com>:
> [ ... ]
>
>  I think we can simplify this further.  We really only need to push here if we are not going to reschedule anytime soon (probably white-space damaged):
>
>
>  --- a/kernel/sched_rt.c
>
> +++ b/kernel/sched_rt.c
>  @@ -1058,11 +1058,14 @@ static void post_schedule_rt(struct rq *rq)
>         }
>   }
>
>  -
>  +/*
>  + * If we are not running and we are not going to reschedule soon, we should
>  + * try to push tasks away now
>  + */
>
>  static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
>   {
>         if (!task_running(rq, p) &&
>  -           (p->prio >= rq->rt.highest_prio) &&
>  +           !test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED) &&
>             rq->rt.overloaded)
>                 push_rt_tasks(rq);
>   }


It's somewhat suboptimal as it doesn't guarantee that 'p' gets control next.

e.g. 2 tasks (T0 and T1) have been woken up before an actual
re-schedule takes place. Even if T1 is of lower prio than T0,
task_wake_up_rt() will see the NEED_RESCHED flag and bail out while it
would make sense at this moment to push T1 off this cpu.


p.s. hope you are better today. get well! :-)


-- 
Best regards,
Dmitry Adamushko


* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-23 10:23             ` Dmitry Adamushko
@ 2008-04-23 10:54               ` Gregory Haskins
  2008-04-23 11:20                 ` Dmitry Adamushko
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-04-23 10:54 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: mingo, Steven Rostedt, chinang.ma, suresh.b.siddha, arjan, willy,
	linux-kernel, linux-rt-users

>>> On Wed, Apr 23, 2008 at  6:23 AM, in message
<b647ffbd0804230323s20d25697qcf80ca9996e143a7@mail.gmail.com>, "Dmitry
Adamushko" <dmitry.adamushko@gmail.com> wrote: 
> 2008/4/23 Gregory Haskins <ghaskins@novell.com>:
>> [ ... ]
>>
>>  I think we can simplify this further.  We really only need to push here if 
> we are not going to reschedule anytime soon (probably white-space damaged):
>>
>>
>>  --- a/kernel/sched_rt.c
>>
>> +++ b/kernel/sched_rt.c
>>  @@ -1058,11 +1058,14 @@ static void post_schedule_rt(struct rq *rq)
>>         }
>>   }
>>
>>  -
>>  +/*
>>  + * If we are not running and we are not going to reschedule soon, we 
> should
>>  + * try to push tasks away now
>>  + */
>>
>>  static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
>>   {
>>         if (!task_running(rq, p) &&
>>  -           (p->prio >= rq->rt.highest_prio) &&
>>  +           !test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED) &&
>>             rq->rt.overloaded)
>>                 push_rt_tasks(rq);
>>   }
> 
> 
> It's somewhat suboptimal as it doesn't guarantee that 'p' gets control next.

No, it doesn't, but I don't see that as a requirement for it to work properly.  All we really need is to make sure push_rt_tasks() will be executed in the near future, and a reschedule will do that.  Perhaps I am missing something?

> 
> e.g. 2 tasks (T0 and T1) have been woken up before an actual
> re-schedule takes place. Even if T1 is of lower prio than T0,
> task_wake_up_rt() will see the NEED_RESCHED flag and bail out while it
> would make sense at this moment to push T1 off this cpu.

I think this is "ok".  IMO, we really want to prioritize the push operation from post_schedule() over task_wake_up() because it is the most accurate (since the scheduling decision is atomic with respect to the rq->lock).  This way we avoid pushing the wrong task away before the schedule has a chance to run.  I think your algorithm probably approximates this same accuracy, albeit in a more expensive way.  The one thing I can say is an advantage of your approach is that it potentially migrates the task faster (though it shouldn't be by very much, since the resched is technically pending).  That advantage is further offset by the fact that you would have to subtract the extra overhead of running pick_next_task() et al. from the gain of avoiding the wait for the reschedule, which would erode some of the lower-latency argument.

Hmmm...  I'm not sure which I like better... simplicity is often nice, IMO, but I think either could work.

> 
> 
> p.s. hope you are better today. get well! :-)

Thanks!  I am feeling much better today, though I don't think I am fully out of the woods yet.

> 





* [RFC PATCH 0/2] sched fixes for suboptimal balancing
  2008-04-22 16:38     ` Gregory Haskins
@ 2008-04-23 11:13       ` Gregory Haskins
  2008-04-23 11:13         ` [PATCH 1/2] sched: fix RT task-wakeup logic Gregory Haskins
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-04-23 11:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, Dmitry Adamushko, Steven Rostedt, linux-kernel,
	linux-rt-users

Hi Ingo,

Here are my proposed fixes for the issue Dmitry has highlighted in:

http://lkml.org/lkml/2008/4/22/296

This is an RFC, and it applies to sched-devel.

Regards,
-Greg


* [PATCH 1/2] sched: fix RT task-wakeup logic
  2008-04-23 11:13       ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Gregory Haskins
@ 2008-04-23 11:13         ` Gregory Haskins
  2008-04-23 12:54           ` Steven Rostedt
                             ` (2 more replies)
  2008-04-23 11:13         ` [PATCH 2/2] sched: prioritize non-migratable tasks over migratable ones Gregory Haskins
  2008-04-28 18:55         ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Ingo Molnar
  2 siblings, 3 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-04-23 11:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, Dmitry Adamushko, Steven Rostedt, linux-kernel,
	linux-rt-users

Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where the
check will always evaluate to "true".  You can find the thread here:

http://lkml.org/lkml/2008/4/22/296

In reality, we only want to try to push tasks away when a wake-up request is
not going to preempt the current task.  So let's fix it.

Note: We introduce test_tsk_need_resched() instead of open-coding the flag
check so that the merge conflict with -rt will help remind us that we
may need to support NEEDS_RESCHED_DELAYED in the future, too.
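
Concretely, per the discussion in the thread above, the task_wake_up_rt()
test ends up looking roughly like:

	static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
	{
		if (!task_running(rq, p) &&
		    !test_tsk_need_resched(rq->curr) &&
		    rq->rt.overloaded)
			push_rt_tasks(rq);
	}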

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Steven Rostedt <rostedt@goodmis.org>
---

 include/linux/sched.h |    7 ++++++-
 kernel/sched_rt.c     |    7 +++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 949bd54..55d4bd4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1966,6 +1966,11 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
 	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
 }
 
+static inline int test_tsk_need_resched(struct task_struct *tsk)
+{
+	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
+}
+
 static inline int signal_pending(struct task_struct *p)
 {
 	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
@@ -1980,7 +1985,7 @@ static inline int fatal_signal_pending(struct task_struct *p)
 
 static inline int need_resched(void)
 {
-	return unlikely(test_thread_flag(TIF_NEED_RESCHED));
+	return unlikely(test_tsk_need_resched(current));
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 79e5ad5..ae6b0e6 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1058,11 +1058,14 @@ static void post_schedule_rt(struct rq *rq)
 	}
 }
 


* [PATCH 2/2] sched: prioritize non-migratable tasks over migratable ones
  2008-04-23 11:13       ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Gregory Haskins
  2008-04-23 11:13         ` [PATCH 1/2] sched: fix RT task-wakeup logic Gregory Haskins
@ 2008-04-23 11:13         ` Gregory Haskins
  2008-04-23 12:58           ` Steven Rostedt
  2008-04-28 18:55         ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Ingo Molnar
  2 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-04-23 11:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, Dmitry Adamushko, Steven Rostedt, linux-kernel,
	linux-rt-users

Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm
that could allow suboptimal balancing if a non-migratable task gets
queued behind a running migratable one.  It is discussed in this thread:

http://lkml.org/lkml/2008/4/22/296

This issue has been further exacerbated by a recent checkin to
sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3).

From a pure priority standpoint, the run-queue is doing the "right"
thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1
wakes up at equal or lower priority (affined only to cpu1) later, it
*should* wait for T0 to finish.  However, in reality that is likely
suboptimal from a system perspective if there are other cores that
could allow T0 and T1 to run concurrently.  Since T1 cannot migrate,
the only choice for higher concurrency is to try to move T0.  This is
not something we addressed in the recent rt-balancing re-work.

This patch tries to enhance the balancing algorithm by accommodating this
scenario.  It accomplishes this by incorporating the migratability of a
task into its priority calculation.  Within a numerical tsk->prio, a
non-migratable task is logically higher than a migratable one.  We
maintain this by introducing a new per-priority queue (xqueue, or
exclusive-queue) for holding non-migratable tasks.  The scheduler will
draw from the xqueue over the standard shared-queue (squeue) when
available.
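
Condensed to its essentials (the full diff is below), the queueing change
looks like:

	/* enqueue: pinned tasks go on the exclusive queue */
	if (rt_se->nr_cpus_allowed == 1)
		list_add_tail(&rt_se->run_list, array->xqueue + prio);
	else
		list_add_tail(&rt_se->run_list, array->squeue + prio);

	/* pick_next: within a priority level, prefer the exclusive queue */
	queue = array->xqueue + idx;
	if (list_empty(queue))
		queue = array->squeue + idx;
	next = list_entry(queue->next, struct sched_rt_entity, run_list);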

There are several details for utilizing this properly.

1) During task-wake-up, we not only need to check if the priority
   preempts the current task, but we also need to check for this
   non-migratable condition.  Therefore, if a non-migratable task wakes
   up and sees an equal priority migratable task already running, it
   will attempt to preempt it *if* there is a likelihood that the
   current task will find an immediate home.

2) Tasks only get this non-migratable "priority boost" on wake-up.  Any
   requeuing will result in the non-migratable task being queued to the
   end of the shared queue.  This is an attempt to prevent the system
   from being completely unfair to migratable tasks during things like
   SCHED_RR timeslicing.

I am sure this patch introduces potentially "odd" behavior if you
concoct a scenario where a bunch of non-migratable threads could starve
migratable ones given the right pattern.  I am not yet convinced that
this is a problem since we are talking about tasks of equal RT priority
anyway, and there is never much in the way of guarantees against
starvation in that scenario (e.g. you could come up with a
similar scenario with a specific timing environment versus an affinity
environment).  I can be convinced otherwise, but for now I think this is
"ok". 

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Steven Rostedt <rostedt@goodmis.org>
---

 kernel/sched.c    |    6 +++-
 kernel/sched_rt.c |   75 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 72 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5899c29..6971ee9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -164,7 +164,8 @@ static inline int task_has_rt_policy(struct task_struct *p)
  */
 struct rt_prio_array {
 	DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
-	struct list_head queue[MAX_RT_PRIO];
+	struct list_head xqueue[MAX_RT_PRIO]; /* exclusive queue */
+	struct list_head squeue[MAX_RT_PRIO];  /* shared queue */
 };
 
 struct rt_bandwidth {
@@ -8069,7 +8070,8 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 
 	array = &rt_rq->active;
 	for (i = 0; i < MAX_RT_PRIO; i++) {
-		INIT_LIST_HEAD(array->queue + i);
+		INIT_LIST_HEAD(array->xqueue + i);
+		INIT_LIST_HEAD(array->squeue + i);
 		__clear_bit(i, array->bitmap);
 	}
 	/* delimiter for bitsearch: */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ae6b0e6..75cd8e4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -470,7 +470,13 @@ static void enqueue_rt_entity(struct sched_rt_entity *rt_se)
 	if (group_rq && rt_rq_throttled(group_rq))
 		return;
 
-	list_add_tail(&rt_se->run_list, array->queue + rt_se_prio(rt_se));
+	if (rt_se->nr_cpus_allowed == 1)
+		list_add_tail(&rt_se->run_list,
+			      array->xqueue + rt_se_prio(rt_se));
+	else
+		list_add_tail(&rt_se->run_list,
+			      array->squeue + rt_se_prio(rt_se));
+
 	__set_bit(rt_se_prio(rt_se), array->bitmap);
 
 	inc_rt_tasks(rt_se, rt_rq);
@@ -482,7 +488,8 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
 	struct rt_prio_array *array = &rt_rq->active;
 
 	list_del_init(&rt_se->run_list);
-	if (list_empty(array->queue + rt_se_prio(rt_se)))
+	if (list_empty(array->squeue + rt_se_prio(rt_se))
+	    && list_empty(array->xqueue + rt_se_prio(rt_se)))
 		__clear_bit(rt_se_prio(rt_se), array->bitmap);
 
 	dec_rt_tasks(rt_se, rt_rq);
@@ -562,13 +569,19 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
 /*
  * Put task to the end of the run list without the overhead of dequeue
  * followed by enqueue.
+ *
+ * Note: We always enqueue the task to the shared-queue, regardless of its
+ * previous position w.r.t. exclusive vs shared.  This is so that exclusive RR
+ * tasks fairly round-robin with all tasks on the runqueue, not just other
+ * exclusive tasks.
  */
 static
 void requeue_rt_entity(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
 {
 	struct rt_prio_array *array = &rt_rq->active;
 
-	list_move_tail(&rt_se->run_list, array->queue + rt_se_prio(rt_se));
+	list_del_init(&rt_se->run_list);
+	list_add_tail(&rt_se->run_list, array->squeue + rt_se_prio(rt_se));
 }
 
 static void requeue_task_rt(struct rq *rq, struct task_struct *p)
@@ -626,13 +639,46 @@ static int select_task_rq_rt(struct task_struct *p, int sync)
 }
 #endif /* CONFIG_SMP */
 
+static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
+						   struct rt_rq *rt_rq);
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p)
 {
-	if (p->prio < rq->curr->prio)
+	if (p->prio < rq->curr->prio) {
 		resched_task(rq->curr);
+		return;
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * If:
+	 *
+	 * - the newly woken task is of equal priority to the current task
+	 * - the newly woken task is non-migratable while current is migratable
+	 * - current will be preempted on the next reschedule
+	 *
+	 * we should check to see if current can readily move to a different
+	 * cpu.  If so, we will reschedule to allow the push logic to try
+	 * to move current somewhere else, making room for our non-migratable
+	 * task.
+	 */
+	if((p->prio == rq->curr->prio)
+	   && p->rt.nr_cpus_allowed == 1
+	   && rq->curr->rt.nr_cpus_allowed != 1
+	   && pick_next_rt_entity(rq, &rq->rt) != &rq->curr->rt) {
+		cpumask_t mask;
+
+		if (cpupri_find(&rq->rd->cpupri, rq->curr, &mask))
+			/*
+			 * There appears to be other cpus that can accept
+			 * current, so lets reschedule to try and push it away
+			 */
+			resched_task(rq->curr);
+	}
+#endif
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -646,8 +692,15 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
 	idx = sched_find_first_bit(array->bitmap);
 	BUG_ON(idx >= MAX_RT_PRIO);
 
-	queue = array->queue + idx;
-	next = list_entry(queue->next, struct sched_rt_entity, run_list);
+	queue = array->xqueue + idx;
+	if (!list_empty(queue))
+		next = list_entry(queue->next, struct sched_rt_entity,
+				  run_list);
+	else {
+		queue = array->squeue + idx;
+		next = list_entry(queue->next, struct sched_rt_entity,
+				  run_list);
+	}
 
 	return next;
 }
@@ -717,7 +770,7 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu)
 			continue;
 		if (next && next->prio < idx)
 			continue;
-		list_for_each_entry(rt_se, array->queue + idx, run_list) {
+		list_for_each_entry(rt_se, array->squeue + idx, run_list) {
 			struct task_struct *p = rt_task_of(rt_se);
 			if (pick_rt_task(rq, p, cpu)) {
 				next = p;
@@ -1110,6 +1163,14 @@ static void set_cpus_allowed_rt(struct task_struct *p,
 		}
 
 		update_rt_migration(rq);
+
+		if (unlikely(weight == 1 || p->rt.nr_cpus_allowed == 1))
+			/*
+			 * If either the new or old weight is a "1", we need
+			 * to requeue to properly move between shared and
+			 * exclusive queues.
+			 */
+			requeue_task_rt(rq, p);
 	}
 
 	p->cpus_allowed    = *new_mask;



* Re: [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added
  2008-04-23 10:54               ` Gregory Haskins
@ 2008-04-23 11:20                 ` Dmitry Adamushko
  0 siblings, 0 replies; 24+ messages in thread
From: Dmitry Adamushko @ 2008-04-23 11:20 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, Steven Rostedt, chinang.ma, suresh.b.siddha, arjan, willy,
	linux-kernel, linux-rt-users

2008/4/23 Gregory Haskins <ghaskins@novell.com>:
> >>> On Wed, Apr 23, 2008 at  6:23 AM, in message
>  <b647ffbd0804230323s20d25697qcf80ca9996e143a7@mail.gmail.com>, "Dmitry
>
> Adamushko" <dmitry.adamushko@gmail.com> wrote:
>
> > 2008/4/23 Gregory Haskins <ghaskins@novell.com>:
>  >> [ ... ]
>  >>
>  >>  I think we can simplify this further.  We really only need to push here if
>  > we are not going to reschedule anytime soon (probably white-space damaged):
>  >>
>  >>
>  >>  --- a/kernel/sched_rt.c
>  >>
>  >> +++ b/kernel/sched_rt.c
>  >>  @@ -1058,11 +1058,14 @@ static void post_schedule_rt(struct rq *rq)
>  >>         }
>  >>   }
>  >>
>  >>  -
>  >>  +/*
>  >>  + * If we are not running and we are not going to reschedule soon, we
>  > should
>  >>  + * try to push tasks away now
>  >>  + */
>  >>
>  >>  static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
>  >>   {
>  >>         if (!task_running(rq, p) &&
>  >>  -           (p->prio >= rq->rt.highest_prio) &&
>  >>  +           !test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED) &&
>  >>             rq->rt.overloaded)
>  >>                 push_rt_tasks(rq);
>  >>   }
>  >
>  >
>  > It's somewhat suboptimal as it doesn't guarantee that 'p' gets control next.
>
>  No, it doesn't, but I don't see that as a requirement for it to work properly.  All we really need is to make sure push_rt_tasks() will be executed in the near future, and a reschedule will do that.  Perhaps I am missing something?

No, it is indeed better to delay push_rt_tasks() until
post_schedule_rt(), when 'current' (that is, the task about to be
preempted) is also available for potential migration.

Not considering "current" (as it is running) in task_wake_up_rt() is
the key to the problem which I illustrated in my first message.

I mean, we set rq->rt.pushed = 1, which effectively means "nothing left
to push off this cpu any more", but "current" was never taken into
consideration.

Hmm... a quick idea (I should think more on it):

(1) select_task_rq_rt() + task_wake_up_rt() should consider only 'p'
(the newly woken-up task) for pushing off this cpu;

i.e. task_wake_up_rt() calls push_rt_task(..., p) and _never_ sets
rq->rt.pushed;

the main point where 'pushed' is set:
(2) post_schedule_rt()

here we call push_rt_tasks() and set rq->rt.pushed to 1.

Heh, I may be missing something important... I need to verify different
scenarios wrt this algorithm.
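
To pin the idea down, a rough user-space sketch of that control flow
(push_one() and push_all() are hypothetical stand-ins for the kernel's
push logic, and struct rq here is reduced to the two flags that matter
for the illustration):

/*
 * User-space sketch of the proposed split: the wake-up path only
 * considers the newly woken task 'p' and never touches the "pushed"
 * state, while post_schedule is the single place that does a full
 * push and marks the runqueue as "pushed".
 */
#include <stdbool.h>
#include <stdio.h>

struct rq {
	bool pushed;      /* "nothing left to push off this cpu"        */
	bool overloaded;  /* more queued RT tasks than this cpu can run */
};

static void push_one(struct rq *rq, const char *p)
{
	printf("try to push %s only (rq->pushed stays %d)\n", p, rq->pushed);
}

static void push_all(struct rq *rq)
{
	printf("push_rt_tasks(): push everything we can\n");
	rq->pushed = true;
}

/* (1) wake-up: consider only the woken task, never set rq->pushed */
static void task_wake_up_rt(struct rq *rq, const char *p)
{
	if (rq->overloaded)
		push_one(rq, p);
}

/* (2) post-schedule: the one place where a full push happens and
 *     the "pushed" dirty bit is set */
static void post_schedule_rt(struct rq *rq)
{
	if (rq->overloaded && !rq->pushed)
		push_all(rq);
}

int main(void)
{
	struct rq rq = { .pushed = false, .overloaded = true };

	task_wake_up_rt(&rq, "T1");   /* only T1 is considered here    */
	post_schedule_rt(&rq);        /* full push, sets rq.pushed     */
	post_schedule_rt(&rq);        /* no-op: nothing new was queued */

	return 0;
}

The point of the split is that a wake-up never marks the queue as fully
pushed, so the dirty-bit optimisation cannot hide 'current' from the
post-schedule push.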



>  > e.g. 2 tasks (T0 and T1) have been woken up before an actual
>  > re-schedule takes place. Even if T1 is of lower prio than T0,
>  > task_wake_up_rt() will see the NEED_RESCHED flag and bail out while it
>  > would make sense at this moment to push T1 off this cpu.
>
>  I think this is "ok".  IMO, we really want to prioritize the push operation from post_schedule() over task_wake_up() because it is the most accurate (since the scheduling decision is made atomically under the rq->lock).

Agreed. And as I noted above, the ex-current task is also available to
the migration logic at this point, and the most eligible task has just
got control (it is the only task that should not be considered for
migration).


-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] sched: fix RT task-wakeup logic
  2008-04-23 11:13         ` [PATCH 1/2] sched: fix RT task-wakeup logic Gregory Haskins
@ 2008-04-23 12:54           ` Steven Rostedt
  2008-04-23 14:29           ` Dmitry Adamushko
  2008-04-24 11:56           ` Gregory Haskins
  2 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2008-04-23 12:54 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Dmitry Adamushko, linux-kernel, linux-rt-users


On Wed, 23 Apr 2008, Gregory Haskins wrote:

> Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where the
> check will always evaluate to "true".  You can find the thread here:
>
> http://lkml.org/lkml/2008/4/22/296
>
> In reality, we only want to try to push tasks away when a wake-up request is
> not going to preempt the current task.  So let's fix it.
>
> Note: We introduce test_tsk_need_resched() instead of open-coding the flag
> check so that the merge-conflict with -rt should help remind us that we
> may need to support NEEDS_RESCHED_DELAYED in the future, too.
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Steven Rostedt <srostedt@redhat.com>

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] sched: prioritize non-migratable tasks over migratable ones
  2008-04-23 11:13         ` [PATCH 2/2] sched: prioritize non-migratable tasks over migratable ones Gregory Haskins
@ 2008-04-23 12:58           ` Steven Rostedt
  2008-04-23 13:11             ` [PATCH 2/2] sched: prioritize non-migratable tasks over migratableones Gregory Haskins
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-04-23 12:58 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Dmitry Adamushko, linux-kernel, linux-rt-users



On Wed, 23 Apr 2008, Gregory Haskins wrote:

>
> There are several details for utilizing this properly.
>
> 1) During task wake-up, we not only need to check whether the priority
>    preempts the current task, but we also need to check for this
>    non-migratable condition.  Therefore, if a non-migratable task wakes
>    up and sees an equal-priority migratable task already running, it
>    will attempt to preempt it *if* there is a likelihood that the
>    current task will find an immediate home.
>
> 2) Tasks only get this non-migratable "priority boost" on wake-up.  Any
>    requeuing will result in the non-migratable task being queued to the
>    end of the shared queue.  This is an attempt to prevent the system
>    from being completely unfair to migratable tasks during things like
>    SCHED_RR timeslicing.
>
> I am sure this patch introduces potentially "odd" behavior if you
> concoct a scenario where a bunch of non-migratable threads could starve
> migratable ones given the right pattern.  I am not yet convinced that
> this is a problem, since we are talking about tasks of equal RT priority
> anyway, and there is never much in the way of guarantees against
> starvation under that scenario.  (E.g. you could come up with a
> similar scenario with a specific timing environment versus an affinity
> environment.)  I can be convinced otherwise, but for now I think this is
> "ok".
>
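
To make point 1) of the quoted description concrete, here is a small
user-space sketch of the wake-up preemption decision (should_preempt()
and can_move_curr_elsewhere are illustrative names; in the posted patch
the "can current move" question is answered by cpupri_find(), and there
is an additional check on what the next pick would be):

/*
 * User-space model of the wake-up preemption decision: a pinned task of
 * equal priority may displace a migratable current, but only if current
 * has somewhere else it can readily go.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	int prio;             /* lower value = higher priority */
	int nr_cpus_allowed;  /* 1 => pinned to this cpu       */
};

static bool should_preempt(const struct task *woken, const struct task *curr,
			   bool can_move_curr_elsewhere)
{
	/* Classic rule: strictly higher priority always preempts. */
	if (woken->prio < curr->prio)
		return true;

	/*
	 * Equal priority: let a pinned task displace a migratable current,
	 * provided the push logic has a realistic target cpu for current.
	 */
	if (woken->prio == curr->prio &&
	    woken->nr_cpus_allowed == 1 &&
	    curr->nr_cpus_allowed > 1 &&
	    can_move_curr_elsewhere)
		return true;

	return false;
}

int main(void)
{
	struct task pinned     = { .prio = 50, .nr_cpus_allowed = 1 };
	struct task migratable = { .prio = 50, .nr_cpus_allowed = 4 };

	/* Pinned task wakes, current can move: reschedule to make room. */
	printf("%d\n", should_preempt(&pinned, &migratable, true));   /* 1 */

	/* No other cpu would take current: leave things alone. */
	printf("%d\n", should_preempt(&pinned, &migratable, false));  /* 0 */

	return 0;
}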

I'm not giving an Ack on this patch (not yet). My main concern here is
that this verges on a policy change. As you stated, this is a known
issue, one that can be avoided by giving RT tasks different priorities.
It is known that two tasks with the same priority fighting for the same
CPUs (even if one is migratable and one is not) behave
non-deterministically.

This patch needs to be run through the grinder a bit to see what effects
it has before going in.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] sched: prioritize non-migratable tasks over migratableones
  2008-04-23 12:58           ` Steven Rostedt
@ 2008-04-23 13:11             ` Gregory Haskins
  0 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-04-23 13:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Dmitry Adamushko, linux-kernel, linux-rt-users

>>> On Wed, Apr 23, 2008 at  8:58 AM, in message
<Pine.LNX.4.58.0804230854210.23815@gandalf.stny.rr.com>, Steven Rostedt
<rostedt@goodmis.org> wrote: 


> 
> I'm not giving an Ack on this patch (not yet). My main concern here is
> that this verges on a policy change. As you stated, this is a known
> issue, one that can be avoided by giving RT tasks different priorities.
> It is known that two tasks with the same priority fighting for the same
> CPUs (even if one is migratable and one is not) behave
> non-deterministically.
> 
> This patch needs to be run through the grinder a bit to see what effects
> it has before going in.

Perfectly reasonable stance, Steve.  Thanks for the review!

-Greg


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] sched: fix RT task-wakeup logic
  2008-04-23 11:13         ` [PATCH 1/2] sched: fix RT task-wakeup logic Gregory Haskins
  2008-04-23 12:54           ` Steven Rostedt
@ 2008-04-23 14:29           ` Dmitry Adamushko
  2008-04-24 11:56           ` Gregory Haskins
  2 siblings, 0 replies; 24+ messages in thread
From: Dmitry Adamushko @ 2008-04-23 14:29 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Ingo Molnar, Steven Rostedt, linux-kernel, linux-rt-users

2008/4/23 Gregory Haskins <ghaskins@novell.com>:
> Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where the
>  check will always evaluate to "true".  You can find the thread here:
>
>  http://lkml.org/lkml/2008/4/22/296
>
>  In reality, we only want to try to push tasks away when a wake-up request is
>  not going to preempt the current task.  So let's fix it.
>
>  Note: We introduce test_tsk_need_resched() instead of open-coding the flag
>  check so that the merge-conflict with -rt should help remind us that we
>  may need to support NEEDS_RESCHED_DELAYED in the future, too.
>
>  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>  CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>

Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>


I have to look at the second patch more thoroughly... but I guess we'd
be better off opting for something simpler/less complex (it looks a bit
heavy at first glance).

This patch already fixes an 'obvious' problem in task_wake_up_rt() and
narrows down the scope of the original problem.

For the original scenario (as well as for its variants with SCHED_FIFO
and != priorities) push_rt_tasks() is 'delayed' until a reschedule
(which is already pending and should take place shortly... provided
'bounded/short' latencies :-)

For T0 and T1 having equal prios + SCHED_RR, T1 (which can't be migrated
due to its affinity) will wait until T0's timeslice expires,
and for SCHED_FIFO -- until T0 releases the CPU.

The 'optimal' way would be to move T0 off this cpu immediately.


-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] sched: fix RT task-wakeup logic
  2008-04-23 11:13         ` [PATCH 1/2] sched: fix RT task-wakeup logic Gregory Haskins
  2008-04-23 12:54           ` Steven Rostedt
  2008-04-23 14:29           ` Dmitry Adamushko
@ 2008-04-24 11:56           ` Gregory Haskins
  2008-04-28 16:30             ` [(RESEND) PATCH] " Gregory Haskins
  2 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-04-24 11:56 UTC (permalink / raw)
  To: Ingo Molnar, Gregory Haskins
  Cc: Dmitry Adamushko, Steven Rostedt, linux-kernel, linux-rt-users

>>> On Wed, Apr 23, 2008 at  7:13 AM, in message
<20080423111329.4981.88455.stgit@novell1.haskins.net>, Gregory Haskins
<ghaskins@novell.com> wrote: 
> Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where the
> check will always evaluate to "true".  You can find the thread here:
> 
> http://lkml.org/lkml/2008/4/22/296
> 
> In reality, we only want to try to push tasks away when a wake-up request is
> not going to preempt the current task.  So let's fix it.
> 
> Note: We introduce test_tsk_need_resched() instead of open-coding the flag
> check so that the merge-conflict with -rt should help remind us that we
> may need to support NEEDS_RESCHED_DELAYED in the future, too.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Steven Rostedt <rostedt@goodmis.org>


Hi Ingo,
  So based on the discussion and the acks on this patch from Dmitry and Steven, I would propose moving patch #1 from "RFC" to "consider for inclusion" status.  Patch #2 should remain RFC pending further discussion.  I would recommend that #1 be considered both for 25.y stable and for the 26 merge window (pending positive testing results, and your discretion).

Regards,
-Greg


> ---
> 
>  include/linux/sched.h |    7 ++++++-
>  kernel/sched_rt.c     |    7 +++++--
>  2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 949bd54..55d4bd4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1966,6 +1966,11 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
>  	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>  }
>  
> +static inline int test_tsk_need_resched(struct task_struct *tsk)
> +{
> +	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
> +}
> +
>  static inline int signal_pending(struct task_struct *p)
>  {
>  	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
> @@ -1980,7 +1985,7 @@ static inline int fatal_signal_pending(struct task_struct *p)
>  
>  static inline int need_resched(void)
>  {
> -	return unlikely(test_thread_flag(TIF_NEED_RESCHED));
> +	return unlikely(test_tsk_need_resched(current));
>  }
>  
>  /*
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 79e5ad5..ae6b0e6 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -1058,11 +1058,14 @@ static void post_schedule_rt(struct rq *rq)
>  	}
>  }
>  
> -
> +/*
> + * If we are not running and we are not going to reschedule soon, we should
> + * try to push tasks away now
> + */
>  static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
>  {
>  	if (!task_running(rq, p) &&
> -	    (p->prio >= rq->rt.highest_prio) &&
> +	    !test_tsk_need_resched(rq->curr) &&
>  	    rq->rt.overloaded)
>  		push_rt_tasks(rq);
>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [(RESEND) PATCH] sched: fix RT task-wakeup logic
  2008-04-24 11:56           ` Gregory Haskins
@ 2008-04-28 16:30             ` Gregory Haskins
  2008-04-29 14:35               ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-04-28 16:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Dmitry Adamushko, Gregory Haskins, linux-kernel,
	linux-rt-users

Hi Ingo,
  I'm not sure if this patch fell through the cracks, or if you didn't like it
  for some reason.  If you can, please ACK/NACK it so I know whether you have
  reviewed it.  Thanks!

(This applies to sched-devel, and should be considered a bug fix for 25.y and
the 26 merge window)

-Greg

------------------------------
sched: fix RT task-wakeup logic

Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where the
check will always evaluate to "true".  You can find the thread here:

http://lkml.org/lkml/2008/4/22/296

In reality, we only want to try to push tasks away when a wake-up request is
not going to preempt the current task.  So let's fix it.

Note: We introduce test_tsk_need_resched() instead of open-coding the flag
check so that the merge-conflict with -rt should help remind us that we
may need to support NEEDS_RESCHED_DELAYED in the future, too.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
---

 include/linux/sched.h |    7 ++++++-
 kernel/sched_rt.c     |    7 +++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 949bd54..55d4bd4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1966,6 +1966,11 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
 	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
 }
 
+static inline int test_tsk_need_resched(struct task_struct *tsk)
+{
+	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
+}
+
 static inline int signal_pending(struct task_struct *p)
 {
 	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
@@ -1980,7 +1985,7 @@ static inline int fatal_signal_pending(struct task_struct *p)
 
 static inline int need_resched(void)
 {
-	return unlikely(test_thread_flag(TIF_NEED_RESCHED));
+	return unlikely(test_tsk_need_resched(current));
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 79e5ad5..ae6b0e6 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1058,11 +1058,14 @@ static void post_schedule_rt(struct rq *rq)
 	}
 }
 
-
+/*
+ * If we are not running and we are not going to reschedule soon, we should
+ * try to push tasks away now
+ */
 static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
 {
 	if (!task_running(rq, p) &&
-	    (p->prio >= rq->rt.highest_prio) &&
+	    !test_tsk_need_resched(rq->curr) &&
 	    rq->rt.overloaded)
 		push_rt_tasks(rq);
 }

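For completeness, a small stand-alone sketch of the fixed wake-up
condition (the struct rq fields and helpers below are simplified
stand-ins for rq->rt.overloaded, task_running() and
test_tsk_need_resched(); this is an illustration, not kernel code):

/*
 * User-space sketch of the fixed wake-up condition: only try to push
 * when no reschedule is already pending (the wake-up did not preempt
 * current) and the runqueue is RT-overloaded.
 */
#include <stdbool.h>
#include <stdio.h>

struct rq {
	bool curr_need_resched;  /* TIF_NEED_RESCHED set on rq->curr  */
	bool rt_overloaded;      /* more runnable RT tasks than cpus  */
};

static bool should_push_on_wakeup(const struct rq *rq, bool p_is_running)
{
	return !p_is_running &&
	       !rq->curr_need_resched &&
	       rq->rt_overloaded;
}

int main(void)
{
	struct rq rq = { .curr_need_resched = false, .rt_overloaded = true };

	/* Woken task will not preempt current: push it away now. */
	printf("%d\n", should_push_on_wakeup(&rq, false));   /* 1 */

	/* A reschedule is already pending: defer to post_schedule_rt(). */
	rq.curr_need_resched = true;
	printf("%d\n", should_push_on_wakeup(&rq, false));   /* 0 */

	return 0;
}

When a reschedule is already pending, the push is simply deferred to
post_schedule_rt(), which runs right after the context switch with the
preempted task also available for migration.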
^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/2] sched fixes for suboptimal balancing
  2008-04-23 11:13       ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Gregory Haskins
  2008-04-23 11:13         ` [PATCH 1/2] sched: fix RT task-wakeup logic Gregory Haskins
  2008-04-23 11:13         ` [PATCH 2/2] sched: prioritize non-migratable tasks over migratable ones Gregory Haskins
@ 2008-04-28 18:55         ` Ingo Molnar
  2 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2008-04-28 18:55 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Dmitry Adamushko, Steven Rostedt, linux-kernel, linux-rt-users


* Gregory Haskins <ghaskins@novell.com> wrote:

> Hi Ingo,
> 
> Here are my proposed fixes for the issue Dmitry has highlighted in:
> 
> http://lkml.org/lkml/2008/4/22/296
> 
> This is an RFC, and it applies to sched-devel.

thanks Gregory - picked them up for more testing.

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [(RESEND) PATCH] sched: fix RT task-wakeup logic
  2008-04-28 16:30             ` [(RESEND) PATCH] " Gregory Haskins
@ 2008-04-29 14:35               ` Ingo Molnar
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2008-04-29 14:35 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Steven Rostedt, Dmitry Adamushko, linux-kernel, linux-rt-users


* Gregory Haskins <ghaskins@novell.com> wrote:

> Hi Ingo,
>   I'm not sure if this patch fell through the cracks, or if you didn't
>   like it for some reason.  If you can, please ACK/NACK it so I know
>   whether you have reviewed it.  Thanks!

Your patch was in the testing phase, and it is now in
sched-devel.git/latest.  Twenty pull requests down the line, sched-devel
fell behind a bit ;-)

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-04-29 14:35 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-21 18:10 [PATCH 0/2] sched: refreshes Gregory Haskins
2008-04-21 18:10 ` [PATCH 1/2] sched: push rt tasks only if newly activated tasks have been added Gregory Haskins
2008-04-22 15:30   ` Dmitry Adamushko
2008-04-22 15:51     ` Steven Rostedt
2008-04-23  8:05       ` Dmitry Adamushko
2008-04-23  9:53         ` Dmitry Adamushko
2008-04-23 10:03           ` Gregory Haskins
2008-04-23 10:23             ` Dmitry Adamushko
2008-04-23 10:54               ` Gregory Haskins
2008-04-23 11:20                 ` Dmitry Adamushko
2008-04-22 16:38     ` Gregory Haskins
2008-04-23 11:13       ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Gregory Haskins
2008-04-23 11:13         ` [PATCH 1/2] sched: fix RT task-wakeup logic Gregory Haskins
2008-04-23 12:54           ` Steven Rostedt
2008-04-23 14:29           ` Dmitry Adamushko
2008-04-24 11:56           ` Gregory Haskins
2008-04-28 16:30             ` [(RESEND) PATCH] " Gregory Haskins
2008-04-29 14:35               ` Ingo Molnar
2008-04-23 11:13         ` [PATCH 2/2] sched: prioritize non-migratable tasks over migratable ones Gregory Haskins
2008-04-23 12:58           ` Steven Rostedt
2008-04-23 13:11             ` [PATCH 2/2] sched: prioritize non-migratable tasks over migratableones Gregory Haskins
2008-04-28 18:55         ` [RFC PATCH 0/2] sched fixes for suboptimal balancing Ingo Molnar
2008-04-21 18:10 ` [PATCH 2/2] sched: Use a 2-d bitmap for searching lowest-pri CPU Gregory Haskins
2008-04-21 19:33 ` [PATCH 0/2] sched: refreshes Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).